nieldlr / hanzi

HanziJS is a Chinese character and NLP module for Chinese language processing for Node.js
http://hanzijs.com
MIT License
376 stars, 56 forks

Add a basic longest match based segmenter function #24

Closed nikvdp closed 9 years ago

nikvdp commented 9 years ago

Added a basic longest-match segmenter, and made a few tweaks to package.json so that hanzi.js can be used in the browser via browserify.

I used CoffeeScript, but if you don't want to include CoffeeScript as a dependency, let me know and I'd be happy to rewrite it in plain JS!

nieldlr commented 9 years ago

Hi @nikvdp,

Thanks so much for this awesome feature PR! Segmentation has been on my mind for a while now. Initially I thought it might be trickier to implement, but you've just shown a very cool, simple method here. Although I think Chinese sentence segmentation can be quite complex (an example of such a guide: http://www.cis.upenn.edu/~chinese/segguide.3rd.ch.pdf), I do feel that giving developers/users different options is perfect for a module like Hanzi!

I'm super keen to add this one in!

My thinking on CoffeeScript would be to keep the module consistently in plain JS. It feels a bit odd to me to mix the two in one repo. What do you think? Would you be OK if we made this plain JS only? Keen to hear your thoughts here.

Also, nice job on the browserify add-on! I'd love to hear how that works. I use browserify at work with other front-end packages, but how does it work in this case, where we are pulling in big dictionary files? Do these get bundled with the package?

Thanks again for all your time here @nikvdp. Sorry for the late reply on this one. :)

nikvdp commented 9 years ago

Hey @nieldlr,

Glad to hear back from you! Yup, segmentation in Chinese can get pretty crazy. This approach isn't the most accurate, but it's good enough for most purposes. There's also the mmseg algorithm (and even a Node.js version!), but in my extremely informal tests this longest-match approach seemed to do just as well and was a whole lot easier to implement/integrate. Totally agree about having more choices though! Maybe we can implement a few other segmenters going forward.
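For anyone following along, here is a rough sketch of the general longest-match idea (greedy forward matching against a word list). It is not the code from this PR: the function name `longestMatchSegment`, the `maxWordLength` parameter, and the tiny dictionary are made up for illustration.

```javascript
// Hypothetical mini-dictionary; a real segmenter would use full dictionary data.
const dictionary = new Set(['我', '我们', '喜欢', '中文', '中', '文', '学习']);

// Greedy longest-match segmentation: at each position, take the longest
// dictionary word that starts there; fall back to a single character.
function longestMatchSegment(text, dict, maxWordLength) {
  const chars = Array.from(text); // split by code point, not UTF-16 unit
  const segments = [];
  let i = 0;
  while (i < chars.length) {
    let matchLength = 1; // unknown characters become their own segment
    const limit = Math.min(maxWordLength, chars.length - i);
    for (let len = limit; len > 1; len--) {
      if (dict.has(chars.slice(i, i + len).join(''))) {
        matchLength = len;
        break;
      }
    }
    segments.push(chars.slice(i, i + matchLength).join(''));
    i += matchLength;
  }
  return segments;
}

console.log(longestMatchSegment('我们喜欢学习中文', dictionary, 4));
```

Because the match is greedy, it can pick a long word that turns out to be wrong in context, which is where approaches like mmseg try to do better.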

As for CoffeeScript, yeah, I agree. I love CoffeeScript and find it much quicker to play around with, but mixed repos can be confusing. I'll get to work on converting this to vanilla JS!

For browserify support I didn't have to change much. I just added brfs to package.json, which causes browserify to inline all the dictionary and data files and produce one monstrous bundle.js file with everything inside. It's pretty heavy for downloading (7-9 MB if I remember correctly), but I was looking at using this in a Chrome extension, and it works well that way.
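To illustrate the kind of code brfs can inline: it statically rewrites `fs.readFileSync(__dirname + '/...', 'utf8')` calls into string literals at bundle time, typically with brfs listed under the `browserify.transform` field of package.json. The sketch below assumes the data is read that way; the file path `data/dict.txt`, the tab-separated format, and the function names are hypothetical, not taken from this repo.

```javascript
const fs = require('fs');

// Parse "word<TAB>definition" lines into a Map.
// The tab-separated format is an assumption made for this example.
function parseDictionary(raw) {
  const entries = new Map();
  for (const line of raw.split('\n')) {
    if (!line.trim()) continue; // skip blank lines
    const [word, definition] = line.split('\t');
    entries.set(word, definition);
  }
  return entries;
}

// Hypothetical loader. With the brfs transform enabled, browserify replaces
// this readFileSync call with the file's contents as a string, so the
// bundled code never touches the filesystem in the browser.
function loadDictionary() {
  const raw = fs.readFileSync(__dirname + '/data/dict.txt', 'utf8');
  return parseDictionary(raw);
}
```

The path has to be statically analyzable (built from `__dirname` and string literals) for the inlining to work, which is also why every data file ends up inside the bundle.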

nikvdp commented 9 years ago

OK, converted to plain JS! Let me know if there's anything else you'd like me to change.

nieldlr commented 9 years ago

Woop! This looks great @nikvdp :) I'm merging this in and bumping the repo version soon.

Yeah, the mmseg implementations look very interesting. There's also this one I found: node-segment. Will definitely take a look at these in the future to build out the segmenting options!

Thanks for explaining the browserify build there. I remember playing around with PhoneGap ages ago to make HanziCraft available as an offline mobile app. I swapped the file-reading sections for jQuery Ajax calls to local files, and that seemed to work. Might be something to think about if you want to use it on the web with a smaller footprint :)