louismullie / treat

Natural language processing framework for Ruby.

Is it possible to add words or correct words in the dictionary (or document it)? #93

Open ojak opened 9 years ago

ojak commented 9 years ago

There are words that are missing or mis-identified by the language parser. Is there a way to add a word to the parsing dictionary? If not, what would be the best way to handle such cases?

For example, with default settings, the word _spicy_ is tagged as FW (foreign word):

> sentence("This is a spicy pepper.").apply(:tokenize, :category).words[3]
=> Word (70319169769500)  --- "spicy"  ---  {:tag=>"FW", :category=>"unknown"}   --- []

louismullie commented 9 years ago

The best way would be to build a custom dictionary and search/replace for the specific words. Currently, you're using the default tagger (which is :lingua). You could also try alternate taggers (:brill or :stanford). The specifics of each tagger are abstracted away from the interface, so adding a word to the parsing dictionary would require creating a base class for the taggers (https://github.com/louismullie/treat/tree/master/lib/treat/workers/lexicalizers/taggers) that handles an :override_tags option, and plugging it into the initialize methods of the child classes.
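
A minimal sketch of the override idea in plain Ruby, independent of Treat. Every name here (`OverridableTagger`, `StubTagger`, `raw_tag`, the `override_tags:` keyword) is hypothetical and not part of Treat's API; the stub tagger only stands in for a real one and uses Penn Treebank-style tags, reproducing the FW mis-tag for "spicy" from the report. How corrected tags would be written back onto Treat's word entities is not shown.

```ruby
# Hypothetical base class; Treat does not ship anything with this name.
class OverridableTagger
  # override_tags maps a word to the tag it should always receive.
  # Keys are normalized to lowercase strings so lookups are case-insensitive.
  def initialize(override_tags: {})
    @override_tags = override_tags.transform_keys { |w| w.to_s.downcase }
  end

  # Returns [word, tag] pairs. The custom dictionary wins; otherwise the
  # word falls through to the underlying tagger via raw_tag.
  def tag(words)
    words.map do |word|
      [word, @override_tags.fetch(word.downcase) { raw_tag(word) }]
    end
  end

  private

  # Child classes wrap a real tagger here.
  def raw_tag(_word)
    raise NotImplementedError, "#{self.class} must implement raw_tag"
  end
end

# Stand-in for a real tagger: knows a few words and tags the rest as FW.
class StubTagger < OverridableTagger
  STUB_TAGS = {
    'this' => 'DT', 'is' => 'VBZ', 'a' => 'DT', 'pepper' => 'NN'
  }.freeze

  private

  def raw_tag(word)
    STUB_TAGS.fetch(word.downcase, 'FW')
  end
end

tagger = StubTagger.new(override_tags: { 'spicy' => 'JJ' })
p tagger.tag(%w[This is a spicy pepper])
```

With the override in place, "spicy" comes back tagged JJ instead of FW, while the other words keep whatever the wrapped tagger returns. Keeping the dictionary in the base class means each child tagger only has to implement the lookup against its own backend.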

ojak commented 9 years ago

OK, thanks. I'll look into that approach and let you know how it goes.