japerk / nltk-trainer

Train NLTK objects with zero code
http://nltk-trainer.readthedocs.org/en/latest/
Apache License 2.0
747 stars 225 forks source link

Added support for ngrams of length > 2 #3

Closed kisabaka closed 13 years ago

kisabaka commented 13 years ago

Currently train_classifier only supports unigrams and bigrams. I added support for ngrams of length > 2 but kept the current behaviour of the --bigrams flag.

japerk commented 13 years ago

Thanks for this. The only thing I don't like is the nmin_one arg (though I do understand why you added it). But I think it'd be better to support multiple ngram flags, like --ngrams 1 --ngrams 2 --ngrams 3 ... Since argparse supports an "append" option, you up for adding that?

kisabaka commented 13 years ago

I've changed the action of ngrams to "store". Have a look at https://github.com/dougk7/nltk-trainer/blob/master/train_classifier.py#L189 and let me know what you think

japerk commented 13 years ago

Pretty slick, though I don't want to lose --bigrams. Maybe --bigrams implies --ngrams 1 --ngrams 2, unless --ngrams is already given. If --ngrams and --bigrams are both given, then it could either raise ValueError (which is something I do in the other train_*.py scripts when two or more args are incompatible) or --bigrams would be equivalent to --ngrams 2. Thoughts? And I'm happy to add --bigrams back in myself if you're not interested, though it won't be for at least a week.

japerk commented 13 years ago

Ok, you're already ahead of me updating the documentation :) And --ngrams 1 --ngrams 2 isn't that much extra typing compared to --bigrams, so you can ignore most of my previous comment. And thanks again!