A few bug fixes and enhancements for evaluating classifiers on large datasets

japerk / nltk-trainer

Train NLTK objects with zero code

http://nltk-trainer.readthedocs.org/en/latest/

Apache License 2.0

A few bug fixes and enhancements for evaluating classifiers on large datasets #17

Closed kecaps closed 10 years ago

kecaps commented 10 years ago

I found your nltk-trainer, and it worked great as a basis for comparing different classifiers for my project. While working with it, I fixed a few bugs and addressed some pain points I ran into with large datasets. I changed some code to use generators rather than lists for intermediate processing, refactored the code to read the dataset only once, and changed word feature scoring to use only the training set rather than the test set.
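
To illustrate the ideas described above, here is a minimal sketch in plain Python. It is not the code from this pull request or from nltk-trainer: the function names (`iter_instances`, `split_once`, `top_words`, `featurize`) are hypothetical, and simple word frequency stands in for whatever scoring function the trainer actually uses. It shows a single pass over the data, lazy intermediate steps, and a vocabulary chosen from the training split only.

```python
from collections import Counter
from itertools import chain


def iter_instances(labeled_docs):
    """Lazily yield (words, label) pairs instead of building a full list."""
    for text, label in labeled_docs:
        yield text.lower().split(), label


def split_once(instances, test_every=4):
    """Consume the instance stream a single time, sending every
    `test_every`-th instance to the test split and the rest to train."""
    train, test = [], []
    for i, inst in enumerate(instances):
        (test if i % test_every == test_every - 1 else train).append(inst)
    return train, test


def top_words(train, n):
    """Pick the n most frequent words using the training split only.
    Frequency is a stand-in scoring function (an assumption)."""
    counts = Counter(chain.from_iterable(words for words, _ in train))
    return {word for word, _ in counts.most_common(n)}


def featurize(instances, vocab):
    """Lazily yield (feature_dict, label) pairs restricted to vocab."""
    for words, label in instances:
        yield {w: True for w in words if w in vocab}, label


# Toy data, invented for illustration.
docs = [
    ("good great fun", "pos"),
    ("bad awful boring", "neg"),
    ("great fun good", "pos"),
    ("unseen testword bad", "neg"),
]

train, test = split_once(iter_instances(docs), test_every=4)
vocab = top_words(train, n=50)
test_feats = list(featurize(test, vocab))
```

Because `vocab` is built before the test split is touched, words that occur only in test documents (here `unseen` and `testword`) never become features, so the evaluation does not leak information from the test set into feature selection.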

fayimora commented 10 years ago

ping @japerk

japerk commented 10 years ago

Thanks for the updates @kecaps. If you have the time, I'd really appreciate more tests in tests/train_classifier.sh, especially for multi binary classifiers. The test script uses http://github.com/bmizerany/roundup to check script output. Also, any functions you want to extract for use elsewhere can be put in a module in the nltk_trainer package, or in one of its subpackages (like featx).