mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

Combining features from bulk-load and import-file #180

Open AADeLucia opened 4 years ago

AADeLucia commented 4 years ago

Some features are available in bulk-load but not in import-file, and vice versa:

I find these features very handy. Are there any plans to combine some of the features?

mimno commented 4 years ago

Good question. Adding a vocabulary-builder step that doesn't write instance files might make pruning easier for very large data sets. Leaving out regex support is a big part of what made bulk-load fast, though that may have changed. For stopwords, you can always start with the default English list and add to it for bulk-load.
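To make the vocabulary-builder idea concrete, here is a rough sketch of a counting pass that records document frequencies and then prunes, without writing any instance files. It is not MALLET code and uses no MALLET API: the class `VocabularySketch`, the methods `documentFrequencies` and `prune`, the thresholds, and the whitespace tokenization are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a standalone vocabulary pass, not part of MALLET.
public class VocabularySketch {

    // Count how many documents each token appears in.
    // Assumes one document per string and plain whitespace tokenization (no regexes).
    static Map<String, Integer> documentFrequencies(Iterable<String> documents) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : documents) {
            Set<String> seen = new HashSet<>();
            for (String token : doc.toLowerCase().split("\\s+")) {
                // seen.add returns false for repeats, so each document counts once per token
                if (!token.isEmpty() && seen.add(token)) {
                    df.merge(token, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    // Keep tokens that occur in at least minDocs documents, in at most
    // maxProportion of all documents, and are not in the stopword set.
    static Set<String> prune(Map<String, Integer> df, int numDocs, int minDocs,
                             double maxProportion, Set<String> stopwords) {
        Set<String> vocabulary = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : df.entrySet()) {
            int count = entry.getValue();
            boolean frequentEnough = count >= minDocs;
            boolean notTooCommon = count <= maxProportion * numDocs;
            if (frequentEnough && notTooCommon && !stopwords.contains(entry.getKey())) {
                vocabulary.add(entry.getKey());
            }
        }
        return vocabulary;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog sat", "the bird flew");
        Map<String, Integer> df = documentFrequencies(docs);
        Set<String> vocabulary = prune(df, docs.size(), 2, 0.9, Collections.<String>emptySet());
        System.out.println(vocabulary); // prints [sat]
    }
}
```

For a very large collection, the `Iterable<String>` would presumably be backed by a reader that streams one document per line, so only the frequency map is held in memory. The pruned vocabulary could then be applied in a second pass that writes the instances.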