Combining features from bulk-load and import-file

mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

https://mimno.github.io/Mallet/

Combining features from bulk-load and import-file #180

Open AADeLucia opened 4 years ago

AADeLucia commented 4 years ago

There are features that are available in bulk-load that are not in import-file and vice versa:

bulk-load allows pruning (very handy)
import-file allows custom regex patterns
import-file allows extra stopwords

I find these features very handy. Are there any plans to combine some of the features?

mimno commented 4 years ago

Good question. Adding a vocabulary builder step that doesn't write instance files might make pruning easier for very large data sets. Not allowing regexes is a big part of what made bulk-load fast, though that may have changed. For stopwords, you can always start with the default English list and add your own words to it for bulk-load.
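
To make the two suggestions above concrete, here is a minimal standalone sketch in Python. It is not MALLET code and does not use MALLET's API: the function names `merge_stoplists` and `build_vocabulary`, their parameters, and the toy documents are all hypothetical, chosen only for illustration. It assumes one document per string, merges a default stopword list with extra words, then makes a single counting pass that tokenizes with a custom regex and prunes by document frequency, without writing any instance files.

```python
import re
from collections import Counter


def merge_stoplists(default_words, extra_words):
    """Return a sorted, deduplicated, lowercased union of two stopword lists."""
    merged = {w.strip().lower() for w in default_words}
    merged |= {w.strip().lower() for w in extra_words}
    merged.discard("")  # ignore blank lines
    return sorted(merged)


def build_vocabulary(docs, token_pattern=r"\w+", stopwords=(),
                     min_doc_freq=2, max_doc_fraction=1.0):
    """Count document frequencies in one pass and return the pruned vocabulary.

    Hypothetical helper: only counts are kept in memory, nothing is written out.
    The pattern is applied to lowercased text.
    """
    token_re = re.compile(token_pattern)
    stop = {w.lower() for w in stopwords}
    doc_freq = Counter()
    n_docs = 0
    for text in docs:
        n_docs += 1
        # Use a set so each word counts at most once per document.
        doc_freq.update({t for t in token_re.findall(text.lower())
                         if t not in stop})
    max_df = max_doc_fraction * n_docs
    return sorted(w for w, df in doc_freq.items()
                  if min_doc_freq <= df <= max_df)


# Toy data, invented for this example.
docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "A cat and a dog",
]
stop = merge_stoplists(["the", "on", "a", "and"], ["sat"])
vocab = build_vocabulary(docs, token_pattern=r"[a-z]+",
                         stopwords=stop, min_doc_freq=2)
print(vocab)  # ['cat', 'dog']
```

The merged list could be written to a file, one word per line, and supplied wherever the loader accepts a stopword file. Whether a counting pass like this would keep bulk-load fast once regex tokenization is allowed is exactly the open question in the comment above.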