mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

How to use the word stemming function in Mallet #208

Closed: Yan-LCAS closed this issue 1 year ago

Yan-LCAS commented 1 year ago

This is my command:

mallet import-file --input myfile.mallet --output myfile-stemming.mallet --token-regex '[\p{L}\p{M}]+' --keep-sequence --use-pipe-from class\cc\mallet\pipe\TokenSequence2PorterStems.class

This is the reply:

java.io.StreamCorruptedException: invalid stream header: CAFEBABE
    at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:987)
    at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:414)
    at cc.mallet.types.InstanceList.load(InstanceList.java:821)
    at cc.mallet.classify.tui.Csv2Vectors.main(Csv2Vectors.java:146)
Exception in thread "main" java.lang.IllegalArgumentException: Couldn't read InstanceList from file class\cc\mallet\pipe\TokenSequence2PorterStems.class
    at cc.mallet.types.InstanceList.load(InstanceList.java:830)
    at cc.mallet.classify.tui.Csv2Vectors.main(Csv2Vectors.java:146)

How can it be fixed?

mimno commented 1 year ago

First, we strongly discourage the use of Porter stemmers. It's almost certainly not doing what you expect.

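The --use-pipe-from flag expects its argument to be a serialized instance list, that is, a file written by a previous MALLET import, so that you can repeat a complete import process on new data. It doesn't add a specific pipe. You're giving it a compiled class file, which is why Java reports the header CAFEBABE (the magic number of .class files) rather than finding a serialization stream. To add stemming you would need to make a copy of Csv2Vectors and add the stemmer class to the pipe sequence.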
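To see why the loader rejects the file, it can help to look at the first four bytes yourself. The following is a minimal sketch using only the Java standard library (Java 9 or later); the class name StreamHeaderCheck and the method describeHeader are hypothetical helpers for illustration and are not part of MALLET. It relies on two well-known constants: compiled class files begin with 0xCAFEBABE, and Java serialization streams, which ObjectInputStream expects, begin with 0xACED.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of MALLET: classifies a file by its leading bytes.
public class StreamHeaderCheck {

    // Magic number at the start of every compiled .class file.
    static final int CLASS_FILE_MAGIC = 0xCAFEBABE;
    // Magic number at the start of a Java object serialization stream.
    static final int SERIALIZATION_MAGIC = 0xACED;

    /** Describes what kind of file starts with the given bytes. */
    static String describeHeader(byte[] head) {
        if (head.length < 4) {
            return "too short to identify";
        }
        // Read the first four bytes as one big-endian int.
        int header = ByteBuffer.wrap(head, 0, 4).getInt();
        if (header == CLASS_FILE_MAGIC) {
            return "compiled Java class";
        }
        // The serialization magic occupies only the upper two bytes;
        // the lower two hold the stream version.
        if ((header >>> 16) == SERIALIZATION_MAGIC) {
            return "Java serialized object stream";
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StreamHeaderCheck <file>");
            return;
        }
        byte[] head = new byte[4];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // readNBytes keeps reading until 4 bytes arrive or the file ends.
            n = in.readNBytes(head, 0, 4);
        }
        System.out.println(args[0] + ": " + describeHeader(Arrays.copyOf(head, n)));
    }
}
```

Running this against the path passed to --use-pipe-from should tell you whether the file is something ObjectInputStream could plausibly read. It only inspects the header, so a file that passes this check may still fail to load for other reasons.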

Yan-LCAS commented 1 year ago

Thanks