mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

Unsupervised learning using K-Means is not usable #16

Open mommi84 opened 9 years ago

mommi84 commented 9 years ago

Since unsupervised learning does not need labels, I assumed that all instances should be placed in a single cluster before being parsed. Unfortunately, calling Clusterings2Clusterings throws the following exception:

$ java -cp dist/*:lib/* cc.mallet.cluster.tui.Clusterings2Clusterings --input text.clusterings --training-proportion 0.5 --output-prefix text.clusterings
number clusterings=1
Exception in thread "main" java.lang.IllegalArgumentException: Number of labels must be strictly positive.
    at cc.mallet.cluster.Clustering.<init>(Clustering.java:41)
    at cc.mallet.cluster.util.ClusterUtils.createSingletonClustering(ClusterUtils.java:107)
    at cc.mallet.cluster.tui.Clusterings2Clusterings.createSmallerClustering(Clusterings2Clusterings.java:141)
    at cc.mallet.cluster.tui.Clusterings2Clusterings.main(Clusterings2Clusterings.java:118)

ghost commented 9 years ago

Why not give them all the same "default" label? Then the number of labels would be 1, which is greater than 0. Would that work?
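
To make the "strictly positive" condition concrete, here is a minimal sketch of that kind of check. It is a hypothetical stand-in, not MALLET's `Clustering` class: the names `LabelCheckSketch`, `validate`, and `singletonLabels` are invented for illustration, and the idea that a singleton clustering assigns one label per instance is an assumption based only on the method name in the stack trace. Under that assumption, the label count equals the instance count, so the message would appear when zero instances reach that step, regardless of how the input was labeled.

```java
import java.util.Arrays;

// Hypothetical stand-in for illustration only; this is NOT MALLET code.
final class LabelCheckSketch {

    // Rejects a labeling with no labels, mirroring the message in the stack trace.
    static void validate(int numLabels, int[] labels) {
        if (numLabels < 1) {
            throw new IllegalArgumentException("Number of labels must be strictly positive.");
        }
        for (int label : labels) {
            if (label < 0 || label >= numLabels) {
                throw new IllegalArgumentException("Label out of range: " + label);
            }
        }
    }

    // ASSUMPTION: a singleton clustering gives instance i its own label i,
    // so the number of labels equals the number of instances.
    static int[] singletonLabels(int numInstances) {
        int[] labels = new int[numInstances];
        for (int i = 0; i < numInstances; i++) {
            labels[i] = i;
        }
        validate(numInstances, labels);
        return labels;
    }

    public static void main(String[] args) {
        // Three instances: three labels, the check passes.
        System.out.println(Arrays.toString(singletonLabels(3)));
        // Zero instances: zero labels, the check fails.
        try {
            singletonLabels(0);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the real code behaves like this sketch, a shared "default" label would not change the outcome, and the thing to check would be whether any instances survive the `--training-proportion` split.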

mommi84 commented 9 years ago

In this example, every file represents an instance. The label of an instance is the folder (or clustering) in which the file is located, so they all already had the same "default" label.
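
The folder-as-label convention described above can be sketched as follows. This only illustrates the convention with the standard library and does not use MALLET's importers; the class name `FolderLabelSketch` and the paths under `corpus/default` are made up for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of MALLET.
final class FolderLabelSketch {

    // The label of a file is the name of the folder that directly contains it.
    static String labelOf(Path file) {
        Path parent = file.getParent();
        if (parent == null || parent.getFileName() == null) {
            throw new IllegalArgumentException("File has no named parent folder: " + file);
        }
        return parent.getFileName().toString();
    }

    // Counts how many files carry each label.
    static Map<String, Integer> countLabels(List<Path> files) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Path file : files) {
            counts.merge(labelOf(file), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up paths: every file sits in the same "default" folder.
        List<Path> files = Arrays.asList(
                Paths.get("corpus/default/a.txt"),
                Paths.get("corpus/default/b.txt"));
        System.out.println(countLabels(files)); // prints {default=2}
    }
}
```

With all files in one folder there is exactly one distinct label, which matches the point that the instances already shared a single "default" label before the exception occurred.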