Label and Cluster evaluation?

nikkisingh111333 commented 3 years ago

Hey, I'm playing around with k-means clustering, and at evaluation time I was using our old friend, the Iris data. I'm new to unsupervised learning, so here's my question: how do I evaluate clusters against a label column ('Species' in the case of Iris)? Here's what I'm doing, simple stuff to get familiar with clustering in Tribuo:

Map<String,FieldProcessor> fieldProcessors = new HashMap<String,FieldProcessor>();
// Map each field name to the FieldProcessor that turns it into features
fieldProcessors.put("Species",new IdentityProcessor("Species"));
RowProcessor rowProcessor = new RowProcessor<Label>(new FieldResponseProcessor("Species","UNK",new LabelFactory()),fieldProcessors);
CSVDataSource irisData = new CSVDataSource<Label>(Paths.get("C:\\Users\\Nikki singh\\Downloads\\Iris.csv"),rowProcessor,true);
TrainTestSplitter splitIrisData = new TrainTestSplitter<>(irisData,
                      /* Train fraction */ 0.7,
                            /* RNG seed */ 0L);
MutableDataset train = new MutableDataset(splitIrisData.getTrain());
MutableDataset test = new MutableDataset(splitIrisData.getTest());
//K MEANS CLUSTERING TRAINER
KMeansTrainer trainer = new KMeansTrainer(5,100,Distance.EUCLIDEAN,5,1234);
Long startTime = System.currentTimeMillis();  
KMeansModel model = trainer.train(train);
System.out.println("Feature:"+model.getFeatureIDMap().get("B"));
Long endTime = System.currentTimeMillis();
// Print the k centroids found by training (up to 100 iterations)
DenseVector[] centroids = model.getCentroidVectors();
for (DenseVector centroid : centroids) {
    System.out.println("cLUSTERS:"+centroid);
}
ClusteringEvaluator eva=new ClusteringEvaluator();
ClusteringEvaluation c= eva.evaluate(model,train);
System.out.println(c.adjustedMI()+"-----"+c.normalizedMI());

And I'm getting this:

Exception in thread "main" java.lang.ClassCastException: org.tribuo.classification.Label cannot be cast to org.tribuo.clustering.ClusterID

How do I evaluate the clusters and see whether they line up with the classes? One more thing: I have heard about k-modes clustering. Does that exist in Tribuo for now?

Craigacp commented 3 years ago

You can't use a clustering evaluator on a classification dataset. You washed all the types off which removes the guarantees about the dataset, and the only reason it didn't throw a ClassCastException during model training is that the KMeansTrainer doesn't check the output type (which I guess it should do as this shouldn't have happened).

If you want to compare how the clusters line up against classification labels then you should write a new ResponseProcessor that converts the labels into numerical ids which can be consumed by the ClusteringFactory to create new ClusterIDs. That will give you a DataSource<ClusterID> and you can keep the generic types all through the computation. Then the ClusteringEvaluation will compute the mutual information between the predicted cluster ids and the ground truth "cluster ids" (i.e. the labels). If that mutual information is close to 1.0 then the clusters are similar to the labels, if it's close to 0.0 then the clusters and labels don't line up very well.
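To make the mutual information comparison concrete, here is a minimal sketch in plain Java with no Tribuo dependency. The class `NmiSketch` and its methods are hypothetical names, and the normalization used (mutual information divided by the geometric mean of the two entropies) is one common choice that is assumed here; Tribuo's own `normalizedMI()` may normalize differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Tribuo: normalized mutual information
// between two integer labelings of the same data points.
final class NmiSketch {

    // Shannon entropy (natural log) of one labeling.
    static double entropy(int[] ids) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int id : ids) {
            counts.merge(id, 1, Integer::sum);
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / ids.length;
            h -= p * Math.log(p);
        }
        return h;
    }

    // Mutual information (natural log) between two labelings of equal length.
    static double mutualInformation(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Labelings must have the same length");
        }
        Map<Integer, Integer> countA = new HashMap<>();
        Map<Integer, Integer> countB = new HashMap<>();
        Map<Long, Integer> joint = new HashMap<>();
        for (int i = 0; i < a.length; i++) {
            countA.merge(a[i], 1, Integer::sum);
            countB.merge(b[i], 1, Integer::sum);
            // Pack the (a, b) pair into a single long key.
            long key = ((long) a[i] << 32) | (b[i] & 0xffffffffL);
            joint.merge(key, 1, Integer::sum);
        }
        double n = a.length;
        double mi = 0.0;
        for (Map.Entry<Long, Integer> e : joint.entrySet()) {
            long key = e.getKey();
            int ai = (int) (key >> 32);
            int bi = (int) key;
            double pab = e.getValue() / n;
            double pa = countA.get(ai) / n;
            double pb = countB.get(bi) / n;
            mi += pab * Math.log(pab / (pa * pb));
        }
        return mi;
    }

    // Assumed normalization: MI / sqrt(H(a) * H(b)); 0 if either entropy is ~0.
    static double normalizedMI(int[] a, int[] b) {
        double ha = entropy(a);
        double hb = entropy(b);
        if (ha < 1e-10 || hb < 1e-10) {
            return 0.0;
        }
        return mutualInformation(a, b) / Math.sqrt(ha * hb);
    }

    public static void main(String[] args) {
        int[] clusters = {0, 0, 1, 1, 2, 2};
        int[] labels = {2, 2, 0, 0, 1, 1}; // same grouping, different id names
        System.out.println(normalizedMI(clusters, labels));
    }
}
```

Because mutual information only depends on which points are grouped together, the numeric ids given to the labels do not need to match the ids the clustering happens to assign.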

Looks like k-modes is for categorical data. We've not looked into that algorithm. We're considering adding k-medoids, which guarantees that each cluster centre is a datapoint rather than the mean of the cluster's points, but we've not implemented it yet.
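The difference between a mean centre and a medoid can be illustrated with a small plain-Java sketch. `MedoidSketch` is a hypothetical name, this is not Tribuo code, and it only shows the centre computation for a single cluster under Euclidean distance, not a full k-medoids algorithm.

```java
// Hypothetical illustration, not part of Tribuo: mean vs. medoid of one cluster.
final class MedoidSketch {

    // Euclidean distance between two points of the same dimension.
    static double distance(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = p[i] - q[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Coordinate-wise mean; usually not one of the input points.
    static double[] mean(double[][] points) {
        double[] m = new double[points[0].length];
        for (double[] p : points) {
            for (int i = 0; i < m.length; i++) {
                m[i] += p[i];
            }
        }
        for (int i = 0; i < m.length; i++) {
            m[i] /= points.length;
        }
        return m;
    }

    // Medoid: the input point with the smallest total distance to all others.
    static double[] medoid(double[][] points) {
        double[] best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (double[] candidate : points) {
            double cost = 0.0;
            for (double[] other : points) {
                cost += distance(candidate, other);
            }
            if (cost < bestCost) {
                bestCost = cost;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] cluster = {{0, 0}, {1, 1}, {2, 0}, {10, 10}};
        System.out.println(java.util.Arrays.toString(mean(cluster)));
        System.out.println(java.util.Arrays.toString(medoid(cluster)));
    }
}
```

The medoid is always a row of the input, so it stays meaningful for data where an averaged point would not be a valid example.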

nikkisingh111333 commented 3 years ago

OK, so can you point me to some code snippets showing how to use a ResponseProcessor and the ClusteringFactory to get what I want? There are so many classes and interfaces that I'm kind of lost in the docs.

Craigacp commented 3 years ago

Something like this should be sufficient. I've not tested it, and it's not going to be part of Tribuo.

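import java.util.Optional;

import com.oracle.labs.mlrg.olcut.config.Config;
import com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance;
import com.oracle.labs.mlrg.olcut.provenance.impl.ConfiguredObjectProvenanceImpl;
import org.tribuo.OutputFactory;
import org.tribuo.clustering.ClusterID;
import org.tribuo.clustering.ClusteringFactory;
import org.tribuo.data.columnar.ResponseProcessor;

public class IrisClusterResponseProcessor implements ResponseProcessor<ClusterID> {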

    @Config(mandatory = true,description="The field name to read.")
    private String fieldName;

    private final OutputFactory<ClusterID> outputFactory = new ClusteringFactory();

    private IrisClusterResponseProcessor() {}

    public IrisClusterResponseProcessor(String fieldName) {
        this.fieldName = fieldName;
    }

    @Override
    public OutputFactory<ClusterID> getOutputFactory() {
        return outputFactory;
    }

    @Override
    public String getFieldName() {
        return fieldName;
    }

    @Deprecated
    @Override
    public void setFieldName(String fieldName) {
        this.fieldName = fieldName;
    }

    @Override
    public Optional<ClusterID> process(String value) {
        if ("Iris-setosa".equals(value)) {
            return Optional.of(outputFactory.generateOutput("0"));
        } else if ("Iris-versicolor".equals(value)) {
            return Optional.of(outputFactory.generateOutput("1"));
        } else if ("Iris-virginica".equals(value)) {
            return Optional.of(outputFactory.generateOutput("2"));
        } else {
            return Optional.empty();
        }
    }

    @Override
    public ConfiguredObjectProvenance getProvenance() {
        return new ConfiguredObjectProvenanceImpl(this,"ResponseProcessor");
    }
}
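The hard-coded species names in `process` could in principle be replaced by a lookup that assigns ids as new label strings are seen. The following is a minimal plain-Java sketch of that idea only; `LabelIndexerSketch` and `idFor` are hypothetical names, it is untested against Tribuo, and whether a stateful lookup like this is appropriate inside a configurable ResponseProcessor is an assumption that would need checking.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Tribuo: assigns "0", "1", "2", ... to
// label strings in the order they are first seen.
final class LabelIndexerSketch {

    private final Map<String, Integer> ids = new LinkedHashMap<>();

    // Returns the id string for a label, or empty for a missing/blank value.
    Optional<String> idFor(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        String key = value.trim();
        Integer id = ids.get(key);
        if (id == null) {
            id = ids.size();
            ids.put(key, id);
        }
        return Optional.of(Integer.toString(id));
    }

    public static void main(String[] args) {
        LabelIndexerSketch indexer = new LabelIndexerSketch();
        System.out.println(indexer.idFor("Iris-setosa"));
        System.out.println(indexer.idFor("Iris-versicolor"));
        System.out.println(indexer.idFor("Iris-setosa"));
    }
}
```

The returned string plays the role of the literal `"0"`, `"1"`, `"2"` arguments passed to `outputFactory.generateOutput(...)` above. The ids depend on row order, which is harmless for a mutual information comparison because it does not depend on how the groups are numbered.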

nikkisingh111333 commented 3 years ago

OK, so every time I supply a dataset I need to do this. Thanks. Are you planning anything automatic for this kind of thing? I have just finished writing my own uniqueFeatureEncoder class by implementing FieldProcessor. It was easy, but it would be good if some parameter options were provided on the API side, e.g. "binary" for binarized features and "real" for real-valued data. By the way, thanks for your help and quick support.

Craigacp commented 3 years ago

Well, in general you shouldn't try to feed classification data to a clustering task. Tribuo is set up to prevent users from confusing the prediction tasks like that, so it needs some tricks to make it work.

We add new implementations of FieldProcessor, ResponseProcessor etc. as we discover a need for them, either in the community or from our internal work. The columnar infrastructure is designed to be flexible, but it has to get the information about the type of each feature from somewhere. We don't want it to have to read the dataset twice just to understand it, nor do we want it to make guesses about the feature types, as those might be tricky to undo. If you've got a concrete feature request then make a new issue and write it up in some detail. I'm not clear exactly what you're asking for at the moment, as there are already field processors that emit binary features and real-valued features.

nikkisingh111333 commented 3 years ago

OK, one last thing before wrapping up: what should I choose to turn a categorical feature into a unique numeric encoding, as just one feature rather than one feature per class? Should I use REAL or CATEGORICAL or something else?

Craigacp commented 3 years ago

I don't understand the question, could you give an example?

nikkisingh111333 commented 3 years ago

For example, I have a column shirt size = [tiny, medium, large] and I want to encode it numerically, not as one feature per class but as a single feature whose classes are labelled with numbers [1=tiny, 2=medium, 3=large]. As I have written my own RowProcessor, which should I use: GeneratedFeatureType.BINARISED_CATEGORICAL, GeneratedFeatureType.REAL, or GeneratedFeatureType.CATEGORICAL? There's also GeneratedFeatureType.TEXT. Which one should I choose?

Craigacp commented 3 years ago

It's GeneratedFeatureType.CATEGORICAL. We're working on further changes to Tribuo's internal type system, as at the moment that enum only really interacts with the LIME explanation module, but in the future it will control how the feature statistics are computed.
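For the shirt size example, the single-feature encoding can be sketched in plain Java as below. `OrdinalEncoderSketch` is a hypothetical name and this is not Tribuo code; in a custom FieldProcessor the returned number would presumably become the value of the one emitted feature, with the feature type reported as `GeneratedFeatureType.CATEGORICAL`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical helper, not part of Tribuo: maps categories to 1-based codes
// in the order supplied, e.g. tiny=1, medium=2, large=3.
final class OrdinalEncoderSketch {

    private final List<String> ordered;

    OrdinalEncoderSketch(String... orderedCategories) {
        this.ordered = new ArrayList<>(Arrays.asList(orderedCategories));
    }

    // Returns the 1-based code, or empty if the value is not a known category.
    OptionalInt encode(String value) {
        int idx = ordered.indexOf(value);
        return idx < 0 ? OptionalInt.empty() : OptionalInt.of(idx + 1);
    }

    public static void main(String[] args) {
        OrdinalEncoderSketch sizes = new OrdinalEncoderSketch("tiny", "medium", "large");
        System.out.println(sizes.encode("medium"));
        System.out.println(sizes.encode("huge"));
    }
}
```

Returning an empty result for unknown values leaves the decision about missing categories to the caller instead of silently inventing a code.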

nikkisingh111333 commented 3 years ago

Yes, I'm waiting for it. Thanks for your great support.

oracle / tribuo

Label and Cluster evaluation? #105