Closed nikkisingh111333 closed 3 years ago
You can't use a clustering evaluator on a classification dataset. You washed all the types off, which removes the guarantees about the dataset, and the only reason it didn't throw a ClassCastException during model training is that the KMeansTrainer doesn't check the output type (which I guess it should do, as this shouldn't have happened).
If you want to compare how the clusters line up against classification labels then you should write a new ResponseProcessor that converts the labels into numerical ids which can be consumed by the ClusteringFactory to create new ClusterIDs. That will give you a DataSource<ClusterID> and you can keep the generic types all through the computation. Then the ClusteringEvaluation will compute the normalized mutual information between the predicted cluster ids and the ground truth "cluster ids" (i.e. the labels). If that value is close to 1.0 then the clusters are similar to the labels; if it's close to 0.0 then the clusters and labels don't line up very well.
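To make the 0.0 to 1.0 scale above concrete, here is a minimal, self-contained sketch of a normalized mutual information computation over two arrays of ids. It is not Tribuo's implementation: the class name NmiSketch and its methods are made up for illustration, and dividing by the geometric mean of the two entropies is an assumption (it is one common normalisation among several).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Tribuo.
final class NmiSketch {

    /** Shannon entropy (in nats) of a count table whose counts sum to n. */
    static double entropy(Map<?, Integer> counts, int n) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / n;
            h -= p * Math.log(p);
        }
        return h;
    }

    /**
     * Mutual information between the two id sequences, divided by
     * sqrt(H(predicted) * H(truth)). Returns 0.0 when either sequence
     * has zero entropy (e.g. everything is in one cluster).
     */
    static double normalizedMI(int[] predicted, int[] truth) {
        if (predicted.length != truth.length || predicted.length == 0) {
            throw new IllegalArgumentException("Arrays must be non-empty and the same length");
        }
        int n = predicted.length;
        Map<Integer, Integer> predCounts = new HashMap<>();
        Map<Integer, Integer> truthCounts = new HashMap<>();
        Map<Long, Integer> jointCounts = new HashMap<>();
        for (int i = 0; i < n; i++) {
            predCounts.merge(predicted[i], 1, Integer::sum);
            truthCounts.merge(truth[i], 1, Integer::sum);
            // Pack the (predicted, truth) pair into a single long key.
            long key = ((long) predicted[i] << 32) | (truth[i] & 0xffffffffL);
            jointCounts.merge(key, 1, Integer::sum);
        }
        double hPred = entropy(predCounts, n);
        double hTruth = entropy(truthCounts, n);
        double hJoint = entropy(jointCounts, n);
        double mi = hPred + hTruth - hJoint; // I(X;Y) = H(X) + H(Y) - H(X,Y)
        double denom = Math.sqrt(hPred * hTruth);
        return denom == 0.0 ? 0.0 : mi / denom;
    }

    public static void main(String[] args) {
        int[] labels = {0, 0, 1, 1, 2, 2};
        int[] clusters = {1, 1, 0, 0, 2, 2}; // same grouping, different id names
        System.out.println(normalizedMI(clusters, labels));
    }
}
```

Note that the score only depends on how the points are grouped, not on which number each cluster happens to get, which is why it suits comparing cluster ids against labels.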
Looks like k-modes is for categorical data. We've not looked into that algorithm. We're considering adding k-medoids, which guarantees that each cluster centre is an actual datapoint rather than the mean of the cluster, but we've not implemented it yet.
Ok, so can you give me some code snippets showing how I can use ResponseProcessor and ClusteringFactory to get what I want? There are too many classes and interfaces, so I'm kind of lost in the docs.
Something like this should be sufficient. I've not tested it, and it's not going to be part of Tribuo.
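    import java.util.Optional;

    import com.oracle.labs.mlrg.olcut.config.Config;
    import com.oracle.labs.mlrg.olcut.provenance.ConfiguredObjectProvenance;
    import com.oracle.labs.mlrg.olcut.provenance.impl.ConfiguredObjectProvenanceImpl;
    import org.tribuo.OutputFactory;
    import org.tribuo.clustering.ClusterID;
    import org.tribuo.clustering.ClusteringFactory;
    import org.tribuo.data.columnar.ResponseProcessor;

    public class IrisClusterResponseProcessor implements ResponseProcessor<ClusterID> {

        // Name of the column holding the species label.
        @Config(mandatory = true, description = "The field name to read.")
        private String fieldName;

        private final OutputFactory<ClusterID> outputFactory = new ClusteringFactory();

        // No-arg constructor for the configuration system.
        private IrisClusterResponseProcessor() {}

        public IrisClusterResponseProcessor(String fieldName) {
            this.fieldName = fieldName;
        }

        @Override
        public OutputFactory<ClusterID> getOutputFactory() {
            return outputFactory;
        }

        @Override
        public String getFieldName() {
            return fieldName;
        }

        @Deprecated
        @Override
        public void setFieldName(String fieldName) {
            this.fieldName = fieldName;
        }

        // Maps each Iris species name onto a fixed numerical cluster id.
        @Override
        public Optional<ClusterID> process(String value) {
            if ("Iris-setosa".equals(value)) {
                return Optional.of(outputFactory.generateOutput("0"));
            } else if ("Iris-versicolor".equals(value)) {
                return Optional.of(outputFactory.generateOutput("1"));
            } else if ("Iris-virginica".equals(value)) {
                return Optional.of(outputFactory.generateOutput("2"));
            } else {
                // Unknown label: no output for this row.
                return Optional.empty();
            }
        }

        @Override
        public ConfiguredObjectProvenance getProvenance() {
            return new ConfiguredObjectProvenanceImpl(this, "ResponseProcessor");
        }
    }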
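If the hard-coded species names are the main obstacle to reusing this on another dataset, the label-to-id step could be pulled out into a small lookup table. The sketch below is plain Java with no Tribuo dependency; the class LabelToClusterId and its lookup method are hypothetical names, and wiring it into process(String) is left as an assumption rather than something that has been tested.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Tribuo.
final class LabelToClusterId {

    // Label -> zero-based id, in the order the labels were supplied.
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    LabelToClusterId(String... labels) {
        for (String label : labels) {
            // Duplicate labels keep the id they were first given.
            ids.putIfAbsent(label, ids.size());
        }
    }

    /** Returns the id as a string (the form passed to generateOutput above), or empty for unknown labels. */
    Optional<String> lookup(String value) {
        Integer id = ids.get(value);
        return id == null ? Optional.empty() : Optional.of(Integer.toString(id));
    }

    public static void main(String[] args) {
        LabelToClusterId iris = new LabelToClusterId("Iris-setosa", "Iris-versicolor", "Iris-virginica");
        System.out.println(iris.lookup("Iris-versicolor"));
        System.out.println(iris.lookup("not-a-species"));
    }
}
```

Inside a response processor, the body of process(String) would then reduce to a lookup followed by a call to the output factory for known labels.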
Ok, so every time I load a dataset I need to do this. Thanks. Are you planning anything more automatic for this kind of thing? I have just finished writing my own uniqueFeatureEncoder class by implementing FieldProcessor. It was easy, but it would be good if the API offered parameter options for this, such as "binary" for binarized features and "real" for real-valued data. By the way, thanks for your help and quick support.
Well, in general you shouldn't try to feed classification data to a clustering task. Tribuo is set up to prevent users from confusing the prediction tasks like that, so it needs some tricks to make it work.
We add new implementations of FieldProcessor, ResponseProcessor, etc. as we discover a need for them, either in the community or from our internal work. The columnar infrastructure is designed to be flexible, but it has to get the information about the type of each feature from somewhere. We don't want it to have to read the dataset twice just to understand it, nor do we want it to make guesses about the feature types, as those might be tricky to undo. If you've got a concrete feature request then make a new issue and write it up in some detail. I'm not clear exactly what you're asking for at the moment, as there are already field processors that emit binary features and real-valued features.
Ok, one last thing before wrapping up: what should I choose to turn a categorical feature into a unique encoding, but as just one feature, not a feature for every class? Should I use REAL or CATEGORICAL or something else?
I don't understand the question, could you give an example?
For example, I have a column shirt size = [tiny, medium, large] and I want to encode it as a number, not as one feature per class: just a single feature with the classes labelled as numbers [1=tiny, 2=medium, 3=large]. As I have written my own row processor, what should I use: GeneratedFeatureType.BINARISED_CATEGORICAL, GeneratedFeatureType.REAL, or GeneratedFeatureType.CATEGORICAL? There's also GeneratedFeatureType.TEXT. Which one should I choose?
It's GeneratedFeatureType.CATEGORICAL. We're working on further changes to Tribuo's internal type system, as at the moment that enum only really interacts with the LIME explanation module, but in the future it will control how the feature statistics are computed.
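For the shirt-size example, the encoding step itself (one numeric value per ordered category) can be sketched in plain Java as below. OrdinalEncoder is a hypothetical class, not a Tribuo FieldProcessor, and starting the codes at 1 simply follows the [1=tiny, 2=medium, 3=large] numbering in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical helper, not part of Tribuo.
final class OrdinalEncoder {

    // Category names in their intended order, smallest first.
    private final List<String> orderedLevels;

    OrdinalEncoder(String... orderedLevels) {
        this.orderedLevels = new ArrayList<>(Arrays.asList(orderedLevels));
    }

    /** Returns the 1-based position of the category, or empty if it is not a known level. */
    OptionalInt encode(String value) {
        int index = orderedLevels.indexOf(value);
        return index < 0 ? OptionalInt.empty() : OptionalInt.of(index + 1);
    }

    public static void main(String[] args) {
        OrdinalEncoder shirtSize = new OrdinalEncoder("tiny", "medium", "large");
        System.out.println(shirtSize.encode("medium"));
        System.out.println(shirtSize.encode("huge"));
    }
}
```

A single numeric feature like this implies an ordering and spacing between the categories, which is reasonable for sizes but would be misleading for unordered categories such as colours.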
Yes, I'm waiting for it. Thanks for your great support.
Hey, I'm playing around with k-means clustering, and at evaluation time I was playing with our old friend, the Iris data. I'm new to unsupervised learning, so here's my question: how do I evaluate clusters against the label column ('species' in the case of Iris)? Here's what I'm doing, simple stuff to get familiar with clustering in Tribuo:
And I'm getting this:
How do I evaluate and see whether my clusters correctly match the classes or not? One more thing: I have heard about k-modes clustering. Does that kind of thing exist in Tribuo for now?