Closed: sachinsharma9780 closed this issue 3 years ago.
Thanks for this great idea @sachinsharma9780! This isn't currently supported but it should be.
Here's what I'm thinking: right now users can specify a metadata file that includes some information (e.g. date, categorical labels) about the images in the input image glob. I think it would make sense to update that metadata file to allow users to specify a label attribute that could be used for supervised / semi-supervised UMAP projection.
Does that sound like a good direction forward?
Yeah, the idea seems good, and it would be really helpful. Please let me know when you complete this integration. Do you have a rough idea of when it will be done? Also, thank you for this nice and fast visualization library.
@sachinsharma9780 Thanks again for this idea. I just pushed a branch that allows one to create a supervised UMAP projection. To do so, you should first update your installed version of pixplot:
pip install pixplot==0.0.89
Note that it may take a few minutes for PyPi to make 0.0.89 available. Once that's installed you just need to create a csv metadata file that looks like:
filename,label
10.png,0
7812.png,1
2669.png,2
101.png,0
where the label column designates the categorical attribute for the specified filename to be used in the supervised projection. The dataset above is MNIST, so the categorical label for each image is a digit, but any categorical label can be used. In the case of MNIST, the supervised projection separates the digits quite nicely (a detail screenshot was attached).
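As a quick sketch, a metadata file in the format above can be generated programmatically with nothing but Python's standard library. The filenames and labels here are placeholder values mirroring the MNIST example, and `photo_metadata.csv` is an assumed output name:

```python
import csv

# Placeholder (filename, label) pairs mirroring the MNIST example above.
rows = [
    ("10.png", 0),
    ("7812.png", 1),
    ("2669.png", 2),
    ("101.png", 0),
]

with open("photo_metadata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "label"])  # header row pixplot expects
    writer.writerows(rows)
```

The resulting file can then be passed to pixplot via its `--metadata` flag.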
If you hit any bumps please feel free to follow up and we can get them sorted...
@duhaime Thank you very much for this quick integration. I will test this with my data and let you know.
Hi, I ran it on my data, which has 8 labels, but during processing and in the visualization it creates 34 clusters. In the image you showed above, the number of clusters should be equal to the number of classes. I created a metadata file as you described. Is there any flag we need to pass to differentiate between supervised and unsupervised reduction?
Thanks for your follow up. The clustering is distinct from the UMAP projection. The clustering is based on the UMAP projection, but the number of clusters is not guaranteed to be identical to the number of distinct labels.
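To illustrate that distinction (a sketch, not pixplot's actual clustering code): a clustering algorithm run on a 2D embedding picks its own number of groups, regardless of how many labels went into the projection. Here scikit-learn's KMeans stands in for whatever clusterer is used, purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Pretend this is a 2D UMAP embedding of images with only 2 class labels.
embedding = rng.normal(size=(200, 2))
labels = rng.integers(0, 2, size=200)  # 2 distinct input labels

# The clustering step is free to find a different number of groups.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(embedding)

print(len(set(labels)))      # distinct input labels: 2
print(len(set(km.labels_)))  # clusters found by the clustering step: 5
```

The number of clusters is a parameter (or output) of the clustering step, not of the labels that shaped the projection.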
That said, this seems like a large delta. Did you pass your metadata file to the pixplot command? You should use something like:
pixplot --images "photos/*.jpg" --metadata "photo_metadata.csv"
Also, you should delete the ./output directory before rerunning the pixplot command, as the code will reuse the umap projection it finds in ./output to save time.
If you delete the output folder and rerun the command with the --metadata flag, I'll be interested to hear how it turns out...
This is an interesting point. My hunch is that there is a decent amount of variance among the images that share a common label in your dataset. In the case of MNIST, there's very little variation (relatively speaking) among images that share a label, so the images separate nicely. Based on the information you've sent, though, it seems there is more variance among observations that share a label in your dataset, hence the increased number of clusters.
I was previously thinking that one would only use the labels to influence the projection, but perhaps the labels should influence the clustering as well. If a user provides labels for their data, perhaps the plot should create one cluster for each label. Then the clusters could use the label as the text label for the cluster as well (instead of "Cluster 1", "Cluster 2"). Does that sound like a better path forward?
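A rough sketch of that idea (hypothetical, not pixplot code): if every image carries a label, the "clusters" can simply be the groups of points that share a label, with the label string reused as the cluster's display name and a centroid computed for placing that name:

```python
from collections import defaultdict

# Hypothetical (x, y, label) points from a supervised projection.
points = [
    (0.1, 0.2, "cat"),
    (0.2, 0.1, "cat"),
    (5.0, 5.1, "dog"),
    (5.2, 4.9, "dog"),
]

# One cluster per distinct label.
clusters = defaultdict(list)
for x, y, label in points:
    clusters[label].append((x, y))

# Use the label as the cluster name; the centroid anchors the text label.
for name, pts in clusters.items():
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    print(name, round(cx, 2), round(cy, 2))
```

This guarantees exactly one cluster per label, at the cost of ignoring any structure the projection finds within a label.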
You are right about the variance issue. For the supervised case, using text labels instead of Cluster 1, Cluster 2 seems a much better idea. But then how will you decide which cluster belongs to which class/label?
I'm still thinking about the appropriate cluster strategy in the supervised case. There are a few options...
One quick question just to make sure: what do you get when you run pip freeze | grep umap?
Also, do all of your observations have a class label, or are some missing labels?
After running the command I get: umap-learn==0.4.5. Every observation has a class label.
Amen, that's all as expected. We're reconfiguring the handling of clusters now, so I'll take a look at using the labels to create the set of hotspots...
Cool. Let me know when you are finished...
@sachinsharma9780 we've overhauled some of the hotspot / cluster logic in 0.0.91.
That said, there's still no automatic mechanism for generating exactly one cluster for each distinct value among the class labels assigned to input images.
I think the best way forward for this use case would be:
1) Use the new hotspot buttons to delete all the hotspots generated algorithmically.
2) Switch to the UMAP layout
3) Use the select dropdown to identify all of the images that have a particular class label
4) Manually draw a lasso around those images
5) Click the button to save that collection of images as a cluster
6) Repeat 3-5 for each other class label in your dataset
7) Save the updated hotspots and move user_hotspots.json to output/data/hotspots/user_hotspots.json
NB: For this to work, you should add a category column to your metadata file that's identical to your label column.
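That NB can be scripted with the standard library alone. This sketch assumes the metadata file is named `photo_metadata.csv` (as in the earlier example) and writes a copy with the label column duplicated as category:

```python
import csv

# Build a sample metadata file first (placeholder rows for the sketch).
with open("photo_metadata.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["filename", "label"])
    w.writerows([["10.png", "0"], ["7812.png", "1"]])

# Read the file and duplicate the label column as a category column.
with open("photo_metadata.csv", newline="") as f:
    rows = list(csv.DictReader(f))

with open("photo_metadata_with_category.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["filename", "label", "category"])
    w.writeheader()
    for row in rows:
        row["category"] = row["label"]  # category mirrors label exactly
        w.writerow(row)
```

The new file can then be passed to pixplot in place of the original metadata file.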
If you try this out, I'd be keen to hear your thoughts on the hotspot editing workflow!
@sachinsharma9780 I'm going to close this issue out but if the above workflow doesn't work for you or if you hit other errors or come across other enhancement ideas, please feel free to raise a new issue!
I have images and their corresponding labels and want to perform supervised UMAP dimensionality reduction and then visualize the result. Is it possible to do that?