glevyhas / pix-plot

A WebGL viewer for UMAP or TSNE-clustered images
MIT License
596 stars 139 forks

Is it possible to do Supervised UMap dimension reduction? #122

Closed sachinsharma9780 closed 3 years ago

sachinsharma9780 commented 4 years ago

I have images and their corresponding labels and want to perform supervised UMAP dimensionality reduction and then visualization. So is it possible to do that?

duhaime commented 4 years ago

Thanks for this great idea @sachinsharma9780! This isn't currently supported but it should be.

Here's what I'm thinking: right now users can specify a metadata file that includes some information (e.g. date, categorical labels...) about the images in the input image glob. I think it would make sense to update that metadata file to allow users to specify a label attribute that could be used for supervised / semi-supervised UMAP projection.

Does that sound like a good direction forward?

sachinsharma9780 commented 4 years ago

Yeah, the idea seems good. That would be really helpful. Please let me know once you complete this integration. Do you have a rough idea of when it will be completed? Also, thank you for this nice and fast visualization library.

duhaime commented 4 years ago

@sachinsharma9780 Thanks again for this idea. I just pushed a branch that allows one to create a supervised UMAP projection. To do so, you should first update your installed version of pixplot:

pip install pixplot==0.0.89

Note that it may take a few minutes for PyPI to make 0.0.89 available. Once that's installed you just need to create a CSV metadata file that looks like:

filename,label
10.png,0
7812.png,1
2669.png,2
101.png,0

where the label column designates the categorical attribute for the specified filename to be used in the supervised projection. The dataset above is MNIST, so the categorical label for each image is a digit, but any categorical label can be used. In the case of MNIST, the supervised projection separates the digits quite nicely (here's a detail):

[Screenshot: detail of the supervised UMAP projection of MNIST]

If you hit any bumps please feel free to follow up and we can get them sorted...
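For anyone generating the metadata file programmatically, here is a minimal sketch using only the Python standard library. The `write_metadata` helper and the `labels_by_filename` mapping are hypothetical names for illustration, not part of pixplot; the only thing taken from the comment above is the `filename,label` column layout.

```python
import csv
import io


def write_metadata(labels_by_filename, handle):
    """Write a metadata CSV with `filename` and `label` columns."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["filename", "label"])
    for filename, label in labels_by_filename.items():
        writer.writerow([filename, label])


# Hypothetical mapping from image filename to categorical label
labels_by_filename = {"10.png": 0, "7812.png": 1, "2669.png": 2, "101.png": 0}

# An in-memory buffer keeps the sketch self-contained; in practice,
# pass a file opened with open("metadata.csv", "w", newline="")
buffer = io.StringIO()
write_metadata(labels_by_filename, buffer)
print(buffer.getvalue())
```

The filenames here should match the names of the images passed to pixplot.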

sachinsharma9780 commented 4 years ago

@duhaime Thank you very much for this quick integration. I will test this with my data and let you know.

sachinsharma9780 commented 4 years ago

Hi, I ran it on my data, which has 8 labels, but during processing and in the visualization it creates 34 clusters. As in the image you showed above, the number of clusters should be equal to the number of classes. I created a metadata file as you described: snip Is there any flag we need to pass to differentiate between supervised and unsupervised reduction?

duhaime commented 4 years ago

Thanks for your follow up. The clustering is distinct from the UMAP projection. The clustering is based on the UMAP projection, but the number of clusters is not guaranteed to be identical to the number of distinct labels.

That said, this seems like a large delta. Did you pass your metadata file to the pixplot command? You should use something like:

pixplot --images "photos/*.jpg" --metadata "photo_metadata.csv"

Also, you should delete the ./output directory before rerunning the pixplot command, as the code will reuse the umap projection it finds in ./output to save time.

If you delete the output folder and rerun the command with the --metadata flag, I'll be interested to hear how it turns out...
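The two steps above (clear the cached output, then rerun with the metadata flag) could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, and the `--images` and `--metadata` flags and the `output` directory are taken from this comment rather than checked against any particular pixplot release.

```python
import shutil
import subprocess
from pathlib import Path


def build_pixplot_command(images_glob, metadata_path):
    """Build the argument list for the pixplot command shown above."""
    return ["pixplot", "--images", images_glob, "--metadata", metadata_path]


def remove_cached_output(output_dir="output"):
    """Delete the output directory if it exists; return True if it was removed."""
    path = Path(output_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


def rerun(images_glob, metadata_path, output_dir="output"):
    """Clear any cached projection, then run pixplot with the metadata file."""
    remove_cached_output(output_dir)
    # check=True raises CalledProcessError if pixplot exits with an error
    subprocess.run(build_pixplot_command(images_glob, metadata_path), check=True)
```

Calling `rerun("photos/*.jpg", "photo_metadata.csv")` would then mirror the command above. Because the arguments are passed as a list, the glob reaches pixplot unexpanded, just as it does when quoted on the command line.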

duhaime commented 4 years ago

This is an interesting point. My hunch is that there is a decent amount of variance among the images that share a common label in your dataset. In the case of MNIST, there's very little variation (relatively speaking) among images that share a label, so the images separate nicely. Based on the information you've sent, though, it seems there is more variance among observations that share a label in your dataset, hence the increased number of clusters.

I was previously thinking that one would only use the labels to influence the projection, but perhaps the labels should influence the clustering as well. If a user provides labels for their data, perhaps the plot should create one cluster for each label. Then the clusters could use the label as the text label for the cluster as well (instead of "Cluster 1", "Cluster 2"). Does that sound like a better path forward?

sachinsharma9780 commented 4 years ago

You are right about the variance issue. For the supervised case, using text labels instead of "Cluster 1", "Cluster 2" seems like a much better idea. But then how will you decide which cluster belongs to which class/label?

duhaime commented 4 years ago

I'm still thinking about the appropriate cluster strategy in the supervised case. There are a few options...

One quick question just to make sure: what do you get when you run pip freeze | grep umap?

Also, do all of your observations have a class label, or are some missing labels?
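One quick way to answer the second question is to scan the metadata CSV for empty labels and count the distinct ones. The sketch below assumes the `filename,label` layout described earlier in this thread; `summarize_labels` is a hypothetical helper and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def summarize_labels(handle, label_column="label"):
    """Return (count of images per label, filenames with an empty or missing label)."""
    counts = Counter()
    missing = []
    for row in csv.DictReader(handle):
        # Short rows yield None for the missing column, so fall back to ""
        label = (row.get(label_column) or "").strip()
        if label:
            counts[label] += 1
        else:
            missing.append(row.get("filename"))
    return counts, missing


# Made-up sample in which one image has no label
sample = io.StringIO("filename,label\n10.png,0\n7812.png,1\n2669.png,\n101.png,0\n")
counts, missing = summarize_labels(sample)
print(len(counts), "distinct labels:", dict(counts))
print("images without a label:", missing)
```

With a real file, replace the `io.StringIO` sample with `open("photo_metadata.csv", newline="")`.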

sachinsharma9780 commented 4 years ago

After running the command I get: umap-learn==0.4.5. Every observation has a class label.

duhaime commented 4 years ago

Amen, that's all as expected. We're reconfiguring the handling of clusters now, so I'll take a look at using the labels to create the set of hotspots...

sachinsharma9780 commented 4 years ago

Cool. Let me know when you are finished..

duhaime commented 4 years ago

@sachinsharma9780 we've overhauled some of the hotspot / cluster logic in 0.0.91.

That said, there's still no automatic mechanism for generating exactly one cluster for each distinct value among the class labels assigned to input images.

I think the best way forward for this use case would be:

1) Use the new hotspot buttons to delete all the hotspots generated algorithmically.
2) Switch to the UMAP layout.
3) Use the select dropdown to identify all of the images that have a particular class label.
4) Manually draw a lasso around those images.
5) Click the button to save that collection of images as a cluster.
6) Repeat 3-5 for each other class label in your dataset.
7) Save the updated hotspots and move user_hotspots.json to output/data/hotspots/user_hotspots.json.

NB: For this to work, you should add a column named category to your metadata file that's identical to your label column.
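Duplicating the label column into a category column can be done with a few lines of standard-library Python. This is a sketch under the assumption that the metadata file has a `label` column as described earlier in the thread; `add_category_column` is a hypothetical helper, not a pixplot function.

```python
import csv
import io


def add_category_column(source, target, label_column="label", category_column="category"):
    """Copy a metadata CSV, adding a category column equal to the label column."""
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames or [])
    if category_column not in fieldnames:
        fieldnames.append(category_column)
    writer = csv.DictWriter(target, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[category_column] = row[label_column]
        writer.writerow(row)


# In-memory example; with real files, open them with newline=""
source = io.StringIO("filename,label\n10.png,0\n7812.png,1\n")
target = io.StringIO()
add_category_column(source, target)
print(target.getvalue())
```

Any other columns already in the file are carried over unchanged.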

If you try this out, I'd be keen to hear your thoughts on the hotspot editing workflow!

duhaime commented 3 years ago

@sachinsharma9780 I'm going to close this issue out but if the above workflow doesn't work for you or if you hit other errors or come across other enhancement ideas, please feel free to raise a new issue!