mlbio-epfl / turtle

[ICML 2024] Let Go of Your Labels with Unsupervised Transfer
https://brbiclab.epfl.ch/projects/turtle/

number of classes value when training without labels? #9

Open · easternbun opened this issue 1 month ago

easternbun commented 1 month ago

Awesome work!!!! I am studying your paper and code for my project. I have a few questions I hope you can help me with:

I have unlabeled images for training and testing. What value should I set for the number of classes of the task_encoder?

1st follow-up: I tested it by arbitrarily setting the number to 100, and the clustering result is not good: basically, every image used in testing ends up in its own separate class. The result only gets better when the C value is set close to the actual number of classes.

What I did:

  1. Took 6 classes from the caltech101 dataset and put all of their images under my own dataset folder.
  2. Altered data_utils.py to read only images x and no labels y (see the sketch after this list). Ran precompute_representations.py with a single representation space on my dataset.
  3. Skipped the precompute_labels step and altered run_turtle.py by commenting out the label-accuracy calculations. Set the number_of_classes (C) value to 100.
  4. Trained with run_turtle.py and evaluated. I expected the same output for similar images.
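
For concreteness, here is a minimal sketch of the kind of label-free loader I mean in step 2; the class name and file-extension filter are illustrative assumptions, not the actual data_utils.py interface:

```python
# Hypothetical label-free dataset: yields only images x, no labels y.
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class UnlabeledImageFolder(Dataset):
    def __init__(self, root, transform=None):
        # Collect all image files under the dataset folder, ignoring class subfolders.
        self.paths = sorted(p for p in Path(root).rglob("*")
                            if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        x = Image.open(self.paths[idx]).convert("RGB")
        if self.transform is not None:
            x = self.transform(x)
        return x  # no label returned
```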

2nd follow-up: I trained on a dataset put together from internet images of mechanical objects, all labeled by placing them under different class folders. I ran the code (original GitHub version), and the accuracy could not surpass 0.28 no matter the size of the dataset.

agadetsky commented 1 month ago

Dear @easternbun,

As you noticed, the number of clusters should be set as close as possible to the ground-truth number of clusters. If you don't know the exact number, you should provide a meaningful guess or use some approach to estimate it from the data. Clustering is, to some extent, an ambiguous problem, so the number of clusters has to be defined by the user.
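
For example, one generic way to get such an estimate from the precomputed representations is a silhouette sweep over candidate values of C. This is a scikit-learn sketch, not part of the TURTLE codebase, and the candidate range is an arbitrary assumption:

```python
# Generic sketch (not TURTLE code): pick the candidate number of clusters C
# that maximizes the silhouette score of a k-means clustering of features Z.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def estimate_num_clusters(Z, candidates=range(2, 21), seed=0):
    best_C, best_score = None, -1.0
    for C in candidates:
        labels = KMeans(n_clusters=C, n_init=10, random_state=seed).fit_predict(Z)
        score = silhouette_score(Z, labels)  # higher = better-separated clusters
        if score > best_score:
            best_C, best_score = C, score
    return best_C
```

The returned value is only a heuristic, so it is still worth trying a few values of C around it.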

As a result, it is expected that when you run TURTLE to find 100 clusters on a subset of caltech101 that contains only 6 classes, and then evaluate the quality of the found clusters against the original caltech101 labeling, you will obtain bad accuracy. More generally, if you tweak one of the datasets in our codebase, e.g., take a subset of it as you did with caltech101, don't forget to change the ground-truth number of clusters accordingly for proper evaluation.

Regarding your second question, there might be many reasons: (1) low-quality representations for this particular dataset; (2) inappropriate hyperparameters; etc. As a sanity check, I would first try training a linear model on top of the representations you are using, fitting the ground-truth labeling. You can use baselines/linear_probe.py for that. If it gives you meaningful quality, then you can play with the hyperparameters of TURTLE. However, it is unclear to me why you would even need a clustering algorithm if the ground-truth labeling is available for your particular application.
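
In case it helps, a rough stand-in for that sanity check looks like the sketch below; the repo's own script is baselines/linear_probe.py, and this scikit-learn version only illustrates the idea:

```python
# Illustrative sanity check (the actual script is baselines/linear_probe.py):
# fit a linear classifier on precomputed representations Z with labels y.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def linear_probe_accuracy(Z, y, seed=0):
    Z_tr, Z_te, y_tr, y_te = train_test_split(
        Z, y, test_size=0.2, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
    return clf.score(Z_te, y_te)  # low accuracy points to weak representations
```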

Let me know if my response has resolved your issues.

Best, Artyom