mlbio-epfl / turtle

[ICML 2024] Let Go of Your Labels with Unsupervised Transfer
https://brbiclab.epfl.ch/projects/turtle/

How to train on my own dataset? #10

Open cdongxian opened 2 weeks ago

agadetsky commented 2 weeks ago

Dear @cdongxian,

To add your own dataset, you have to implement its initialization pipeline in the get_datasets function: https://github.com/mlbio-epfl/turtle/blob/9b8bbb760224aed4e468363a3fc9ce43a388b6ee/dataset_preparation/data_utils.py#L70
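
As a rough illustration of that step, here is a minimal sketch of what a new branch in a get_datasets-style dispatch function might look like. The signature `get_datasets(dataset, root, transform=None)`, the name `"my_dataset"`, the `train`/`val` folder layout, and the `MyImageDataset` class are all assumptions made for this example, not the repository's actual code, so adapt them to what you find at the linked line.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


class MyImageDataset:
    """Hypothetical map-style dataset over root/<class_name>/<image> folders."""

    def __init__(self, root, transform=None):
        self.transform = transform
        root = Path(root)
        # Sorted class folders give a stable class-name -> index mapping.
        self.classes = sorted(p.name for p in root.iterdir() if p.is_dir())
        self.samples = [
            (path, label)
            for label, name in enumerate(self.classes)
            for path in sorted((root / name).iterdir())
            if path.suffix.lower() in IMAGE_EXTENSIONS
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        # Assumption: the transform loads and preprocesses the file at `path`.
        # Without a transform, the raw path is returned.
        item = self.transform(path) if self.transform else path
        return item, label


def get_datasets(dataset, root, transform=None):
    """Hypothetical dispatch: return (train, val) datasets for a dataset name."""
    if dataset == "my_dataset":
        base = Path(root) / "my_dataset"
        train = MyImageDataset(base / "train", transform)
        val = MyImageDataset(base / "val", transform)
        return train, val
    raise ValueError(f"Unknown dataset: {dataset}")
```

If your images already follow a one-folder-per-class layout, torchvision's `ImageFolder` would likely be a simpler choice than a hand-written class; the stdlib-only version above is only meant to show the shape of the branch.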

Also, don't forget to specify the number of classes for your newly added dataset in the datasets_to_c dictionary. You can use the ground-truth number or, if you don't know it, a meaningful estimate of the number of clusters in your dataset: https://github.com/mlbio-epfl/turtle/blob/9b8bbb760224aed4e468363a3fc9ce43a388b6ee/utils.py#L99
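
For illustration only, registering the class count could look like the sketch below. The dictionary here is an empty stand-in for the real datasets_to_c in utils.py, and both the key `"my_dataset"` and the value `7` are made-up placeholders; the key presumably has to match the dataset name you pass on the command line.

```python
# Stand-in for the datasets_to_c dictionary in utils.py (the real one
# already contains entries for the supported datasets).
datasets_to_c = {}

# Hypothetical entry: dataset name -> number of classes, or a meaningful
# estimate of the number of clusters if the true count is unknown.
datasets_to_c["my_dataset"] = 7


def num_clusters(dataset):
    """Hypothetical helper: look up the cluster count with a clear error."""
    if dataset not in datasets_to_c:
        raise KeyError(f"Add '{dataset}' to datasets_to_c before training")
    return datasets_to_c[dataset]
```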

After that, follow the README instructions in the repo to precompute the representations. If you have ground-truth labels for your dataset, also use precompute_labels.py to precompute the labels, which the run_turtle.py script uses for evaluation. If you don't have them, follow the solutions from similar issues, e.g., https://github.com/mlbio-epfl/turtle/issues/1 and https://github.com/mlbio-epfl/turtle/issues/5.

Once everything above is prepared, you can run training with the run_turtle.py script, specifying your dataset on the command line.

Let me know if that has resolved your issue.

Best, Artyom