Dear @cdongxian,
To add your own dataset, you have to implement the dataset initialization pipeline in the `get_datasets` function: https://github.com/mlbio-epfl/turtle/blob/9b8bbb760224aed4e468363a3fc9ce43a388b6ee/dataset_preparation/data_utils.py#L70

Also, don't forget to specify the number of classes of your newly added dataset in the `datasets_to_c` dictionary. You can specify the ground-truth number or, in case you don't know it, a meaningful estimate of the number of clusters in your dataset: https://github.com/mlbio-epfl/turtle/blob/9b8bbb760224aed4e468363a3fc9ce43a388b6ee/utils.py#L99

After that, follow the README instructions in the repo to precompute representations. If you have ground-truth labels for your dataset, then also use `precompute_labels.py` to precompute the labels for evaluation purposes in the `run_turtle.py` script. If you don't have them, follow the solutions to similar issues, e.g. https://github.com/mlbio-epfl/turtle/issues/1 and https://github.com/mlbio-epfl/turtle/issues/5.

Once everything above is prepared, you can run training with the `run_turtle.py` script, specifying your dataset on the command line.

Let me know if that resolves your issue.
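For illustration, here is a minimal sketch of the first two changes (the new branch in `get_datasets` and the new entry in `datasets_to_c`). The dataset name `my_dataset`, the placeholder loader, and the return signature below are assumptions for the example, not the repo's actual code — mirror the real `get_datasets` in `dataset_preparation/data_utils.py` and `datasets_to_c` in `utils.py` instead:

```python
# Hypothetical sketch -- adapt to the actual function signatures in the repo.

datasets_to_c = {
    "cifar10": 10,
    "cifar100": 100,
    # Your dataset: the ground-truth number of classes, or a reasonable
    # estimate of the number of clusters if the ground truth is unknown.
    "my_dataset": 7,
}

def load_my_dataset(root):
    """Placeholder loader -- replace with your real data-loading logic,
    e.g. reading an image-folder layout from `root`."""
    train = [("img_0.png", 0), ("img_1.png", 1)]
    val = [("img_2.png", 0)]
    return train, val

def get_datasets(dataset_name, root="./data"):
    """Return (train_set, val_set) for the requested dataset."""
    if dataset_name == "my_dataset":
        return load_my_dataset(root)
    raise ValueError(f"Unknown dataset: {dataset_name}")
```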
Best, Artyom