sanderlab / CellBox

CellBox: Interpretable Machine Learning for Perturbation Biology
MIT License

Inconsistent drug indexing in loo_label.csv, expr_index.txt, and --drug_index argument #48

Open Mustardburger opened 1 year ago

Mustardburger commented 1 year ago

Can you provide more information about what each row and index in loo_label.csv and expr_index.txt represents? I believe each row is the label of a drug perturbation, because the rows in loo_label.csv correspond one-to-one to the rows in pert.csv and expr.csv, but I cannot tell what the numeric indices in loo_label.csv represent.

From the paper, there are 12 drugs being tested, so I take the --drug_index argument to refer to the drug that is left out during training. I would assume that, for example, when I run python scripts/main.py -config=configs/Example.leave_one_out.json --drug_index 12, all the rows in pert.csv that belong to the drug at index 12 (as indicated in loo_label.csv) are left out of the training set. On closer inspection, however, testidx (defined in dataset.py) contains indices that point to rows in loo_label.csv that have the number 9. Similarly, setting --drug_index 11 points to rows with the number 8, and so on. Setting --drug_index to any value from 0 to 7, on the other hand, points correctly to the rows in loo_label.csv that have that number.
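
The check described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not CellBox code: the helper name labels_in_selection, the toy label array, and the toy index list are all hypothetical, and it assumes one integer label per row, which may not match the real layout of loo_label.csv.

```python
import numpy as np


def labels_in_selection(labels, selected_rows):
    """Return the sorted distinct label values found in the selected rows.

    Hypothetical helper: `labels` stands in for the contents of loo_label.csv
    (assumed here to be one integer per row) and `selected_rows` stands in
    for the row indices held in testidx.
    """
    labels = np.asarray(labels)
    rows = np.asarray(selected_rows, dtype=int)
    return sorted(np.unique(labels[rows]).tolist())


# Synthetic stand-in data, not taken from the repository.
toy_labels = np.array([0, 0, 1, 2, 2, 9, 9, 8])
toy_testidx = [5, 6]

print(labels_in_selection(toy_labels, toy_testidx))  # [9]
```

Running the same helper for every --drug_index value would give the full mapping between the argument and the labels it actually selects.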

Can you confirm whether this is expected behaviour? This is important for testing my PyTorch dataloader, so that I can confirm it fetches the same rows in pert.csv as the current TensorFlow dataloader.
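
One way to phrase that comparison, as a sketch only: collect the row indices each loader holds out and diff the two sets. The function name compare_row_selections and the example index lists are hypothetical, and how the indices are extracted from either loader is left out because it depends on the implementation.

```python
def compare_row_selections(reference_rows, candidate_rows):
    """Diff two collections of row indices.

    Hypothetical helper: `reference_rows` would come from the existing
    loader and `candidate_rows` from the new one.
    """
    ref = set(map(int, reference_rows))
    cand = set(map(int, candidate_rows))
    return {
        "missing": sorted(ref - cand),  # in the reference but not the candidate
        "extra": sorted(cand - ref),    # in the candidate but not the reference
    }


# Synthetic example indices, not taken from the repository.
report = compare_row_selections([3, 4, 10], [3, 10, 11])
print(report)  # {'missing': [4], 'extra': [11]}
```

Two empty lists mean both loaders selected exactly the same rows.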