Closed by mxposed 1 month ago
Why would you remove the dataset? This is a realistic scenario, no? Or do you think the labels weren't harmonized correctly across the two labs?
Yes, I suspect that the labels are lab-specific, as you can see from the zeroes above. Alternatively, our subsampling of the zebrafish data did not stratify by labels.
Looking at the labels it seems to me that the non-shared labels are non-shared cell types, which would be fine, no? I would only redo this dataset if you find that there are two cell type labels that should be the same, but are currently not.
Stratified subsampling would probably be important.
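For reference, stratified subsampling here would mean sampling within each cell-type label rather than uniformly across all cells. A minimal sketch, assuming a pandas Series of labels (this is illustrative, not the repo's actual subsampling code):

```python
# Hypothetical sketch (not the repo's actual code): subsample cells within
# each cell-type label so that rare labels survive the downsampling.
import pandas as pd

def stratified_subsample(labels: pd.Series, frac: float, seed: int = 0) -> pd.Index:
    """Keep `frac` of the cells within every label group; return the kept index."""
    kept = labels.groupby(labels).sample(frac=frac, random_state=seed)
    return kept.index

# 90 common cells and 10 rare ones: a naive 30% sample could drop the rare type
labels = pd.Series(["neuron"] * 90 + ["rare_type"] * 10)
idx = stratified_subsample(labels, frac=0.3)
# every label keeps ~30% of its cells, so rare_type is never dropped entirely
```

With uniform subsampling at 26k out of 90k cells, a rare label could easily lose all of its cells in one lab, which would produce exactly the kind of lab-specific zeroes discussed above.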
In our version of the dataset there are only 26k cells, out of 90k mentioned in the paper. This subsetting was done separately and I don't have code for that, so I cannot check if the original data had those cell type labels for both labs and it's the subsampling at fault, or if cell type labels were also not harmonized across labs, or if labs profiled different stages etc.
In any case, I think the current version of the task implies that the label set is the same between train & test. I want to add a separate subtask where this assumption is broken (predicting "unseen" cell types), but we're not there yet. So for now this dataset does not contribute useful information about method performance.
I see... maybe @dburkhardt can give some insight here, as he added this dataset and, I believe, did the preprocessing. I think it might have been used in the MELD paper (not 100% sure though).
> In our version of the dataset there are only 26k cells, out of 90k mentioned in the paper
This is a major concern.
> I want to add a separate subtask where this assumption is broken—predicting "unseen" cells, but not there yet
Once we add that task I think we can move this dataset from here to there. Until then I think we should leave it, as it is basically the only nontrivial dataset in this task right now.
Ugh you're asking me about data I haven't looked at in 4 years and for which I don't have the processing scripts...
@mxposed when you say "90k mentioned in the paper" -- which paper? In MELD I didn't use the time course data, I used perturbation data, which was totally separate.
I did think there were more cells in the time courses, but maybe I'm just getting thrown by the increase in cells in recent papers.
This is the paper that is currently referenced in the code: https://www.science.org/doi/10.1126/science.aar4362. The abstract says 90k cells, and 90k was mentioned in the dataset description too.
To be honest, I don't really know what to say here. It might not be the worst thing to re-process the data using STARsolo for both datasets. The issue is that the GTF references in each paper are different: one uses the ZFIN gene annotation, the other Ensembl. As a result, you can't compare the features directly.
Thank you, I didn't realize that. I think reprocessing will be the right thing to do.
Sorry just edited that to be clear. In the published paper, the Klein and Regev labs used different references in their published processed data files. I downloaded the raw fastqs and re-aligned to a single genome using STARsolo.
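The re-alignment described above could look roughly like the following. This is a hedged sketch only: all paths, the reference files, the barcode whitelist, and the chemistry flags are placeholders, not the actual (lost) processing script.

```shell
# Hypothetical sketch of re-aligning both labs' FASTQs to one shared
# reference with STARsolo, so gene features become directly comparable.
# All file names below are placeholders.

# 1) Build a single genome index from one FASTA + GTF pair (e.g. Ensembl)
STAR --runMode genomeGenerate \
     --genomeDir index/ \
     --genomeFastaFiles GRCz11.fa \
     --sjdbGTFfile GRCz11.ensembl.gtf

# 2) Align each lab's raw reads against that same index
STAR --soloType CB_UMI_Simple \
     --genomeDir index/ \
     --readFilesIn lab1_R2.fastq.gz lab1_R1.fastq.gz \
     --readFilesCommand zcat \
     --soloCBwhitelist barcodes.txt \
     --outSAMtype BAM Unsorted
```

The key point is step 1: because both labs' FASTQs go through the same index (one genome, one GTF), the resulting count matrices share a feature space, which the published processed files do not.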
If you do undertake this, I recommend saving the scripts :)
This issue has been automatically closed because it has not had recent activity.
See attachment:
I propose to remove the zebrafish_labs dataset altogether (and then redo the whole zebrafish dataset from scratch)