openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics

[Label projection] Dataset zebrafish_labs does not have equal distribution of cell types per lab #772

Closed: mxposed closed this issue 1 month ago

mxposed commented 1 year ago

See attachment:

[Screenshot from 2023-01-06: per-lab cell type counts for zebrafish_labs, with several cell types at zero in one of the labs]

I propose to remove the zebrafish_labs dataset altogether (and then redo the whole zebrafish dataset from scratch).

LuckyMD commented 1 year ago

Why would you remove the dataset? This is a realistic scenario, no? Or do you think the labels weren't harmonized correctly across the two labs?

mxposed commented 1 year ago

Yes, I suspect that the labels are lab-specific, as you can see from the zeroes above. Alternatively, our subsampling of the zebrafish data did not stratify by labels.
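
For reference, the table above can be reproduced with something like this sketch (the `celltype` and `lab` column names in `obs`, and the file path, are assumptions):

```python
import anndata as ad
import pandas as pd

adata = ad.read_h5ad("zebrafish_labs.h5ad")  # hypothetical path

# Cells per (cell type, lab); zero entries flag labels present in only one lab
counts = pd.crosstab(adata.obs["celltype"], adata.obs["lab"])
print(counts[(counts == 0).any(axis=1)])
```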

LuckyMD commented 1 year ago

Looking at the labels it seems to me that the non-shared labels are non-shared cell types, which would be fine, no? I would only redo this dataset if you find that there are two cell type labels that should be the same, but are currently not.

Stratified subsampling would probably be important.
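
A minimal sketch of what that could look like with scikit-learn (the `celltype` column, the file path, and the 26k target size are assumptions):

```python
import anndata as ad
from sklearn.model_selection import train_test_split

adata = ad.read_h5ad("zebrafish.h5ad")  # hypothetical path

# Subsample while preserving the per-label proportions of the full dataset
keep, _ = train_test_split(
    adata.obs_names.to_numpy(),
    train_size=26_000,
    stratify=adata.obs["celltype"],
    random_state=0,
)
adata_sub = adata[keep].copy()
```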

mxposed commented 1 year ago

In our version of the dataset there are only 26k cells, out of the 90k mentioned in the paper. This subsetting was done separately and I don't have the code for it, so I cannot check whether the original data had those cell type labels for both labs and the subsampling is at fault, or whether cell type labels were not harmonized across labs, or whether the labs profiled different stages, etc.

In any case, I think the current version of the task assumes that the label set is the same between train & test. I want to add a separate subtask where this assumption is broken, i.e. predicting cells with “unseen” labels, but we're not there yet. So for me, this dataset currently does not contribute useful information about method performance.
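
As a minimal check of that assumption (hypothetical variable and column names):

```python
def check_label_sets(adata_train, adata_test, key="celltype"):
    """Fail if the test split contains labels never seen in training."""
    unseen = set(adata_test.obs[key]) - set(adata_train.obs[key])
    assert not unseen, f"Labels in test but not in train: {unseen}"
```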

LuckyMD commented 1 year ago

I see... Maybe @dburkhardt can give some insight here, as he added this dataset and did the preprocessing, I believe. I think it might be used in the MELD paper (not 100% sure though).

scottgigante-immunai commented 1 year ago

> In our version of the dataset there are only 26k cells, out of the 90k mentioned in the paper

This is a major concern.

> I want to add a separate subtask where this assumption is broken, i.e. predicting cells with “unseen” labels, but we're not there yet

Once we add that task I think we can move this dataset from here to there. Until then I think we should leave it, as it is basically the only nontrivial dataset in this task right now.

dburkhardt commented 1 year ago

Ugh, you're asking me about data I haven't looked at in 4 years and for which I don't have the processing scripts...

@mxposed when you say "90k mentioned in the paper" -- which paper? In MELD I didn't use the time course data; I used perturbation data, which was totally separate.

I did think there were more cells in the time courses, but maybe I'm just getting thrown by the increase in cells in recent papers.

mxposed commented 1 year ago

This is the paper currently referenced in the code: https://www.science.org/doi/10.1126/science.aar4362. The abstract says 90k, and 90k was mentioned in the dataset description too.

dburkhardt commented 1 year ago

To be honest, I don't really know what to say here. It might not be the worst thing to re-process the data using STARsolo for both datasets. The issue is that the GTF references in the two papers are different: one uses the ZFIN gene annotation, the other Ensembl. As a result, you can't compare the features directly.
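
For concreteness, re-alignment against one shared reference could look roughly like this (a sketch only; the paths, the choice of an Ensembl index, and the barcode chemistry settings are illustrative, not the original pipeline):

```python
import subprocess

# Align both labs' FASTQs to the same genome/annotation so features match.
# Barcode/UMI layout flags would need to match each lab's actual chemistry.
subprocess.run(
    [
        "STAR",
        "--genomeDir", "ref/danio_rerio_ensembl",  # shared index (assumed)
        "--soloType", "CB_UMI_Simple",
        "--soloCBwhitelist", "barcodes.txt",  # hypothetical whitelist
        "--readFilesIn", "cdna_R2.fastq.gz", "barcode_R1.fastq.gz",
        "--readFilesCommand", "zcat",
        "--soloFeatures", "Gene",
    ],
    check=True,
)
```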

mxposed commented 1 year ago

Thank you, I didn't realize that. I think reprocessing is the right thing to do.

dburkhardt commented 1 year ago

Sorry, just edited that to be clearer. In the published papers, the Klein and Regev labs used different references for their processed data files. I downloaded the raw fastqs and re-aligned them to a single genome using STARsolo.

If you do undertake this, I recommend saving the scripts :)

github-actions[bot] commented 1 month ago

This issue has been automatically closed because it has not had recent activity.