openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics

Cross-species label projection #112

Open olgabot opened 3 years ago

olgabot commented 3 years ago

Describe the problem concisely.

Many biological problems are first studied in nonhuman animals, and those cell types are often carefully characterized in e.g. mouse. We would like to take the cell type labels from a mouse dataset and project them onto another species, such as human.

Propose datasets

We have datasets for both species, subset down to 1:1 orthologous genes across the two species, with a unified "narrow group" of cell type labels shared across the species.
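
A minimal sketch of this preparation step, assuming the data are AnnData objects and that a 1:1 ortholog table is available (the column names `mouse_gene` and `human_gene` are hypothetical; such a table could come from e.g. ENSEMBL BioMart):

```python
import anndata as ad
import pandas as pd


def subset_to_one_to_one_orthologs(
    mouse: ad.AnnData, human: ad.AnnData, orthologs: pd.DataFrame
):
    """Restrict both datasets to 1:1 orthologous genes on a shared feature space."""
    # Keep only ortholog pairs where both genes are present in the data.
    present = orthologs["mouse_gene"].isin(mouse.var_names) & orthologs[
        "human_gene"
    ].isin(human.var_names)
    pairs = orthologs.loc[present]

    # Subset both objects and rename mouse genes to their human orthologs
    # so the two expression matrices share a common feature space.
    mouse_sub = mouse[:, pairs["mouse_gene"].values].copy()
    human_sub = human[:, pairs["human_gene"].values].copy()
    mouse_sub.var_names = pairs["human_gene"].values
    return mouse_sub, human_sub
```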

Propose methods

Existing batch correction or dataset alignment methods can work here.
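
As one hedged example of this kind of approach, scanpy's `ingest` aligns a query dataset to an embedding fit on a reference and transfers `obs` annotations, which could serve as a simple label-projection baseline. This sketch assumes both datasets have already been subset to the shared 1:1 ortholog feature space as above:

```python
import scanpy as sc


def project_labels_with_ingest(reference, query, label_key="cell_type"):
    """Map `query` onto an embedding fit on `reference` and transfer labels."""
    # Fit PCA, neighbors, and UMAP on the reference (e.g. mouse) only.
    sc.pp.pca(reference)
    sc.pp.neighbors(reference)
    sc.tl.umap(reference)

    # Project the query (e.g. human) into that embedding and copy the labels.
    sc.tl.ingest(query, reference, obs=label_key)
    return query.obs[label_key]
```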

Propose metrics

Metrics for how accurately the labels are predicted.
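
For instance (a sketch only, not a committed metric set), assuming the curated ground-truth labels and the projected labels are available as arrays:

```python
from sklearn.metrics import accuracy_score, f1_score


def score_projected_labels(true_labels, predicted_labels):
    """Basic agreement metrics between curated and projected cell type labels."""
    return {
        "accuracy": accuracy_score(true_labels, predicted_labels),
        # Macro-averaged F1 weights rare cell types equally with common ones.
        "f1_macro": f1_score(true_labels, predicted_labels, average="macro"),
    }
```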

LuckyMD commented 3 years ago

Do you think this should be separated from the label projection task that already exists, or would you consider this as a dataset extension of the label projection task?

The methods will fit nicely with the data integration task that we still plan to add once my revisions are through ;).

olgabot commented 3 years ago

Hmm, my opinion is that it is a separate problem, because I think different tools will excel in the cross-species setting than in the cross-dataset setting. Another consideration is that I made the arbitrary choice of using only 1:1 orthologs, and some tool developers may prefer e.g. orthogroups or some clever way of combining 1:many and many:many orthologs. What do you think?

scottgigante commented 3 years ago

The 1:many or many:many case is an interesting and difficult question for defining the API here. We could choose to define it as each cell belonging to between zero and n classes, which would allow for many-to-many classification.
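
To make that concrete, here is one way such an API could be encoded (a sketch with made-up labels): each cell carries a set of zero or more classes, which can be turned into a binary indicator matrix for scoring.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical per-cell label sets: between zero and n classes each.
labels_per_cell = [
    {"T cell"},                  # a clean one-to-one assignment
    {"monocyte", "macrophage"},  # an ambiguous many-to-many assignment
    set(),                       # no confident label
]

mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(labels_per_cell)  # shape: (n_cells, n_classes)
print(mlb.classes_)
print(indicator)
```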

LuckyMD commented 3 years ago

I would say that orthologs/orthogroups are a question of defining a dataset for label propagation. In the end you will choose which subset of genes/features you make available to the method, no? In that case I would just run all the methods we have for label projection on this as well. I think it would actually be an interesting result if the method performance is different for the cross-species case. Maybe we need to start thinking of sub-task definitions based on sets of datasets here.

On another note, if the methods are MNN and BBKNN then it sounds like you are only trying to integrate the datasets rather than project labels. Or am I missing something where you intend to project labels based on an integration? This would be the type of task that I will be adding for data integration based on this paper. Would be great to work on that together :).
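
As an illustration of "projecting labels based on an integration" (a sketch assuming some integration method has already written a joint embedding to a hypothetical `obsm` key `"X_integrated"`), labels can be transferred from reference to query cells with a kNN classifier in that embedding:

```python
from sklearn.neighbors import KNeighborsClassifier


def transfer_labels_from_embedding(
    adata,
    batch_key="species",
    reference_batch="mouse",
    label_key="cell_type",
    embedding_key="X_integrated",
    n_neighbors=15,
):
    """Predict labels for non-reference cells from their integrated neighbours."""
    is_ref = (adata.obs[batch_key] == reference_batch).values
    embedding = adata.obsm[embedding_key]

    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(embedding[is_ref], adata.obs.loc[is_ref, label_key])
    return knn.predict(embedding[~is_ref])
```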