openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics

[Task proposal] Multi-omics manifold mapping #281


gjhuizing commented 3 years ago

Describe the problem concisely. The current multimodal data integration task focuses on aligning different datasets profiled from different cells. It is evaluated by separating data coming from the same cells and checking whether the alignment recovers the true cell-to-cell matches.
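As a rough sketch of this kind of match-based scoring (illustrative only, in the spirit of FOSCTTM; the function name and toy data are made up, this is not the task's actual metric):

```python
# Sketch of a match-based score: for each cell in modality 1, count how many
# cells in modality 2 are closer than its true match (FOSCTTM-style).
# Illustrative only; not the metric implemented in the current task.
import numpy as np
from scipy.spatial.distance import cdist

def fraction_closer_than_true_match(emb1: np.ndarray, emb2: np.ndarray) -> float:
    """emb1[i] and emb2[i] are assumed to come from the same cell."""
    dists = cdist(emb1, emb2)                      # cross-modality pairwise distances
    true = np.diag(dists)                          # distance to the true match
    closer = (dists < true[:, None]).sum(axis=1)   # rivals closer than the true match
    return float(closer.mean() / (emb2.shape[0] - 1))  # 0 = perfect, ~0.5 = random

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 10))
print(fraction_closer_than_true_match(x, x + 0.1 * rng.normal(size=x.shape)))
```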

However, some newer methods perform integration differently: they take both modalities into account and map the cells to a joint latent space, where they can be visualized and clustered.
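For example, a very naive baseline in this spirit (just a sketch for illustration, not one of the methods to benchmark) could reduce each modality separately and concatenate the results into one joint latent space:

```python
# Naive joint-embedding baseline: scale and PCA each modality, then concatenate.
# Assumes all modalities contain the same cells in the same row order.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def naive_joint_embedding(modalities, n_comps=20):
    parts = []
    for x in modalities:
        x = StandardScaler().fit_transform(x)                     # comparable feature scales
        parts.append(PCA(n_components=min(n_comps, x.shape[1])).fit_transform(x))
    return np.concatenate(parts, axis=1)                          # joint latent space

rng = np.random.default_rng(0)
rna = rng.poisson(1.0, (500, 200)).astype(float)
atac = rng.poisson(0.5, (500, 300)).astype(float)
print(naive_joint_embedding([rna, atac]).shape)  # (500, 40)
```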

Propose datasets Any multi-omics dataset with cell type / cell line annotations.

Propose methods

Propose metrics Given the original labels (cell types/cell lines):

Question If I want to implement this, should I create a new task with a different name (something like same-cell-multiomics-integration?), or create a subtask? I didn't find much documentation on subtasks, except the last 20 seconds of the video tutorial on creating tasks, so I'm not sure how to do that.

LuckyMD commented 3 years ago

Hi @gjhuizing,

I would create a new task with a similar name. We can then make the folder structure match the subtask definition.

A few considerations here:

  1. Check out this PR on batch integration that @danielStrobl and I have been working on. I imagine there will be a lot of overlap in methods and metrics with this task, especially for the embedding version of it (a further subtask). This might be something to work on together.
  2. This type of task will rely heavily on the quality of the annotations for the ground-truth data. Is the ground truth sufficiently reliable in the datasets you have proposed?
  3. What output would you require from a method? Always a joint embedding? What about graph outputs like those of BBKNN or CONOS? Would you also include those, and can they be applied to multi-modal integration?
  4. Does this task need to be specific to two modalities, or would you make it general? If general, you would need a good way to convey what type of data is provided in the data loaders.
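As a rough sketch of point 4 (the `mode2` and `modalities` keys below are placeholders for illustration, not the repository's actual loader API), a loader could record which omics layer each matrix is, so that methods can dispatch on it:

```python
# Hypothetical loader convention: first modality in .X, extra modalities in .obsm,
# with the omics type of each layer recorded in .uns. Keys are illustrative only.
import anndata as ad
import numpy as np

def load_toy_multiome() -> ad.AnnData:
    rng = np.random.default_rng(0)
    adata = ad.AnnData(X=rng.poisson(1.0, (100, 50)).astype(np.float32))
    adata.obsm["mode2"] = rng.poisson(0.5, (100, 80)).astype(np.float32)
    adata.uns["modalities"] = {"X": "rna", "mode2": "atac"}       # hypothetical key
    adata.obs["cell_type"] = rng.choice(["A", "B", "C"], size=adata.n_obs)
    return adata

print(load_toy_multiome().uns["modalities"])
```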
gjhuizing commented 3 years ago

Hi @LuckyMD, thanks for your answer. I'm creating a new task then!

  1. In terms of methods, maybe LIGER? If methods and metrics can be reused in both tasks, that would be great.
  2. The quality of the annotations is critical indeed. If I'm not mistaken, cell lines offer a perfect ground truth and are often used to benchmark integration methods. However, they are a bit too easy a problem, so if there are more realistic datasets out there with a reliable ground truth, that would be good.
  3. I'm not that familiar with methods that output graphs rather than embeddings, but I suppose we could have some common metrics. Maybe the silhouette score could be computed on graphs (through geodesic distances? see the sketch after this list). What I wanted to avoid is evaluating the clustering step, in order to focus on the quality of the integration itself. But maybe that's not the best approach.
  4. Indeed! Ideally we should be able to handle more than two omics, and to specify to the method which omics they are.
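Here is a rough sketch of the geodesic-silhouette idea from point 3 (illustrative only, not an established metric): build a kNN graph, take shortest-path distances, and feed them to the silhouette score as a precomputed metric.

```python
# Silhouette on graph outputs via geodesic (shortest-path) distances on a kNN graph.
# Illustrative sketch only; assumes the graph is connected.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.metrics import silhouette_score
from sklearn.neighbors import kneighbors_graph

def geodesic_silhouette(embedding, labels, n_neighbors=15):
    graph = kneighbors_graph(embedding, n_neighbors=n_neighbors, mode="distance")
    geo = shortest_path(graph, method="D", directed=False)        # geodesic distances
    if np.isinf(geo).any():
        raise ValueError("kNN graph is disconnected; increase n_neighbors")
    return float(silhouette_score(geo, labels, metric="precomputed"))

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 2))               # toy embedding
labels = (emb[:, 0] > 0).astype(int)          # toy labels split along one axis
print(geodesic_silhouette(emb, labels))
```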
LuckyMD commented 3 years ago

> The quality of the annotations is critical indeed. If I'm not mistaken, cell lines offer a perfect ground truth and are often used to benchmark integration methods. However, they are a bit too easy a problem, so if there are more realistic datasets out there with a reliable ground truth, that would be good.

This is exactly my thought as well ;). Cell lines are a good start for sure, though!

> Indeed! Ideally we should be able to handle more than two omics, and to specify to the method which omics they are.

If you do this, you will require every method to be applicable to every omics layer. I would recommend making the task more specific first, and then thinking about sharing methods between further sibling tasks.

> I'm not that familiar with methods that output graphs rather than embeddings, but I suppose we could have some common metrics. Maybe the silhouette score could be computed on graphs (through geodesic distances?). What I wanted to avoid is evaluating the clustering step, in order to focus on the quality of the integration itself. But maybe that's not the best approach.

We have a preprint on data integration benchmarking that deals with a lot of these issues for a single modality (https://www.biorxiv.org/content/10.1101/2020.05.22.111161v2). These metrics are all being added in the PR I linked above. Maybe it would be good to chat about this in a voice channel on batch integration?

gjhuizing commented 3 years ago

Definitely, sending you a message on Discord!