openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License

Spatial decomposition task #233

Closed vitkl closed 1 year ago

vitkl commented 3 years ago

Describe the problem concisely.

Determining abundance of cell types (or expression of NMF programmes, etc) across spatial locations by integrating single cell or single nucleus RNA-seq and spatial transcriptomics data (Visium, Slide-Seq, NanostringWTA, ISS, etc). For example, as presented in our paper (https://www.biorxiv.org/content/10.1101/2020.11.15.378125v1v).

Propose datasets

Propose methods

Cell2location: https://github.com/BayraktarLab/cell2location
SPOTlight: https://github.com/MarcElosua/SPOTlight
RCTD: https://github.com/dmcable/RCTD
Stereoscope: https://github.com/YosefLab/scvi-tools/tree/master/scvi/external/stereoscope
Seurat V3 anchor method: https://satijalab.org/seurat/articles/install.html

Propose metrics

LuckyMD commented 3 years ago

This sounds like a great idea! I think @giovp was also interested in this task.

To facilitate defining an API, it might be good to pin the exact task description down a little bit more. Based on your datasets and proposed metrics I guess this might entail the cell type deconvolution of spatial spots? Should it then maybe be called "spatial spot deconvolution"? You can always introduce more tasks or make this a sub-task of a larger task if you think there is more to this problem.

vitkl commented 3 years ago

I agree a more precise description would help. How about "Estimation of cell-type proportions per location/voxel (aka deconvolution)"? I think "spatial spot deconvolution" needs explaining: "spot" is not general language, "deconvolution" is technically incorrect, and neither the RCTD nor the cell2location paper uses this term. Maybe it can be kept for reference as "aka deconvolution".

I see 2 related tasks (what do you think about adding them?):

  1. Estimation of absolute cell abundance per location/voxel. It is important because mRNA counts in spatial transcriptomics technologies are determined by absolute local cell densities (Saiselet et al, https://academic.oup.com/jmcb/article/12/11/906/5861536) and because absolute abundances can be analysed downstream without the challenges of proportional data. However, most methods do not support this.

  2. Estimation of cell-type proportions per sample in bulk RNA-seq. Bulk RNA-seq analysis models can require different considerations because bulk data contains a much larger number of cells per sample and does not have UMIs, and most bulk analysis methods (MuSiC, CIBERSORT, etc) will not scale well to the 1000s of locations used in the spatial benchmark.

Also adding @AlexanderAivazidis to the discussion.

LuckyMD commented 3 years ago

I love the 2 subtasks. I would focus on 1 for now... the second one is more likely to have some kind of ground truth I imagine.

vitkl commented 3 years ago

I would say that there are 3 sub-tasks:

  1. Estimation of relative cell abundance / cell-type proportions per location (supported by all methods)
  2. Estimation of absolute abundance
  3. Estimation of relative cell abundance in bulk RNA-seq

For 3, Bisque method paper (https://www.nature.com/articles/s41467-020-15816-6) indeed has nice ground truth.
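To make the distinction between sub-tasks 1 and 2 concrete, here is a minimal sketch in plain Python. The function name `to_proportions` and the abundance values are hypothetical and purely illustrative; no method in this thread is claimed to work this way. It only shows that proportions can be derived from absolute abundances, but not the other way round.

```python
def to_proportions(abundance):
    """Convert absolute per-location cell abundances to cell-type proportions."""
    proportions = []
    for row in abundance:
        total = sum(row)
        # A location with no cells has undefined proportions; report zeros.
        proportions.append([x / total if total > 0 else 0.0 for x in row])
    return proportions


# Hypothetical abundances: 2 locations x 3 cell types.
# The first location is twice as dense as the second.
abundance = [[8.0, 2.0, 0.0], [4.0, 1.0, 0.0]]
proportions = to_proportions(abundance)
```

Both locations end up with the same proportions even though their cell densities differ by a factor of two, which is the information that only sub-task 2 retains.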

hiraksarkar commented 3 years ago

Hi, I would suggest adding another dimension to the problem: decomposition of cell types when no reference expression data is available. I think distinguishing reference-free and reference-based settings could be helpful.

LuckyMD commented 3 years ago

Estimation of relative cell abundance in bulk RNA-seq

While I agree that the methods may be very similar, I would spin this out to a different task entirely. From a user perspective the situation will be quite different. And there are many bulk deconvolution methods not designed for spatial data (although that doesn't mean they can't be used there, I guess).

vitkl commented 3 years ago

@LuckyMD Many bulk deconvolution methods are not scalable, and most (except Bisque) do not account for technology platform differences. I agree that it is probably better to put that as a separate task and ideally get input from people actually working with this.

@hiraksarkar To me, there are 3 settings:

  1. Fully reference-based. This is considered by this task and performed by all methods considered. The results from these methods are highly interpretable.

  2. Fully reference-free. I would spin this out to a different task entirely - the purpose of which is not entirely clear to me. Reference-free factorisation does not achieve the task of determining locations of cell types. De-novo factorisation of spatial data with either log-normal factor analysis, NMF or VAE is in no way guaranteed to give factors that can be interpreted as cell types. In my experience, cell2location with reference-free factors finds 'expression programmes' that are both spatially restricted and span across cell types - so the factors represent neither cell types nor tissue zones composed of multiple cell types. This can be a separate task but more work needs to be done to define the task. WDYT?

  3. Mixed setting with both reference-based and reference-free factors. This is very challenging because, in addition to the interpretation issue in point 2, reference-free and reference-based factors are non-identifiable unless informative priors/penalisation are used. Determining these priors adds another layer of complexity which no method has been able to address so far. It will likely be easier with high-res technologies. Do you know anyone who had success with this?
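As a rough illustration of setting 1 above (fully reference-based), the sketch below fits non-negative cell-type weights for a single location given fixed reference signatures, using multiplicative updates for non-negative least squares. This is a deliberately simplified stand-in and not the probabilistic model used by cell2location, RCTD, Stereoscope or SPOTlight; the function name `decompose_location` and all numbers are hypothetical.

```python
def decompose_location(counts, signatures, n_iter=200, eps=1e-12):
    """Fit non-negative weights w so that counts[g] ~ sum_k w[k] * signatures[k][g].

    counts:     per-gene counts at one location (length n_genes)
    signatures: reference expression per cell type (n_types x n_genes)
    """
    n_types = len(signatures)
    n_genes = len(counts)
    weights = [1.0] * n_types
    for _ in range(n_iter):
        # Current reconstruction of the location from the reference signatures.
        predicted = [
            sum(weights[k] * signatures[k][g] for k in range(n_types))
            for g in range(n_genes)
        ]
        # Multiplicative update: weights stay non-negative by construction.
        for k in range(n_types):
            numer = sum(signatures[k][g] * counts[g] for g in range(n_genes))
            denom = sum(signatures[k][g] * predicted[g] for g in range(n_genes))
            weights[k] *= numer / (denom + eps)
    return weights


# Hypothetical reference: 2 cell types x 3 genes.
signatures = [[10.0, 0.0, 1.0], [0.0, 10.0, 1.0]]
# Location simulated as 3 cells of type 0 plus 1 cell of type 1.
counts = [30.0, 10.0, 4.0]
weights = decompose_location(counts, signatures)
total = sum(weights)
proportions = [w / total for w in weights]
```

Because the signatures are fixed, every fitted weight maps directly to a named cell type, which is the interpretability argument made in point 1. In a reference-free setting the signatures themselves would have to be learned, and nothing ties the resulting factors to cell types.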

vitkl commented 3 years ago

Another dimension to this problem, that I would add, is how you estimate signatures of cell types from multiple scRNA-seq batches and technologies - accounting for sequencing depth differences, contaminating RNA and gene-specific platform effects within scRNA. We found that it is important to account for strong batch and technology effects (see attached). One can also consider hierarchical annotations across a range of resolutions. A lot more work can be done in that space - which is also related to determining DE genes between cell types.

hiraksarkar commented 3 years ago

Hi @vitkl ,

Great point about the case of mixed semi-supervised information. I have not thought about that case yet, and to my knowledge it has not been addressed in a publication. For 2, it's kind of what I am researching right now. I am not sure I will be able to offer a solution during the hackathon, but I hope we can chat more after the event, and I would love to update you guys on my progress.

giovp commented 3 years ago

Another dimension to this problem, that I would add, is how you estimate signatures of cell types from multiple scRNA-seq batches and technologies - accounting for sequencing depth differences, contaminating RNA and gene-specific platform effects within scRNA

Totally agree. I think for now it is a bit hacky (as we would have to hard code several instances of the same dataset with different feature selection steps), but it is very good to keep in mind for when subtasks become a thing!

LuckyMD commented 3 years ago

Another dimension to this problem, that I would add, is how you estimate signatures of cell types from multiple scRNA-seq batches and technologies - accounting for sequencing depth differences, contaminating RNA and gene-specific platform effects within scRNA

To me this is not a subtask but a completely separate task that was proposed already in #110

giovp commented 3 years ago

To me this is not a subtask but a completely separate task that was proposed already in #110

Mmh, I disagree: the feature selection here is meant to maximize the quality of the deconvolution results, in contrast to the linked issue, where it is about finding some ground truth or separation in the manifold (e.g. making cells easier to cluster). This is very relevant both for array-based spatial technologies and for image-based ones, where you have fewer features.

vitkl commented 3 years ago

Another dimension to this problem, that I would add, is how you estimate signatures of cell types from multiple scRNA-seq batches and technologies - accounting for sequencing depth differences, contaminating RNA and gene-specific platform effects within scRNA

To me this is not a subtask but a completely separate task that was proposed already in #110

The goal of this step is to estimate the average mRNA count of each gene in each cluster, accounting for platform and batch effects. The goal is not to get a binary list of 'significant' marker genes for each cluster. The goal is not to select N genes out of 20k genes (which is a separate task) but to estimate the biological expression of N genes or all 20k genes in each cell cluster.
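A minimal sketch of what "estimating the average mRNA count of each gene in each cluster" could look like, under strong simplifying assumptions. It averages within each (cluster, batch) pair first and then across batches, so that a large batch does not dominate the estimate. This is only a crude stand-in: it does not model sequencing depth, contaminating RNA or gene-specific platform effects, which is what the comment above argues a proper model should do. The function name `cluster_signatures` and the toy data are hypothetical.

```python
from collections import defaultdict


def cluster_signatures(counts, clusters, batches):
    """Estimate per-cluster mean expression, giving each batch equal weight.

    counts:   per-cell gene counts (n_cells x n_genes)
    clusters: cluster label per cell
    batches:  batch label per cell
    """
    n_genes = len(counts[0])
    sums = defaultdict(lambda: [0.0] * n_genes)
    sizes = defaultdict(int)
    for cell, cluster, batch in zip(counts, clusters, batches):
        key = (cluster, batch)
        sizes[key] += 1
        for g, value in enumerate(cell):
            sums[key][g] += value

    # Mean expression per (cluster, batch), grouped by cluster.
    per_cluster = defaultdict(list)
    for (cluster, batch), total in sums.items():
        per_cluster[cluster].append([v / sizes[(cluster, batch)] for v in total])

    # Average the batch-level means so every batch contributes equally.
    return {
        cluster: [sum(col) / len(means) for col in zip(*means)]
        for cluster, means in per_cluster.items()
    }


# Hypothetical data: 4 cells x 2 genes, 2 clusters, 2 batches.
counts = [[2, 0], [4, 0], [6, 2], [0, 8]]
clusters = ["A", "A", "A", "B"]
batches = ["b1", "b1", "b2", "b1"]
signatures = cluster_signatures(counts, clusters, batches)
```

Note that the output is a continuous expression estimate for every gene in every cluster, not a binary marker list and not a selected subset of genes, which matches the distinction drawn in the comment.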