tanaylab / metacells

Metacells - Single-cell RNA Sequencing Analysis
MIT License

enforcing metacells by donor #28

Closed hoondy closed 1 year ago

hoondy commented 1 year ago

Thanks for building a great tool for analyzing single-cell data. I was wondering if there is an option to enforce the sampling of metacells by donors or replicates. What I am trying to accomplish here is to do some type of case-control analysis. However, I haven't figured out an easy way to group metacells by different donors or conditions. The alternative way is to run metacells by donors and merge them but this is inefficient. Can you help?

orenbenkiki commented 1 year ago

Not sure what you mean. If you wish all metacells to come from exactly one donor/batch, the simplest way would be to run the metacells algorithm separately on the cells of each donor - why would that be less efficient?

Also note that if the amount of data isn't very large and you have access to a multi-core server (or several servers), you can run the grouping of each donor in parallel.
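As a rough illustration of the per-donor parallel run suggested above, here is a minimal sketch using only the Python standard library. The names `split_by_donor`, `run_per_donor` and `placeholder_worker` are hypothetical and not part of the metacells package; the worker is a stand-in for whatever per-donor grouping you would actually run.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_by_donor(cell_ids, donor_of_cell):
    """Group cell identifiers by their donor label."""
    by_donor = defaultdict(list)
    for cell, donor in zip(cell_ids, donor_of_cell):
        by_donor[donor].append(cell)
    return dict(by_donor)


def run_per_donor(by_donor, worker, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Apply `worker` to each donor's cells in parallel, returning {donor: result}."""
    donors = sorted(by_donor)
    with executor_cls(max_workers=max_workers) as pool:
        # Consume the lazy map inside the `with` block, while the pool is alive.
        results = pool.map(worker, [by_donor[donor] for donor in donors])
        return dict(zip(donors, results))


def placeholder_worker(cells):
    """HYPOTHETICAL stand-in for the real per-donor metacells computation.

    Here it only counts the cells; replace it with your own pipeline.
    """
    return len(cells)
```

With the default `ProcessPoolExecutor`, the worker has to be a picklable top-level function, and on some platforms the call should sit under an `if __name__ == "__main__":` guard. Passing `executor_cls=ThreadPoolExecutor` is a convenient way to try the plumbing on a small example first.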

hoondy commented 1 year ago

Say we have a single-cell dataset consisting of 100 donors, and we want to perform QC, variable feature selection, and batch correction on the whole dataset together. However, if I were to run the metacells pipeline separately for each donor, we would end up with different QC and variable gene selection per donor. If we then merge the metacells from each donor, the features won't agree with each other. I wonder if this is the correct way to apply metacells to a large dataset like this.

orenbenkiki commented 1 year ago

Batch correction is a PITA and while MCs can help, they don't offer a silver bullet, I'm afraid. In our experience, dealing with such issues ends up being an iterative process which depends on the specific case.

For example, you could just compute MCs over the whole dataset to start with, and then look at the fraction of the cells in each MC from each donor. If there is no strong batch effect, the MCs will consist of a mixture of cells from many donors. If there is a strong batch effect, some MCs will contain cells from only a few donors. It would then be possible to do differential expression between these MCs and the other MCs "of the same type" from other donors to try and understand the batch effect - these may be type-specific differences or global differences across different types.
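The donor-composition check described above can be sketched as follows. This is a minimal illustration, assuming you already have one metacell label and one donor label per cell (for instance, taken from per-cell annotations of your data set); the function names and the 0.8 threshold are assumptions chosen for the example, not anything defined by the metacells package.

```python
from collections import Counter, defaultdict


def donor_fractions(metacell_of_cell, donor_of_cell):
    """Return {metacell: {donor: fraction of that metacell's cells}}."""
    counts = defaultdict(Counter)
    for metacell, donor in zip(metacell_of_cell, donor_of_cell):
        counts[metacell][donor] += 1
    return {
        metacell: {donor: n / sum(per_donor.values()) for donor, n in per_donor.items()}
        for metacell, per_donor in counts.items()
    }


def dominated_metacells(fractions, threshold=0.8):
    """List metacells where one donor contributes at least `threshold` of the cells."""
    return sorted(
        metacell
        for metacell, per_donor in fractions.items()
        if max(per_donor.values()) >= threshold
    )


# Toy per-cell labels (made-up data, for illustration only).
metacell_of_cell = [0, 0, 0, 0, 1, 1, 1, 1, 1]
donor_of_cell = ["d1", "d2", "d3", "d1", "d2", "d2", "d2", "d2", "d1"]

fractions = donor_fractions(metacell_of_cell, donor_of_cell)
flagged = dominated_metacells(fractions, threshold=0.8)
print(flagged)
```

In this toy input, metacell 0 mixes three donors while metacell 1 is mostly made of cells from `d2`, so only the latter is flagged as a candidate for a closer look at batch effects. What counts as "dominated" depends on how many donors you have and how evenly they were sampled.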

Other approaches are possible. For example, after doing the above, you may notice that the donors seem to fall into several groups, where MCs mix up cells from donors within each group but not between groups. You may be able to correlate this with some donor metadata (age, sex, pathology, experiment, etc.). You could also split the data set into these groups and compute MCs separately for each such group, then do differential expression analysis between MCs of "the same type" across the groups. Depending on the experiment sampling, you may also be able to say something about the relative fraction of MCs of different types in each group.
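To make the comparison between MCs of "the same type" in two donor groups a bit more concrete, here is a minimal sketch of a per-gene log2 fold change. It assumes each MC is represented as a list of per-gene expression fractions in a shared gene order, and that you have already decided which MCs are of the same type. The regularization `epsilon` is an arbitrary choice for this example, and this is a plain fold change, not a statistical test.

```python
import math


def mean_fractions(metacells):
    """Average the per-gene fractions over a list of metacells."""
    n_genes = len(metacells[0])
    return [sum(mc[g] for mc in metacells) / len(metacells) for g in range(n_genes)]


def log2_fold_changes(group_a, group_b, epsilon=1e-5):
    """Per-gene log2((mean_a + epsilon) / (mean_b + epsilon)).

    `epsilon` damps fold changes of very lowly expressed genes (assumed value).
    """
    mean_a = mean_fractions(group_a)
    mean_b = mean_fractions(group_b)
    return [math.log2((a + epsilon) / (b + epsilon)) for a, b in zip(mean_a, mean_b)]


# Toy MCs of one type from two donor groups (made-up fractions for three genes).
group_a = [[0.02, 0.01, 0.001], [0.02, 0.01, 0.003]]
group_b = [[0.01, 0.01, 0.002], [0.01, 0.01, 0.002]]

fold_changes = log2_fold_changes(group_a, group_b)
```

Here the first gene comes out at close to a two-fold difference between the groups, while the other two are essentially unchanged. Genes that show a consistent fold change across several MC types would point at a global group effect, while genes that differ in only one type would point at a type-specific one.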

If you have enough cells for each donor, you could generate MCs for each donor and then do differential expression analysis between MCs of "the same type" across all donors, but that would be more work and probably less precise than using groups of donors which are "similar" to each other.

You can also apply any of the plethora of batch-normalization methods that exist for scRNA-seq data, either before running the MCs or (e.g. if you do MCs per donor or per group) after running the MCs (the advantage here is that MCs are much easier to work with than single cells).

So... it depends on both your experiment's construction and on what you'll see in the results. Sorry I don't have a simpler answer but there just doesn't seem to be one.

hoondy commented 1 year ago

Thanks for sharing your thoughts on this. I think you answered my question, and yes, there is no simpler solution to this. I will close the issue.