Is there a way to automate this pipeline?

crotoc commented 11 months ago

Thanks for the great tool. I need some suggestions on how to run metacell on several hundred datasets. As I understand it, there is an iterative step that requires manual inspection by the user. This is not feasible for me because I have so many count matrices to run. Can I specify a fixed number of rounds, or are there criteria I can implement in code? My goal is to reduce the cells to metacells so that the count matrices have a manageable dimension. Please let me know. Thanks!

orenbenkiki commented 11 months ago

Not sure what you are asking.

Just invoking the metacells pipeline is normal Python code and you should be able to create a script that does it in a loop over a list of datasets. Then it is just a matter of waiting for the computer to process them all. For such scripts it is typically better to not use a notebook, instead using a standalone script (possibly driven by a makefile or something along these lines).
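
To make the "standalone script" suggestion concrete, here is a minimal sketch of such a driver loop. It deliberately does not call the metacells API: `process_one` is a hypothetical placeholder for whatever function loads one dataset, runs the pipeline, and writes the result. The `.metacells.h5ad` output naming is also just an assumption.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def run_all(
    inputs: Iterable[Path],
    output_dir: Path,
    process_one: Callable[[Path, Path], None],
) -> Dict[str, str]:
    """Run process_one on every input and return a status per input file name.

    process_one(input_path, output_path) is a hypothetical user-supplied
    function that computes metacells for one dataset and writes the output.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    status: Dict[str, str] = {}
    for input_path in inputs:
        # Assumed naming scheme: one output file per input dataset.
        output_path = output_dir / f"{input_path.stem}.metacells.h5ad"
        if output_path.exists():
            # Makes the script restartable after an interruption.
            status[input_path.name] = "skipped"
            continue
        try:
            process_one(input_path, output_path)
            status[input_path.name] = "done"
        except Exception as error:
            # One bad dataset should not stop the remaining ones.
            status[input_path.name] = f"failed: {error}"
    return status
```

Skipping existing outputs gives roughly the same restart behavior a makefile would, and the returned status dictionary shows which datasets need a second look.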

However, giving type annotations to the resulting metacells, detecting which additional genes should be excluded or marked as lateral, and in general evaluating the quality of the result are very difficult to automate - currently, the best we can do is to provide visualization and require the analyst to manually make the necessary decisions. It isn't practical to do this "many" times for many different data sets.

It isn't clear to me why you have so many different datasets which you want to separately compute metacells for. Typically, for large experiments with a lot of data (that is, ones consisting of many "batches"), the common practice is to merge them all into a single data set and analyze it as a whole. Cell states which are common to all batches will result in metacells which include cells from multiple batches, while cell states which are unique to one (or a few) batches will result in metacells which include cells only from these batch(es). One can count what fraction of the cells of each metacell come from each batch and analyze that to gain insight on the result.
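
The per-batch fraction count could be sketched roughly as follows. This is plain Python over two parallel per-cell lists, not a metacells function; the names `metacell_of_cell` and `batch_of_cell` are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def batch_fractions(
    metacell_of_cell: Sequence[Hashable],
    batch_of_cell: Sequence[Hashable],
) -> Dict[Hashable, Dict[Hashable, float]]:
    """For each metacell, return the fraction of its cells from each batch."""
    if len(metacell_of_cell) != len(batch_of_cell):
        raise ValueError("expected one metacell and one batch per cell")
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for metacell, batch in zip(metacell_of_cell, batch_of_cell):
        counts[metacell][batch] += 1
    return {
        metacell: {
            batch: count / sum(batch_counts.values())
            for batch, count in batch_counts.items()
        }
        for metacell, batch_counts in counts.items()
    }
```

A metacell whose fractions are spread over several batches reflects a shared cell state, while one dominated by a single batch reflects either a batch-specific state or a batch effect.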

This of course requires the analysis to deal with "batch effects" - technical differences which distinguish between the batches (e.g., different batches are exposed to different amounts of stress during processing). This may require additional lateral genes (e.g. stress genes), and/or using any of the wide array of available batch-normalization techniques used in scRNA analysis.

crotoc commented 11 months ago

Thanks for the reply! Let me explain my application a little. The goal of my work is to obtain a gene network for each single cell (not each cell type). The method I use is locCSN (PNAS). Because the method is very computationally intensive, it recommends reducing the cells to metacells first. The method runs within each cell type, and we have about 200 cell types in total across all human tissues. They come from multiple sources, so batch effects should be expected. I want to run metacell within each cell type to avoid dealing with the batch effects; because locCSN also runs within each cell type, this naturally fits my workflow. I want to automate metacell to run on the 200 cell types and use the output as the input to locCSN. I hope that makes it clear. Thanks for your suggestions. Please let me know if you have any further advice.

orenbenkiki commented 11 months ago

Interesting. Is it the case that each cell type comes from a single source so you do not have batch effects within each cell type? That is, each batch (source) contains only cells of a single type, and each cell type comes from a single source?

Assuming this is the case:

What I'd do is still run metacells on all the cells. What you should expect is for each metacell to contain cells from a single source (at least, "mostly"). You can QC this as a single data set. If you see metacells that mix "cell types", you need to consider whether this is because they are very similar, or because the algorithm used marker genes which are "lateral" to your cell types (in which case, add them to the lateral genes mask and try again).

Once you are happy that the metacells are homogeneous "enough" (100% would be too much to hope for, given some cell types are very similar to each other), then you can run the MC algorithm separately for each cell type, using the same lateral genes mask as for the combined data set. This is 200 separate runs, but there wouldn't be any manual decisions required. You could then take your "pure cell type" metacells from these 200 runs and use them for the expensive downstream analysis.
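
A rough sketch of that per-cell-type loop, with one shared lateral genes list passed to every run. The `run_one` callable is a hypothetical stand-in for the actual metacells invocation on a subset of cells; `min_cells` is likewise an assumed convenience, not a metacells parameter.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence


def run_per_cell_type(
    type_of_cell: Sequence[Hashable],
    lateral_genes: Sequence[str],
    run_one: Callable[[Hashable, List[int], Sequence[str]], object],
    min_cells: int = 1,
) -> Dict[Hashable, object]:
    """Group cell indices by cell type and call run_one once per type.

    run_one(cell_type, cell_indices, lateral_genes) is hypothetical: it should
    compute metacells for just those cells, using the shared lateral genes.
    """
    cells_of_type: Dict[Hashable, List[int]] = defaultdict(list)
    for cell_index, cell_type in enumerate(type_of_cell):
        cells_of_type[cell_type].append(cell_index)

    results: Dict[Hashable, object] = {}
    for cell_type, cell_indices in cells_of_type.items():
        if len(cell_indices) < min_cells:
            # Too few cells to be worth a separate run (assumed threshold).
            continue
        results[cell_type] = run_one(cell_type, cell_indices, lateral_genes)
    return results
```

Because the lateral genes list is fixed after QC of the combined run, none of the individual runs needs a manual decision.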

All that said - I'm deeply suspicious that your 200 sources each really contains cells of exactly one of 200(!) cell types. I've no idea what protocol you are using, and I'm just a computational guy, not a real biologist - still, this seems very unlikely to me. If there's a possibility of sources containing "a few" cells of another type (which I think would be very likely), then I'd look at the metacells from the single combined run, assign each one a type (say, by the most frequent cell type they contain), and use these instead for the downstream analysis.
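
Assigning each metacell its most frequent cell type could look something like the sketch below. It is plain Python over per-cell labels (the argument names are assumptions), and it also returns the fraction of cells carrying the winning label, which is one simple way to QC how homogeneous each metacell is.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def majority_type(
    metacell_of_cell: Sequence[Hashable],
    type_of_cell: Sequence[Hashable],
) -> Dict[Hashable, Tuple[Hashable, float]]:
    """Map each metacell to (most frequent cell type, fraction of its cells)."""
    if len(metacell_of_cell) != len(type_of_cell):
        raise ValueError("expected one metacell and one type per cell")
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for metacell, cell_type in zip(metacell_of_cell, type_of_cell):
        counts[metacell][cell_type] += 1

    assigned: Dict[Hashable, Tuple[Hashable, float]] = {}
    for metacell, type_counts in counts.items():
        label, count = type_counts.most_common(1)[0]
        assigned[metacell] = (label, count / sum(type_counts.values()))
    return assigned
```

Metacells with a low winning fraction are the ones worth inspecting for very similar cell types or for markers that should be added to the lateral genes.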

crotoc commented 11 months ago

Yes, each cell type is from a single source, so there is no need to deal with batch effects. Actually, the single-cell data are tissue-based, which means several cell types come from a single source. There are about 30 different tissues.

If I understand it correctly, I need to combine all the datasets even if they have batch effects? Thanks! :)

orenbenkiki commented 10 months ago

As long as you don't have the same cell type in several sources, you can just run it all in one combined data set. Of course you should QC this extensively to ensure that the metacells do end up sufficiently homogeneous. Then you can just automate a loop that runs the MC algorithm separately for each source, etc. I'd be interested to know how it works for you.

tanaylab / metacells

Is there a way to automate this pipeline? #61