ventolab / CellphoneDB

CellPhoneDB can be used to search for a particular ligand/receptor, or interrogate your own HUMAN single-cell transcriptomics data.
https://www.cellphonedb.org/
MIT License

Is there any way to run CellphoneDB using a specific cluster pair(s)? #106

Closed stanaka6 closed 7 months ago

stanaka6 commented 1 year ago

Hello,

Thank you so much for developing a wonderful tool!

I want to run the CellphoneDB method 2 statistical analysis on my data, which has about 40k cells. However, I keep getting out-of-memory errors even when I request 1024 GB of memory for the batch job on our institutional high-performance computing cluster.

Therefore, I was wondering whether there is a way to specify in advance which cluster combinations CellphoneDB should score, because I'm not interested in all pairs within my data.

The clusters in my data are named Tissue A-1, 2, ...30 and Tissue B-1, 2, ...10. I'm interested in the following combinations:
Tissue A-1 & Tissue B-1, 2, or ...10
Tissue A-2 & Tissue B-1, 2, or ...10
...

I don't need to calculate the score within the same tissue (Tissue A vs Tissue A, Tissue B vs Tissue B), such as:
Tissue A-1 & Tissue A-2
Tissue B-1 & Tissue B-2
...

Is there any way to specify the pairs of interest when running method 2?

Alternatively, I could subset my data by sets of clusters and run CellphoneDB on each subset, which may help resolve the out-of-memory issue.

Any suggestions and comments would be appreciated!

Thanks

ktroule commented 1 year ago

Thanks for using CellphoneDB.

To avoid running out of memory you have two options (or a combination of both):

Kind regards

stanaka6 commented 1 year ago

Thank you so much for answering my question, @ktroule! I am interested in using a microenvironment file rather than subsetting a certain percentage of the cells, because some of the cell types I am interested in contain very few cells compared to the major cell types.

I need some clarifications:

So if I want to compare only Tissue A clusters vs Tissue B clusters, and not clusters within the same tissue, as described above, should I set up the microenvironment file as in the image below?

[Image: Screenshot 2023-04-28 at 8 19 55 AM]

I have another question:

Does method 2 use the entire dataset to calculate statistical significance, rather than only the two cell types in each pairwise comparison? If it uses the entire dataset, and I subset the data by cell type (not by the percentage of cells you suggested) to run CellphoneDB because of the memory issue, should I correct the p-values when I combine the results? I would also appreciate it if you could point us to a paper describing the statistics in detail (sorry, I am not sure which one...).

Thank you for your help!

ktroule commented 1 year ago

Hi @stanaka6

After rereading your initial question, I find it strange that you get the out-of-memory error with 1024 GB of memory and "only" 40k cells. I also forgot to mention that method 2 has a built-in subsampling argument that performs geometric-sketching subsampling. This should reduce your dataset without losing its biological information. I would suggest using this option.
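As a rough sketch of what enabling that option might look like from Python: the argument names below (subsampling, subsampling_log, subsampling_num_pc, subsampling_num_cells) and the import path are my understanding of the CellphoneDB v4 Python API and should be checked against the installed version. All file paths and numeric values are hypothetical placeholders.

```python
# Sketch only. Argument names are assumed from the CellphoneDB v4 Python API;
# every path and number below is a hypothetical placeholder.

def build_call_kwargs(num_cells=10000):
    """Collect keyword arguments for method 2 with subsampling switched on."""
    return dict(
        cpdb_file_path="cellphonedb.zip",   # hypothetical database path
        meta_file_path="meta.tsv",          # hypothetical cell -> cluster table
        counts_file_path="counts.h5ad",     # hypothetical expression matrix
        counts_data="hgnc_symbol",
        iterations=1000,
        threads=4,
        subsampling=True,                   # enable geometric-sketching subsampling
        subsampling_log=False,              # set True if counts are not log-transformed
        subsampling_num_pc=100,             # principal components used for sketching
        subsampling_num_cells=num_cells,    # number of cells to keep
        output_path="results/",             # hypothetical output directory
    )


def run_statistical_analysis(kwargs):
    """Run method 2; imported lazily so the sketch loads without CellphoneDB."""
    from cellphonedb.src.core.methods import cpdb_statistical_analysis_method
    return cpdb_statistical_analysis_method.call(**kwargs)


if __name__ == "__main__":
    # Only prints the configuration; call run_statistical_analysis(...) to execute.
    print(build_call_kwargs())
```

Keeping the configuration in a plain dictionary makes it easy to try a few values of subsampling_num_cells and see where memory use becomes acceptable.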

Microenvironment files do not affect how significance is calculated, because all clusters are still used.
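For the cross-tissue-only comparison asked about above, one way to build such a file is to give every Tissue A / Tissue B pair its own microenvironment, so that no two clusters from the same tissue ever share one. This is a minimal sketch, assuming the microenvironment file is a two-column table of cell type and microenvironment. The column names, the "__" naming scheme, and the helper functions are illustrative assumptions, not part of CellphoneDB. I am also not certain whether same-cluster pairs (e.g. Tissue A-1 with itself) are still reported within a microenvironment, so they may need filtering afterwards.

```python
import csv
import io
from itertools import product


def cross_tissue_microenvs(tissue_a, tissue_b):
    """Return (cell_type, microenvironment) rows, one microenvironment per A/B pair."""
    rows = []
    for a, b in product(tissue_a, tissue_b):
        env = f"{a}__{b}"          # hypothetical naming scheme for the pair
        rows.append((a, env))
        rows.append((b, env))
    return rows


def to_tsv(rows):
    """Serialise the rows as tab-separated text with an assumed header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["cell_type", "microenvironment"])  # assumed column names
    writer.writerows(rows)
    return buf.getvalue()


# Cluster names as described in the question: 30 in Tissue A, 10 in Tissue B.
tissue_a = [f"Tissue A-{i}" for i in range(1, 31)]
tissue_b = [f"Tissue B-{i}" for i in range(1, 11)]
rows = cross_tissue_microenvs(tissue_a, tissue_b)

if __name__ == "__main__":
    with open("microenvs.tsv", "w", newline="") as handle:  # hypothetical file name
        handle.write(to_tsv(rows))
```

With 30 and 10 clusters this gives 300 microenvironments of two cell types each. The file only restricts which pairs are considered; it does not shrink the set of cells used for the permutations.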

This is the reference paper: https://www.nature.com/articles/s41596-020-0292-x
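As I read that paper, the null distribution is built by shuffling the cluster labels across all cells, which is why the set of cells passed to CellphoneDB matters for the p-values. The toy sketch below illustrates that idea with plain Python. It is not the CellphoneDB implementation: the function names, the toy data, and the simple average of ligand and receptor means are assumptions made for illustration.

```python
import random
from statistics import mean


def pair_mean(ligand, receptor, labels, cluster_a, cluster_b):
    """Average of mean ligand expression in cluster_a and mean receptor expression in cluster_b."""
    lig = [x for x, c in zip(ligand, labels) if c == cluster_a]
    rec = [x for x, c in zip(receptor, labels) if c == cluster_b]
    return (mean(lig) + mean(rec)) / 2


def permutation_pvalue(ligand, receptor, labels, cluster_a, cluster_b,
                       iterations=1000, seed=0):
    """Fraction of label shuffles whose pair mean is at least the observed one."""
    rng = random.Random(seed)
    observed = pair_mean(ligand, receptor, labels, cluster_a, cluster_b)
    shuffled = list(labels)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)  # labels are shuffled across ALL cells, not just the pair
        if pair_mean(ligand, receptor, shuffled, cluster_a, cluster_b) >= observed:
            hits += 1
    return hits / iterations


# Toy data: cluster A expresses the ligand, cluster B the receptor, cluster C neither.
labels = ["A"] * 5 + ["B"] * 5 + ["C"] * 20
ligand = [5] * 5 + [0] * 25
receptor = [0] * 5 + [5] * 5 + [0] * 20

if __name__ == "__main__":
    print(permutation_pvalue(ligand, receptor, labels, "A", "B", iterations=200))
```

Because the shuffle draws on every cell in the input, dropping whole cell types before the run changes the pool the null is built from, so p-values from separately subsetted runs are not directly comparable with those from a single run on the full dataset.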