Mean calculation - Githubissues

Leonhard2000 commented 2 years ago

Hi,

CPDB calculates the mean values in the output file "means.txt" as the mean between interaction partner_a & partner_b.

Problem_1: A high mean might be misleading if partner_a has a very high value and partner_b has a very low value. Is there really a strong biological interaction if there is plenty of ligand (partner_a) but nearly 0 expression of the receptor (partner_b)?

The simplest solution would be to filter out lowly expressed genes before the CPDB analysis. But because CPDB tests for specificity, and even lowly expressed interactions have an impact, we do not filter these genes out. Instead, we do not use the mean of these interactions and mark them as "low" expressed. It would be great if a user-defined threshold could be implemented that returns "low" instead of a number as the mean for all interactions where one partner is below the threshold.
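A minimal sketch of the requested behaviour, in Python. The function name `interaction_mean`, the `threshold` parameter, and the numbers are hypothetical and not part of CPDB; the sketch only illustrates returning "low" when either partner falls below a user-defined cutoff.

```python
from typing import Union


def interaction_mean(partner_a: float, partner_b: float,
                     threshold: float) -> Union[float, str]:
    """Return the mean of both partners, or "low" if either is below threshold.

    Hypothetical helper for illustration; not a CPDB function.
    """
    if partner_a < threshold or partner_b < threshold:
        return "low"
    return (partner_a + partner_b) / 2


# Illustrative values: both partners well expressed vs. receptor close to 0.
well_expressed = interaction_mean(800, 600, threshold=10)
receptor_missing = interaction_mean(800, 1, threshold=10)
```

With these made-up values, the first call yields a numeric mean, while the second is marked "low" even though its ligand value is high.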

Problem_2: For interaction partners with several assigned Ensembl_IDs, like LTR4B (ENSG00000213903 & ENSG00000285456), CPDB calculates the mean rather than the sum. This decreases the value of these genes and thus the mean of their interactions. It would be better to sum up all genes that code for the same protein.

Example for LTR4B:
ENSG00000213903 = 2000 norm. counts
ENSG00000285456 is not found and thus 0
The LTR4B value would be 1000, which is used for the mean calculation of all LTR4B interactions.

This effect is worse for genes with more IDs, like AGER (7 Ensembl-IDs).
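The difference between the two aggregation choices for the LTR4B example can be sketched in Python. The dictionaries and the `aggregate` helper are hypothetical names for illustration and do not reflect how CPDB stores its data internally.

```python
from statistics import mean

# Normalized counts per Ensembl ID, from the LTR4B example.
# ENSG00000285456 is not found in the input, so it is absent here.
counts = {"ENSG00000213903": 2000.0}

# Hypothetical mapping from a gene to its assigned Ensembl IDs.
gene_to_ids = {"LTR4B": ["ENSG00000213903", "ENSG00000285456"]}


def aggregate(gene, how):
    """Combine the counts of all Ensembl IDs of a gene; missing IDs count as 0."""
    values = [counts.get(ensembl_id, 0.0) for ensembl_id in gene_to_ids[gene]]
    return how(values)


ltr4b_mean = aggregate("LTR4B", mean)  # 1000.0, the value described above
ltr4b_sum = aggregate("LTR4B", sum)    # 2000.0, the proposed alternative
```

Passing the aggregation function as an argument keeps the choice of metric (mean, sum, max, ...) with the user.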

luzgaral commented 2 years ago

Dear Leonhard2000,

Thank you for your insightful discussion.

Regarding Problem_1, using the mean has advantages and disadvantages. You can set the -percent threshold to limit the inclusion of genes expressed by a low proportion of cells.

Regarding Problem_2, this is a general issue with aligning Ensembl and UniProt. Thank you for raising it; we will debate it. Meanwhile, you can translate Ensembl_IDs into gene symbols and use your preferred metric to aggregate multiple signals.

Best,

Luz

Leonhard2000 commented 2 years ago

Thanks Luz,

yeah, that is a workaround for Problem_2. But in Problem_1 I mean the expression value (e.g. normalized counts, TPM, RPKM, ...) and not the % of cells expressing the gene.

Example_1: 100% of the cells express the ligand with a value of 500 and the receptor with a value of 500. The CPDB mean would be 500.
Example_2: 100% of the cells express the ligand with a value of 1000 and the receptor with a value of 2. The CPDB mean would be 501.

Interpretation: In my view, the second interaction is not "stronger" than the first one and should be interpreted with caution, because the receptor is nearly 0, which would mean no interaction.

My solution so far with Excel: use input_counts, manually calculate each ligand and receptor value, and give a "warning" if one or both are below a defined threshold. Problem: Not all researchers are aware of this issue, and CPDB has all the data to provide these "warnings" on its own if a threshold value is given.
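The same check can be sketched outside Excel. This Python version keeps the mean and adds a warning next to it. The function `check_interaction` and the threshold of 10 are assumptions for illustration; only the ligand and receptor values come from the two examples in this thread.

```python
def check_interaction(ligand, receptor, threshold):
    """Return (mean, warning) for one ligand-receptor pair.

    Hypothetical helper: the warning names every partner below the threshold
    and is an empty string when both partners pass.
    """
    low = [name for name, value in (("ligand", ligand), ("receptor", receptor))
           if value < threshold]
    warning = f"warning: {' and '.join(low)} below {threshold}" if low else ""
    return (ligand + receptor) / 2, warning


# Ligand and receptor values from Example_1 and Example_2.
examples = {"Example_1": (500, 500), "Example_2": (1000, 2)}

for label, (ligand, receptor) in examples.items():
    mean_value, warning = check_interaction(ligand, receptor, threshold=10)
    print(label, mean_value, warning)
```

With the assumed threshold, Example_1 keeps its mean of 500 without a warning, while Example_2 keeps its mean of 501 but is flagged because the receptor is below the threshold.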

ventolab / CellphoneDB

Mean calculation #66