kircherlab / cfDNA

cfDNA analysis workflow
MIT License

Correlation and rank files #13

Closed pageale closed 2 months ago

pageale commented 3 months ago

Hello again! I was able to run the script and everything worked correctly! The problem I face is that I am trying to interpret the biological meaning of the correlation and rank values. Despite reading both publications related to tissue of origin, "Cell type signatures in cell-free DNA fragmentation profiles reveal disease biology" and "Cell-free DNA Comprises an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin", I think I am mixing up concepts or do not fully understand them.

First, regarding the correlation values per tissue / cell type: the more negative the value, the more it means that the tissue is present in the cfDNA, right? [image]

For instance, in this case, would it be correct to say that there is a signal of the presence (or high expression) of eosinophils and neutrophils, or not?

Secondly, regarding the diffRank, I think this is the only step in the workflow that uses the ref_sample (the one added in the config file for each sample), and according to the R script the sample rank is subtracted from the reference rank (rank_ref - rank_sample). What is the purpose of the ref? Does it necessarily have to be a healthy reference or not? And would the diffRank be the level of change relative to its reference? [image] Here, it is the same sample, but ordered by rank. So, would it be correct to say that the eosinophils kept their expression, while for neutrophils it increased compared to the reference?

Thank you in advance; I look forward to your reply!

sroener commented 3 months ago

Hi @pageale

The workflow and its scripts are based on Snyder et al. 2016. The gene expression workflow mostly focuses on the results shown in Figure 5.

Regarding your first question, the reason for the negative correlation is based on the results shown in Figure 5E. The plot shows the correlation between gene expression values from the Human Protein Atlas and the inferred nucleosome distance (inferred by FFT). The strongest (negative) correlation was observed in the range of 190-200 bp. This means that the tissues with the strongest negative correlation are the ones inferred to contribute the most to the cfDNA signal. Just keep in mind that the correlations are still relatively low and that the values for different tissues can be close to each other (e.g. classical and intermediate monocytes).
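To make the ranking idea concrete, here is a minimal Python sketch. It is not the workflow's actual script. It assumes you already have one FFT intensity value per gene (for the 190-200 bp range) and per-gene expression values for each tissue; the function names `pearson` and `rank_tissues` and all numbers are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def rank_tissues(fft_intensity, expression):
    """Return (tissue, correlation, rank) tuples; rank 1 = most negative correlation."""
    corr = {tissue: pearson(fft_intensity, values)
            for tissue, values in expression.items()}
    ordered = sorted(corr, key=corr.get)  # ascending: most negative first
    return [(tissue, corr[tissue], i + 1) for i, tissue in enumerate(ordered)]

# Toy data (hypothetical): five genes, one FFT intensity value per gene.
fft_intensity = [5.0, 4.0, 3.0, 2.0, 1.0]
expression = {
    "neutrophil": [1.0, 2.0, 3.0, 4.0, 5.0],
    "eosinophil": [1.0, 3.0, 2.0, 5.0, 4.0],
    "hepatocyte": [3.0, 1.0, 4.0, 2.0, 5.0],
}

ranking = rank_tissues(fft_intensity, expression)
for tissue, r, rank in ranking:
    print(f"{rank}\t{tissue}\t{r:.2f}")
```

In this toy example the tissue whose expression is most strongly anti-correlated with the FFT intensity ends up at rank 1, which is the one you would read as the strongest inferred contributor.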

Regarding your second question, the reference is whatever you want to compare your samples against. From a methodological perspective, it does not matter what your reference is. The reason for picking a healthy reference is based solely on the experimental design. Depending on the experimental design, you could also compare two different phenotypes (e.g. cancer A and cancer B, or two different phenotypes in the same organ). Ultimately, it is up to you to make sense of the signals. Regarding the example, one could interpret the results as you did. Just keep in mind that they are sorted by rank change, and that a change in rank might not tell you a lot about the change in correlation. Reusing the prior example (classical and intermediate monocytes), minimal changes in their correlations might already lead to a rank difference.
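The point about rank changes can be illustrated with a small Python sketch. It is an illustration only, not the R script from the workflow. It assumes the rank difference is computed as rank_ref - rank_sample (as described in the question above) and that rank 1 is the most negative correlation; the helper names and the correlation values are made up.

```python
def ranks(corr):
    """Map each cell type to its rank; rank 1 = most negative correlation."""
    ordered = sorted(corr, key=corr.get)
    return {name: i + 1 for i, name in enumerate(ordered)}

def diff_rank(ref_corr, sample_corr):
    """Rank difference per cell type, computed as rank_ref - rank_sample."""
    ref_rank, sample_rank = ranks(ref_corr), ranks(sample_corr)
    return {name: ref_rank[name] - sample_rank[name] for name in ref_corr}

# Hypothetical correlations; the two monocyte types are almost identical.
ref = {
    "classical monocyte": -0.301,
    "intermediate monocyte": -0.300,
    "eosinophil": -0.320,
    "hepatocyte": -0.150,
}
sample = {
    "classical monocyte": -0.300,
    "intermediate monocyte": -0.302,
    "eosinophil": -0.321,
    "hepatocyte": -0.150,
}

result = diff_rank(ref, sample)
for name, delta in result.items():
    print(f"{name}\t{delta:+d}")
```

Here the two monocyte types swap ranks although their correlations change by only 0.001 to 0.002, so it helps to look at the correlation values alongside the rank difference.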

I hope I could answer your questions.

sroener commented 2 months ago

Closing the issue; feel free to reopen it if there are still questions.