morrislab / pairtree

Pairtree is a method for reconstructing cancer evolutionary history in individual patients, and analyzing intratumor genetic heterogeneity. Pairtree focuses on scaling to many more cancer samples and cancer cell subpopulations than other algorithms, and on producing concise and informative interactive characterizations of posterior uncertainty.
MIT License

Getting different tree solutions across different pairtree runs. #51

Open itigupta2429 opened 4 months ago

itigupta2429 commented 4 months ago

Hello Team,

I've been working with pairtree to analyze some of the cancer samples from our cohort data. Since each patient has multiple samples, including primary and recurrent tissues, the resulting pairtree often appears linear with minimal branching and many nodes (more than 50).

I understand that adjusting the concentration parameter can help achieve proper branching. However, even with the same concentration parameter, I've noticed that pairtree produces different trees across different runs. I've tried setting the seed parameter in both the clustervars and pairtree steps, but unfortunately, this hasn't resolved the issue.

Code used:

python3 $pairtree/clustervars --seed 5555 $ssm $json $output/PT-id_conc_minus2.json

python3 $pairtree/pairtree --params $output/PT-id_conc_minus2.json $ssm $output/PT-id_conc_minus2.npz --seed=5555

python3 $pairtree/summposterior --runid methods $ssm $output/PT-id_conc_minus2.json $output/PT-id_conc_minus2.npz $output/PT-id_conc_minus2.summposterior.html

python3 $pairtree/plottree --runid methods $ssm $output/PT-id_conc_minus2.json $output/PT-id_conc_minus2.npz $output/PT-id_conc_minus2.plottree.html

Could you please assist me with this challenge?

ethanumn commented 4 months ago

Hi there --

I'm not able to reproduce any behavior where the pairtree or clustervars scripts output different results when given the same set of parameters and seed.
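One way to check this on your side is to compare the contents of two results files produced with the same parameters and seed. The sketch below is only an illustration and is not part of Pairtree: `compare_npz` is a hypothetical helper, and it assumes the .npz files contain plain NumPy arrays that `np.load` can read without pickling. It compares array contents rather than raw file bytes, because .npz archives can embed timestamps that differ between otherwise identical runs.

```python
import numpy as np


def compare_npz(path_a, path_b):
    """Return the names of arrays that differ between two .npz files.

    Hypothetical helper for illustration; not part of Pairtree.
    An empty list means every array matched exactly.
    """
    differing = []
    with np.load(path_a) as a, np.load(path_b) as b:
        keys_a, keys_b = set(a.files), set(b.files)
        # Arrays present in only one of the two files count as differences.
        differing.extend(sorted(keys_a ^ keys_b))
        for key in sorted(keys_a & keys_b):
            # Exact comparison: arrays containing NaN are reported as differing.
            if not np.array_equal(a[key], b[key]):
                differing.append(key)
    return differing
```

Calling `compare_npz("run1.npz", "run2.npz")` on two runs (the file names here are placeholders) and getting an empty list back would suggest the runs are identical. A non-empty list names the arrays worth inspecting.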

The concentration parameter for the clustervars script is the log of the concentration. If you provide a value that's greater than 0, both models in clustervars (linfreq, pairwise) should produce more clusters. However, it's not necessarily the case that having more clusters will lead to more branching.

Does the data imply that there should be multiple branching events? If so, you could have a data normalization issue. What type of sequencing was used?

itigupta2429 commented 4 months ago

My data is from a patient with 1 primary tissue & 2 recurrent tissues. I am expecting 2 early branches (for primary & recurrent) & 1 late branch (separating the 2 recurrent tissues). This data is from whole genome sequencing (DRAGEN). I did not perform any normalization; could you please suggest at which step I should do this? Also, could the different output be related to the number of input variants?

ethanumn commented 3 months ago

Unless you have some reason to suspect there are a large number of clones (say 50 like you've mentioned), it would be difficult to confidently resolve 50 clones using only 3 samples. How many clones do you obtain when using the clustervars script with the default settings?

If you're varying the number of input variants, then this may change the clustering you obtain. This could also impact whether or not you see the branching you're expecting. I think some basic analysis of the copy number states and VAFs for the mutations you've included in the analysis should reveal whether or not a clone tree reconstruction method would resolve multiple branches. For example, if the majority of mutations only have non-zero VAFs in one sample, then this may result in one long linear branch. In this scenario, no amount of clustering or data normalization is likely to change the trees output by Pairtree.
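As a rough illustration of the VAF check described above, the sketch below counts how many mutations share each presence/absence pattern across samples. It is not Pairtree code: `presence_patterns` is a hypothetical helper, the read counts are made-up toy values, and VAF is taken simply as variant reads divided by total reads, so copy number is ignored.

```python
from collections import Counter

import numpy as np


def presence_patterns(var_reads, total_reads, min_vaf=0.0):
    """Count mutations by which samples they appear in.

    Hypothetical helper for illustration. Inputs are
    (mutations x samples) arrays of read counts. A mutation is
    called present in a sample when its VAF exceeds `min_vaf`.
    """
    var_reads = np.asarray(var_reads, dtype=float)
    total_reads = np.asarray(total_reads, dtype=float)
    # VAF = variant reads / total reads; treated as 0 where coverage is 0.
    vaf = np.divide(
        var_reads,
        total_reads,
        out=np.zeros_like(var_reads),
        where=total_reads > 0,
    )
    present = vaf > min_vaf
    return Counter(tuple(int(p) for p in row) for row in present)


# Toy data (invented): columns are primary, recurrence 1, recurrence 2.
var_reads = [
    [10, 12, 9],  # present in every sample
    [8, 0, 0],    # primary only
    [0, 7, 6],    # both recurrences
    [0, 0, 5],    # recurrence 2 only
    [9, 0, 0],    # primary only
]
total_reads = [[40, 40, 40]] * 5

patterns = presence_patterns(var_reads, total_reads)
for pattern, count in sorted(patterns.items(), reverse=True):
    print(pattern, count)
```

If nearly all mutations fall under a single-sample pattern such as (1, 0, 0), that would be consistent with the long linear branch described above. A mix of shared and private patterns is what you would expect if the data supported branching.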

Also -- if the coverage is somewhat uniform from WGS then I wouldn't expect normalization to help.

I'm not sure how much more help I can be without seeing what you're obtaining as far as output from Pairtree.