morrislab / pairtree

Pairtree is a method for reconstructing cancer evolutionary history in individual patients, and analyzing intratumor genetic heterogeneity. Pairtree focuses on scaling to many more cancer samples and cancer cell subpopulations than other algorithms, and on producing concise and informative interactive characterizations of posterior uncertainty.
MIT License

(document the input files) #2

Closed. brucemoran closed this issue 3 years ago

brucemoran commented 3 years ago

Hi,

I'd like to try pairtree but I am at a loss as to how to do so.

Can you give an example dataset, and an example command?

Thanks,

Bruce

jwintersinger commented 3 years ago

Hi @brucemoran,

Thanks for reaching out! I've added documentation about the input files to the README, and also added a section on testing your Pairtree installation that provides an example dataset and command.

I will continue to update the README with more information over the coming days, but please don't hesitate to ask if you have any other questions!

brucemoran commented 3 years ago

Hi @jwintersinger,

thanks for getting back and updating.

After some brief testing, I see that mutations are clustered by clustervars, which takes ssm_fn, in_params_fn and out_params_fn as inputs. These are, respectively, the same file.ssm used by the main pairtree function, a 'blank' cluster file (as below), and a filename for the JSON output, which is then passed as input to the main pairtree function.

echo '{"samples": ["Sample 1", "Sample 2", "Sample 3"], "clusters": [], "garbage": []}' > in_params_fn.json
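
For anyone scripting this step, the same 'blank' file can be written from Python rather than with echo. This is only a sketch: the three keys are taken from the command above, while the helper name write_blank_params and the output path are hypothetical and not part of Pairtree.

```python
import json


def write_blank_params(sample_names, path):
    """Write a 'blank' params file: sample names only, no clusters or garbage.

    The keys mirror the echo command above; this helper is hypothetical
    and is not provided by Pairtree itself.
    """
    params = {"samples": list(sample_names), "clusters": [], "garbage": []}
    with open(path, "w") as fh:
        json.dump(params, fh)
    return params


if __name__ == "__main__":
    # Sample names must be replaced with the ones used in your own data.
    write_blank_params(["Sample 1", "Sample 2", "Sample 3"], "in_params_fn.json")
```

Presumably the sample names should match those used for the .ssm file, though that is an assumption worth checking against the README.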

My question is then about the ssm inputs: I presume I should use all mutations that pass filters?

BTW I have made a Singularity container which runs and passes the example data (NB needed 'plotly' installed to output html): singularity pull shub://brucemoran/Singularity:pairtree.centos7.mamba

Thanks,

Bruce

jwintersinger commented 3 years ago

Hi @brucemoran,

You're exactly right! Typically, you should use all mutations that are of good quality. Pairtree's runtime will depend mostly on the number of clusters (i.e., subclones), not the number of mutations. I've tested with up to 10,000 mutations. The algorithm gives good results on simulated data with up to 30 subclones, and decent results at 100 subclones.

Thank you for putting together the Singularity container! I'll reference it in the README when I make my next documentation update.

Please let me know if you run into any issues. Pairtree's focus is on doing a really good job of tree search; the variant clustering algorithm is useful but less mature. There are other methods for clustering variants that would be worth trying if you find Pairtree's clusters aren't good on your data. In particular, PyClone-VI (https://www.biorxiv.org/content/10.1101/2020.08.31.276212v1.full) looks interesting, but I haven't tried it yet. It should be straightforward to translate clustering outputs from other methods into a format suitable for Pairtree.
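
As a rough illustration of what such a translation might look like, the sketch below groups per-mutation cluster assignments from some other tool into the params layout shown earlier in this thread. It rests on assumptions that should be checked against the README: that "clusters" is a list of lists of variant IDs, and that those IDs match the ones in the .ssm file. The function name and the input mapping are hypothetical, and parsing the other tool's output is left out.

```python
import json
from collections import defaultdict


def assignments_to_params(assignments, sample_names):
    """Group variant IDs by cluster label into a params-style dict.

    `assignments` maps variant ID -> cluster label, as parsed from another
    clustering tool (hypothetical input; parsing is not shown here).
    Assumes "clusters" is a list of lists of variant IDs.
    """
    grouped = defaultdict(list)
    for variant_id, label in assignments.items():
        grouped[label].append(variant_id)
    # Sort labels and IDs so the output is deterministic.
    clusters = [sorted(grouped[label]) for label in sorted(grouped)]
    return {"samples": list(sample_names), "clusters": clusters, "garbage": []}


if __name__ == "__main__":
    # Made-up assignments, purely for illustration.
    example = {"s0": 1, "s1": 0, "s2": 1}
    params = assignments_to_params(example, ["Sample 1", "Sample 2"])
    print(json.dumps(params))
```

The resulting dict can be written out with json.dump and used in place of the file that clustervars would otherwise produce, provided the format assumptions above hold.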