morrislab / pairtree

Pairtree is a method for reconstructing cancer evolutionary history in individual patients, and analyzing intratumor genetic heterogeneity. Pairtree focuses on scaling to many more cancer samples and cancer cell subpopulations than other algorithms, and on producing concise and informative interactive characterizations of posterior uncertainty.
MIT License

Adding CN to VCF file #46

Closed ElizabethBorden closed 1 month ago

ElizabethBorden commented 8 months ago

Hello,

Can you suggest the best method to create a VCF file that incorporates copy number data? I see that it is required to accurately make the .ssm file, but I cannot tell how this annotation was added. Can you suggest software that works well upstream of the ssm_base_converter.py script?

Thank you!

-Elizabeth

ethanumn commented 8 months ago

Hi Elizabeth,

The ssm_base_converter.py file provides a nice framework which can be subclassed for your particular use case. If you have a VCF file and another file containing copy number calls, you could rewrite part of the vcf_to_ssm.py script to take in that second file, and process its contents in the p_var_read_prob function to generate the correct values for the var_read_prob column in the ssm file. This might be the easiest thing to do.
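As a rough illustration of that suggestion, the sketch below shows one way to look up the copy number segment covering each variant position before computing var_read_prob. It is not code from the Pairtree repository: the function names, the tuple layout (chrom, start, end, major_cn, minor_cn), the 1-based inclusive coordinates, and the diploid fallback are all assumptions made for the example, and segments are assumed not to overlap.

```python
from bisect import bisect_right


def build_segment_index(segments):
    """Group hypothetical (chrom, start, end, major_cn, minor_cn) tuples by chromosome.

    Returns {chrom: (starts, segs)} with segs sorted by start, so that
    lookups can use binary search. Segments are assumed not to overlap.
    """
    grouped = {}
    for chrom, start, end, major_cn, minor_cn in segments:
        grouped.setdefault(chrom, []).append((start, end, major_cn, minor_cn))
    index = {}
    for chrom, segs in grouped.items():
        segs.sort()
        index[chrom] = ([s[0] for s in segs], segs)
    return index


def lookup_copy_number(index, chrom, pos, default=(1, 1)):
    """Return (major_cn, minor_cn) for the segment covering pos.

    Falls back to `default` (diploid, one copy of each allele) when no
    segment covers the position; that fallback is an assumption.
    """
    if chrom not in index:
        return default
    starts, segs = index[chrom]
    i = bisect_right(starts, pos) - 1  # last segment starting at or before pos
    if i >= 0 and segs[i][0] <= pos <= segs[i][1]:
        return segs[i][2], segs[i][3]
    return default


# Made-up segments, for illustration only.
segments = [
    ("chr1", 1, 1000, 1, 1),
    ("chr1", 1001, 5000, 2, 1),
    ("chr2", 1, 3000, 2, 0),
]
index = build_segment_index(segments)
print(lookup_copy_number(index, "chr1", 1500))
```

The returned allele counts would then be passed to whatever logic you write inside p_var_read_prob; how that function receives them depends on how you adapt the script.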

ElizabethBorden commented 7 months ago

Hello,

Sorry for the slow response. I was trying a few more things to get this working on my end, but I am still struggling. What software do you typically use to get allele-specific copy number calls? I cannot seem to produce a format that integrates with ssm_base_converter.py. Would you be willing to share the set of software you used to create the VCF file that you input into ssm_base_converter.py?

Thanks!

ethanumn commented 6 months ago

Hi Elizabeth,

The additional VCF format fields were added to the example VCF file for compatibility with our example script; the file wasn't generated by another tool. There are many tools for calling allele-specific copy number, and the right one will depend on your data. Two popular tools are FACETS (https://github.com/mskcc/facets) and CNVkit (https://github.com/etal/cnvkit). The outputs from these tools can be used to generate the variant read probabilities. If your data consists only of diploid regions that you do not believe are impacted by CNAs, you can just set var_read_prob to 0.5 for all mutations in each sample.
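To make the diploid case concrete, here is a minimal sketch of how a variant read probability could be derived from copy number. It assumes the probability is simply the number of copies carrying the variant divided by the total number of copies at the locus in a cell bearing the mutation, which gives 0.5 for a heterozygous variant in a diploid region. The function name is hypothetical, and the sketch ignores tumor purity and subclonal CNAs. Copy number calls alone also do not say how many copies carry the variant, so that count is an input you would have to decide on for your data.

```python
def var_read_prob(variant_copies, total_copies):
    """Fraction of copies at a locus that carry the variant allele.

    Simplifying assumption: every cell bearing the mutation has the same
    copy number state, and purity is not modeled here.
    """
    if total_copies <= 0:
        raise ValueError("total_copies must be positive")
    if not 0 < variant_copies <= total_copies:
        raise ValueError("variant_copies must be in (0, total_copies]")
    return variant_copies / total_copies


# Diploid region, variant on one of two copies.
diploid = var_read_prob(1, 2)

# Three total copies, variant on one of them (an illustrative gain).
gained = var_read_prob(1, 3)

print(diploid, round(gained, 4))
```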