morrislab / phylowgs

Application for inferring subclonal composition and evolution from whole-genome sequencing data.
GNU General Public License v3.0

important SSMs not clustering? #51

Open dtm2117 opened 7 years ago

dtm2117 commented 7 years ago

Sometimes the genes I most want to place within the clonal landscape are not incorporated into the clustering.

The mutation is in the VCF file that I pass to the parser, but it is not written out as an SSM.

How does the parser decide which mutations from the VCF are included in the clustering? And is there a way to force certain important mutations to be clustered?

Specifically, I'm working with multiple time point data and cannot locate an NRAS mutation, which is present in all timepoints.

Thanks!

quaidmorris commented 7 years ago

Is the NRAS mutation in a CNA region? That's probably why you can't find it.

Quaid Morris, PhD Associate Professor, The Donnelly Centre Departments of Molecular Genetics and Computer Science 160 College St, Rm 616 Toronto ON, M5S 3E1 Canada http://morrislab.med.utoronto.ca cell: (416) 220 5796

dtm2117 commented 7 years ago

Even if it were within a copy-number-altered region, wouldn't it show up in the SSM_data.txt file generated by the parser (create_inputs)? The CNV_data.txt file, which is also generated, lists the specific SSMs within each CNA, but they are all found in SSM_data.txt as well.

I understand that if the SSM were in a CN-altered region, I'd have to extract the clone data for that SSM from a different part of the .json output. But I'm not finding the mutation at all in the SSM_data.txt file, which is generated before running evolve.py.

I do see that you can give a list of mutations to prioritize when the inputs are generated. Is this the only way to ensure that our genes of interest are clustered?
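For anyone wanting to confirm whether a given variant reached the SSM file, a minimal Python sketch follows. It assumes the file is tab-separated with the header `id gene a d mu_r mu_v` and that the `gene` column holds the locus as `<chrom>_<pos>`; both are assumptions about the parser output that should be checked against your own file. The function name `find_ssm` and the toy rows are hypothetical.

```python
import csv
import io


def find_ssm(ssm_file, chrom, pos):
    """Return rows of an ssm_data.txt-style file whose locus is chrom_pos.

    Assumption: the "gene" column stores the locus as "<chrom>_<pos>".
    """
    target = f"{chrom}_{pos}"
    reader = csv.DictReader(ssm_file, delimiter="\t")
    return [row for row in reader if row["gene"] == target]


# Toy stand-in for ssm_data.txt; every value here is made up.
toy_ssm_data = (
    "id\tgene\ta\td\tmu_r\tmu_v\n"
    "s0\t1_1000\t40,38\t80,75\t0.999\t0.499\n"
    "s1\t2_2000\t60,55\t90,88\t0.999\t0.499\n"
)

hits = find_ssm(io.StringIO(toy_ssm_data), "1", 1000)
misses = find_ssm(io.StringIO(toy_ssm_data), "3", 3000)
print(hits[0]["id"] if hits else "not found")
print(misses[0]["id"] if misses else "not found")
```

With a real file, pass `open("ssm_data.txt", newline="")` instead of the `io.StringIO` object. If your VCF uses a `chr` prefix on chromosome names, the lookup key may need the same prefix.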

dtm2117 commented 7 years ago

After trying everything again with a prioritySSM file listing all known cancer genes (including NRAS), the mutation is still not written to the SSM_Data file that the parser produces as input for evolve.py. Any ideas why? (It is not in a CNV region.)

quaidmorris commented 7 years ago

How many SSMs are you using? Are you using subsampling? The priority SSM list is for when you are subsampling SSMs.

The other possibility is that the SSM is in a region with multiple subclonal CNAs.

Q
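To illustrate the subsampling point, here is a small Python sketch of priority-aware subsampling. It is not the parser's actual code; the function `subsample_with_priority`, the locus strings, and the seed are all hypothetical. The idea it shows is that a priority list can only protect SSMs that are still candidates when subsampling happens, so a variant dropped by an earlier filter would not be brought back by it.

```python
import random


def subsample_with_priority(candidates, priority, sample_size, seed=0):
    """Keep every priority SSM that is a candidate, then fill randomly.

    If more priority SSMs are present than sample_size, all of them are
    kept and no random fill is added.
    """
    rng = random.Random(seed)
    wanted = set(priority)
    kept = [s for s in candidates if s in wanted]
    rest = [s for s in candidates if s not in wanted]
    n_fill = max(0, sample_size - len(kept))
    return kept + rng.sample(rest, min(n_fill, len(rest)))


# Hypothetical SSMs that survived earlier filtering.
candidates = [f"1_{p}" for p in range(100, 110)]
# "1_999" stands for a variant that was filtered out before this step.
priority = ["1_105", "1_999"]

chosen = subsample_with_priority(candidates, priority, sample_size=4)
print(chosen)
```

In this sketch `1_105` is always retained, while `1_999` never appears because it was not among the candidates to begin with.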

dtm2117 commented 7 years ago

It seems that the SSMs are within regions of multiple subclonal CNVs. This is multiple-timepoint data: each timepoint has a CNV containing the NRAS mutation, with some populations of cells remaining diploid and others amplified, but the magnitude of the amplification varies across timepoints.

I assume this is why the SSMs are filtered out?

If I were to run this using the --regions all parameter, would that compromise the accuracy of the results?

A final question: when passing a VCF to the parser, is it best to keep as many variants as possible (as long as they pass QC, of course), or should we limit the mutations to deleterious or other "important" classes? I assumed that having more variants, even intronic and intergenic (IGR) ones, would be more informative for the clustering.

great tool!

dtm2117 commented 7 years ago

Would overlapping CNAs across timepoints cause an SSM to be dropped?

For example, in three-timepoint data, suppose a region of chr 1 has a CNA in timepoint 1, a CNA in timepoint 2, and a CNA in timepoint 3, but only a single CNA in that region at each timepoint. That is, each timepoint has some diploid cells and some altered cells, but not necessarily the same exact region or copy number alteration. Would this cause the SSM in that region to be filtered out?
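The scenario can be tabulated with a short Python sketch that counts, per timepoint, how many subclonal CNA segments cover a given position. The `Segment` type, its `subclonal` flag, and all coordinates are hypothetical, and the sketch does not reproduce whatever filtering the parser applies; it only makes the "one CNA per timepoint, different boundaries" situation concrete.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int
    end: int
    major_cn: int
    minor_cn: int
    subclonal: bool  # hypothetical flag: True if only some cells carry the CNA


def overlapping_subclonal(segments, chrom, pos):
    """Return the subclonal segments on chrom that contain pos (inclusive)."""
    return [
        s for s in segments
        if s.subclonal and s.chrom == chrom and s.start <= pos <= s.end
    ]


# One subclonal CNA covers the locus at each timepoint, with different
# boundaries and copy numbers. All values are made up.
timepoints = {
    "t1": [Segment("1", 1_000, 9_000, 2, 1, True)],
    "t2": [Segment("1", 2_000, 8_000, 3, 1, True)],
    "t3": [
        Segment("1", 500, 9_500, 4, 1, True),
        Segment("2", 100, 900, 1, 0, True),  # different chromosome, ignored
    ],
}

pos = 5_000
counts = {
    tp: len(overlapping_subclonal(segs, "1", pos))
    for tp, segs in timepoints.items()
}
print(counts)
```

Here each timepoint reports exactly one overlapping subclonal segment, even though the three segments differ in extent and copy number.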

I thought the issue with "subclonal" CNAs was that their temporal order could not be determined, making it impossible to determine the appropriate VAF for SSMs within that region. In the context of multi-timepoint data, however, the temporality is known.

I'm really trying to understand why the SSMs I want to visualize, which are present in the SSM input, are not being clustered.

Thanks again.

marcRDM commented 7 years ago

Hi there,

@dtm2117 Any insight on this issue? I am struggling too. I am very happy with the trees that PhyloWGS produced and I want to move forward by annotating those trees with important driving events that were manually curated by my colleagues.

As you pointed out, the priority.txt file does not get them included when you subsample, and even running the parser on the entire Strelka VCF (up to 50,000 variants), which contains them all, does not include them in the ssm_data.txt created by create_phylowgs_inputs.

@quaidmorris I really love your work and I enjoy working with PhyloWGS. It is really frustrating to be stuck at this point, so close to producing interpretable trees.

Thank you very much in advance, Marc