How to handle "cellular_prevalence" in cnvs.txt if missing data

ozill12 commented 6 years ago

Hi, I am trying to run phyloWGS with both ssm and cnv data, but the copy number tool I am using (sequenza) does not provide estimates of clonal fraction/cellular prevalence per segment. Would it be appropriate to use the estimated cellularity (fraction of tumor cells in the sample) as a constant/dummy value in this column of cnvs.txt when running the parser, or what do you recommend doing when cellular prevalence per segment is missing? Thanks, Oliver

juvejones commented 6 years ago

I have the same issue here. As far as I understand, Sequenza doesn't provide segment-specific cellular prevalence. Using purity as a substitute for prevalence basically assumes that all segments are equally clonal, and I am not sure how sound that assumption is.

jwintersinger commented 6 years ago

Hi @ozill12,

Sequenza doesn't seem to do subclonal copy number calling, so just providing the cellularity estimated by Sequenza should be okay. We won't be able to correct for subclonal copy-number changes' effects on mutation frequency, but that's still better than running without copy-number calls.
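For anyone wanting to script this, here is a minimal sketch of filling the column with one constant per sample. It is only an illustration: the column names other than `cellular_prevalence`, the tab-separated layout, the example segments, and the value 0.72 are all assumptions, so check them against the input format your version of the parser actually expects.

```python
import csv
import io


def fill_cellular_prevalence(rows, value):
    """Return copies of the CNV rows with cellular_prevalence set to `value`."""
    if not 0.0 < value <= 1.0:
        raise ValueError("cellular prevalence must be in (0, 1]")
    return [dict(row, cellular_prevalence=value) for row in rows]


# Hypothetical segments; the column names here are assumed, not taken from the parser.
example = (
    "chromosome\tstart\tend\tmajor_cn\tminor_cn\n"
    "1\t10000\t500000\t2\t1\n"
    "2\t20000\t900000\t1\t0\n"
)
rows = list(csv.DictReader(io.StringIO(example), delimiter="\t"))

# 0.72 stands in for the sample-level cellularity reported by Sequenza.
filled = fill_cellular_prevalence(rows, 0.72)

out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=list(filled[0].keys()), delimiter="\t", lineterminator="\n"
)
writer.writeheader()
writer.writerows(filled)
print(out.getvalue())
```

Passing 1.0 instead of the cellularity would treat every segment as present in all cells of the sample.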

Please let us know if we can help with anything else!

ozill12 commented 6 years ago

Hi @jwintersinger -

Thanks very much for your reply! As an update, I have run phyloWGS on a cohort of tumor/normal pairs (n=200) using Strelka for SSM calling and Sequenza for allele-specific copy number calling, and I then passed the SSM + CNV data to phyloWGS using two different approaches to the cellular_prevalence variable in the cnvs.txt files.

approach1_nocolor.pdf

Approach 1 is as described in my initial note, where cellular_prevalence = tumor_cellularity from Sequenza. In this case, I noticed a strong correlation between the maximum phi per sample and the cellular_prevalence I assigned to that sample in the CNV data (see figure below). In many cases (124/200), maximum phi was very close to the CNV "cellular prevalence", suggesting that the CNVs ("prior CNV phi") may be weighted disproportionately to the SSMs. Is that behavior consistent with your expectations? What worries me is the fact that for many samples, phyloWGS has determined that there are no fully clonal variants (e.g., the samples with maximum phi < 0.6), whereas I was expecting there to be at least a few truncal SSMs in the vast majority of these samples (based on some biological assumptions, not on an actual prior analysis of these data). In some cases, the maximum phi is very low, which seems problematic.

approach1.pdf

For Approach 2, I assigned the CNV cellular_prevalence a constant value of 1 across all samples, which in effect assumes that all CNVs are fully clonal in each case. Comparing maximum phi for each sample in this run with its value in the Approach 1 run, you can see that for most samples (138/200), maximum phi shifts to nearly 1, while some samples at top left (blue dots) show no change and others show intermediate effects on maximum phi. I still have to look more closely at whether the tree topologies have shifted between Approach 1 and 2, but at first glance some samples do appear to show changes in their number of nodes (nChildren) and their relative placement.

approach2.pdf

Dot coloring is artificial, to allow dot tracing from Approach 1 to 2, and is based on |cellularity - max_phi| being > 0.05 (blue) or < 0.05 (green) in Approach 1.
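The coloring rule can be sketched as follows. The sample names and values are made up for illustration, and assigning a difference of exactly 0.05 to blue is my assumption, since the rule above only covers the strict inequalities.

```python
def color_for_sample(cellularity, max_phi, tol=0.05):
    """Green if max phi is within `tol` of the assigned cellularity, else blue."""
    # A difference of exactly `tol` falls to blue here (an assumption).
    return "green" if abs(cellularity - max_phi) < tol else "blue"


# Hypothetical (cellularity, max_phi) pairs from an Approach 1 run.
samples = {
    "S1": (0.80, 0.79),
    "S2": (0.55, 0.95),
    "S3": (0.40, 0.41),
}

colors = {name: color_for_sample(c, phi) for name, (c, phi) in samples.items()}

# Number of samples whose max phi tracks the assigned cellularity.
n_close = sum(1 for color in colors.values() if color == "green")
print(colors, n_close)
```

Counting the green samples this way is how a tally like the 124/200 figure for Approach 1 could be reproduced from the per-sample values.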

I would be interested to hear your thoughts. I'm working on getting TitanCNA up and running to use instead of Sequenza, but the Titan upstream workflow is not as easy to use and the documentation is unnecessarily convoluted.

I am a fan of the Morris lab's work and think the Deshwar et al. 2015 Genome Biology paper is brilliant. It's great that you have made your tools open and easy to use, so thanks again for your help. Best regards,

Oliver
