morrislab / phylowgs

Application for inferring subclonal composition and evolution from whole-genome sequencing data.
GNU General Public License v3.0

Explanation of input data is unclear #102

Open luciansmith opened 5 years ago

luciansmith commented 5 years ago

I'm exploring working with PhyloWGS, and can't quite understand what's going on with the CNV inputs. The docs from the front page say:

cnv_data.txt: Note that if you are running without any CNVs, this file should be empty. You can create the empty file via the command touch cnv_data.txt.

cnv: identifier for each CNV. Identifiers must start at c0 and increment, so the first data row will have c0, the second row c1, and so forth.
a: number of reference reads covering the CNV.
d: total number of reads covering the CNV. This will be affected by factors such as total copy number at the locus, sequencing depth, and the size of the chromosomal region spanned by the CNV.
ssms: SSMs that overlap with this CNV. Each entry is a comma-separated triplet consisting of SSM ID, maternal copy number, and paternal copy number. These triplets are separated by semicolons.

The actual cnv_data.txt file has:

cnv a d ssms physical_cnvs
c0 66023,50883,62757,36056,58777 126755,100469,121941,71263,115417 s2,1,2;s4,0,1 chrom=1,start=1234,end=5678,major_cn=2,minor_cn=1,cell_prev=0.8;chrom=X,start=15,end=10000,major_cn=2,minor_cn=0,cell_prev=0.8;chrom=22,start=123,end=456,major_cn=1,minor_cn=0,cell_prev=0.8
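To make the layout concrete, here is a minimal Python sketch that splits the example row into its parts, following the column descriptions quoted from the docs. It assumes the columns are tab-separated and that `physical_cnvs` is a semicolon-separated list of `key=value` groups, as the example suggests. `parse_cnv_row` is a hypothetical helper, not a PhyloWGS function, and it says nothing about whether PhyloWGS itself reads the last column.

```python
# Sketch only: parse one data row of cnv_data.txt as laid out in the example.
# Assumptions: tab-separated columns; parse_cnv_row is a hypothetical helper.

def parse_cnv_row(line):
    fields = line.rstrip("\n").split("\t")
    cnv_id, a, d, ssms = fields[:4]
    physical = fields[4] if len(fields) > 4 else ""
    return {
        "cnv": cnv_id,
        # One comma-separated count per position in 'a' and 'd'.
        "a": [int(x) for x in a.split(",")],
        "d": [int(x) for x in d.split(",")],
        # Triplets: SSM ID, maternal copy number, paternal copy number.
        "ssms": [
            (ssm_id, int(maternal), int(paternal))
            for ssm_id, maternal, paternal in (
                t.split(",") for t in ssms.split(";") if t
            )
        ],
        # Each physical segment becomes a dict of string keys and values.
        "physical_cnvs": [
            dict(kv.split("=", 1) for kv in segment.split(","))
            for segment in physical.split(";") if segment
        ],
    }

example_row = "\t".join([
    "c0",
    "66023,50883,62757,36056,58777",
    "126755,100469,121941,71263,115417",
    "s2,1,2;s4,0,1",
    "chrom=1,start=1234,end=5678,major_cn=2,minor_cn=1,cell_prev=0.8;"
    "chrom=X,start=15,end=10000,major_cn=2,minor_cn=0,cell_prev=0.8;"
    "chrom=22,start=123,end=456,major_cn=1,minor_cn=0,cell_prev=0.8",
])

parsed = parse_cnv_row(example_row)
print(parsed["ssms"])
print([segment["chrom"] for segment in parsed["physical_cnvs"]])
```

Parsed this way, the row yields two SSM triplets but three physical segments, which is the mismatch raised further down.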

The most obvious difference is that the example file has an extra input column, 'physical_cnvs'. I see from the commit history that this is a renamed 'comment' section, but it's unclear if it's actually parsed anywhere, or how it's used.

Beyond that, I am confused about what the data might mean, or if they don't mean anything, and the file is just there as a format example, and not as a content example.

The 'a' and 'd' values for c0 are identical to those in the sample SSM input for s0, which has:

id gene a d mu_r mu_v
s0 a 66023,50883,62757,36056,58777 126755,100469,121941,71263,115417 0.999 0.5
s1 b 71532,50933,64719,52048,83311 135031,97485,120826,101788,157737 0.999 0.5
s2 c 93057,61406,72640,61648,54961 106716,83179,83799,86440,84607 0.999 0.5
s3 GOLGA7B 31531,40281,18089,34256,36473 72221,84463,37410,73050,76869 0.999 0.5
s4 KHNYN 94954,74636,67529,42328,59678 116386,106518,83271,66152,94465 0.999 0.5

s0 is, however, not included in the 'ssms' column. This is the first thing that makes me suspect that you were just going for format and not content; otherwise you would have made it similar to (or identical to?) s2 and/or s4.
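The two observations above are easy to confirm mechanically. This small Python sketch copies the values from the example files by hand and compares them; the variable names are purely illustrative.

```python
# Values copied by hand from the example SSM and CNV files quoted in this issue.
ssm_s0 = {
    "a": [66023, 50883, 62757, 36056, 58777],
    "d": [126755, 100469, 121941, 71263, 115417],
}
cnv_c0 = {
    "a": [66023, 50883, 62757, 36056, 58777],
    "d": [126755, 100469, 121941, 71263, 115417],
    "ssms": ["s2", "s4"],  # SSM IDs from the 'ssms' column of c0
}

# Are the read counts for c0 the same as those for s0?
same_counts = ssm_s0["a"] == cnv_c0["a"] and ssm_s0["d"] == cnv_c0["d"]
# Is s0 listed as overlapping c0?
s0_listed = "s0" in cnv_c0["ssms"]

print(same_counts, s0_listed)  # True False
```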

However, that brings up my next question, namely, that s2 and s4 are both listed as being in the same CNV (c0), but have different maternal/paternal copy number calls. I don't understand what this is supposed to mean--is this modeling a single event that happens at multiple sites, that results in disparate numbers of copies of the genome in different places? That would seem to be the case from the 'physical_cnvs' column, too, since it lists three different chromosomes (1, X, and 22) with changes. (Though this brings up another question: why are there three chromosomes, but only two SSMs? Wouldn't you normally have many SSMs per chromosomal alteration? I would expect at least one per segment...)

Then we get to the question of differences between samples. I assume that this is what the 'a' and 'd' columns are for: that there are 5 numbers each because there are 5 samples from the same patient. But then there is only one number for the maternal/paternal copy number. Is the assumption that if there's a gain at s2 (1,2), the samples with that gain will have commensurate increases, and those without will not? And are we assuming that no CNV ever got a double hit? I.e. a single loss in one lineage (0,1), and a subsequent further loss in a child lineage (0,0) that different samples show?
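For illustration, here is a short Python sketch of the per-sample reading described above. It rests on two assumptions that the docs quoted here do not confirm: that each comma-separated position in 'a' and 'd' corresponds to one sample, and that in the SSM file 'a' counts reference reads and 'd' total reads, as the CNV description says. Under those assumptions, 1 - a/d is the fraction of variant reads in each sample.

```python
# Sketch under stated assumptions: position i in 'a' and 'd' is sample i,
# 'a' is the reference-read count and 'd' the total read count.
ssm_counts = {
    "s2": ([93057, 61406, 72640, 61648, 54961],
           [106716, 83179, 83799, 86440, 84607]),
    "s4": ([94954, 74636, 67529, 42328, 59678],
           [116386, 106518, 83271, 66152, 94465]),
}

def variant_fractions(a, d):
    """Fraction of non-reference reads per sample: 1 - a/d."""
    return [1 - ref / total for ref, total in zip(a, d)]

fractions = {ssm: variant_fractions(a, d) for ssm, (a, d) in ssm_counts.items()}

for ssm, values in fractions.items():
    print(ssm, [round(v, 3) for v in values])
```

This gives five fractions per SSM, but only one maternal/paternal pair is available per SSM in the 'ssms' column, which is the asymmetry the questions above are about.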

But then if a single CNV involves multiple chromosomes, each of which may have a gain or loss, surely the 'a' and 'd' columns would be different for those different sections?

I guess all this is to say that in the end, I have no idea what the numbers in the example files are supposed to represent, or if they actually represent nothing more than formatting.

pawelqs commented 3 years ago

@luciansmith, did you find answers to your questions?

LiliyaBasharova commented 2 years ago

@luciansmith @pawel125 Do you have any solution?

luciansmith commented 2 years ago

No, I never found an answer, and ended up using a different tool entirely as a result.

luciansmith commented 2 years ago

Actually, wait, I lie--we did try other software, but in the end, had to do things by hand with a combination of python scripts and drawing phylogenies on a chalkboard. (In our system, double hits were very common, which none of the software we looked at allowed.)

https://onlinelibrary.wiley.com/doi/10.1111/eva.13125

(In particular, Appendix A describes our approach.)