hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0

Documentation question: purple segment tsv column name meanings #321

Closed jamesdalg closed 2 years ago

jamesdalg commented 2 years ago

What is the exact meaning of each of the following column names in purple.segment.tsv? Some are obvious, like minor/major allele copy number, but others, like the ratioSupport column, I would like to know more about so that I can use the allele-specific copy number in an analysis I'm doing.

chromosome start end germlineStatus bafCount observedBAF minorAlleleCopyNumber minorAlleleCopyNumberDeviation observedTumorRatio observedNormalRatio unnormalisedObservedNormalRatio majorAlleleCopyNumber majorAlleleCopyNumberDeviation deviationPenalty tumorCopyNumber fittedTumorCopyNumber fittedBAF refNormalisedCopyNumber ratioSupport support depthWindowCount tumorBAF gcContent eventPenalty minStart maxStart

p-priestley commented 2 years ago

Hi James - the segment file shows the initial segmentation of the genome (combining GRIDSS, COBALT and AMBER), which is used to do the fitting. After this, the data is smoothed significantly to produce our final output, but we also output this file so that we can occasionally debug parts of the algorithm (e.g. a poor fit).

I strongly recommend using the somatic CNV file, which is explained fully here: https://github.com/hartwigmedical/hmftools/tree/master/purple#copy-number-file

Is there something missing from that file that you need for the analysis?
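
For anyone reading this thread who only needs the allele-specific copy number, here is a minimal Python sketch of pulling those columns out of a PURPLE tab-separated file. It relies only on the column names listed in the question (chromosome, start, end, minorAlleleCopyNumber, majorAlleleCopyNumber). The helper name `read_allele_copy_numbers`, the example rows, and the 0.5 cut-off are all made up for illustration and are not part of PURPLE. It should work equally on the somatic CNV file, assuming that file carries columns with the same names; check the linked documentation before relying on that.

```python
import csv
import io


def read_allele_copy_numbers(handle):
    """Yield (chromosome, start, end, minor, major) tuples from a TSV handle.

    Hypothetical helper: expects a header row containing the column names
    shown in the question above.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield (
            row["chromosome"],
            int(row["start"]),
            int(row["end"]),
            float(row["minorAlleleCopyNumber"]),
            float(row["majorAlleleCopyNumber"]),
        )


# Made-up rows standing in for a real file; with real data, pass
# open("your_sample.tsv", newline="") instead of the StringIO object.
example = io.StringIO(
    "chromosome\tstart\tend\tminorAlleleCopyNumber\tmajorAlleleCopyNumber\n"
    "1\t1\t1000\t0.0\t2.0\n"
    "1\t1001\t5000\t1.0\t1.0\n"
)

segments = list(read_allele_copy_numbers(example))

# Illustrative filter only: regions whose minor allele copy number is below
# an arbitrary 0.5 cut-off (an assumption, not a PURPLE threshold).
low_minor = [s for s in segments if s[3] < 0.5]
```

Using `csv.DictReader` keeps the lookup by column name, so the sketch does not depend on the column order in the file.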