Tumour purity, ploidy, and copy number variation from whole-genome sequence data
To run Ploidetect, you will need:
Ideally you can use the Snakemake workflow available at this repository
The outputs from Ploidetect are five files:
models.txt describes a number of inferred models to explain the tumor purity and ploidy of your sample. Most of the columns are gobbeldygook and need to be cleaned up, but the important columns are the tp and ploidy. The rows are ordered in the order the algorithm has scored it, so the first plot/row corresponds to the highest score.
plots.pdf is a multi-page pdf, where each page corresponds to a line in models.txt. This is intended to assist in review of the data. While great pains have been made to ensure that Ploidetect is highly accurate in model selection, it (and no tool, for that matter) is not 100% accurate and you might want to take a peek to make sure the model (red peaks) seems to agree with the distribution of observed read depths (black peaks). In tumors with a high degree of heterogeneity, the model might skip some irregular peaks.
cna.txt contains the copy number informtation. There are a number of columns contained therein:
column | definition |
---|---|
chr | chromosome |
pos | bin START |
end | bin END |
segment | segment number |
corrected_depth | corrected BIN depth |
segment_depth | mean SEGMENT depth |
maf | beta allele frequency at bin |
CN | fractional copy number of segment |
state | state (corresponds to cna_plots.pdf) |
zygosity | HET or HOM to designate predicted zygosity |
cna_plots.pdf is a multi-page pdf containing plots of the copy number profile, with one page per chromosome.
Currently there isn't much use for meta.RDS except for debugging.
Either the case has no CNVs or the data is too noisy at that depth level. If you're sure there are CNVs in the data, modify line 25 of the Snakefile, specifically here:
python3 scripts/make_windows.py - 100000
- the 100000 is the threshold for normal depth to create variable-width bins. The size of the bins depends on the germline coverage, and Ploidetect was developed for 40x normal, 80x tumor genomes. If the ratio of tumor to normal is not 2:1, adjust the threshold accordingly. So for 40x/40x you might decide to use a larger threshold of 200000 to account for the tumor genome having more noise.
You may have a bad estimate of tumor purity being fed to the copy number caller. Review the plots.pdf and models.txt files and find a model which explains the distribution of read depth better. Slice the row of models.txt with the correct model (row n):
sed 'n!d' models.txt > new_models.txt
and rerun the CNV caller with the new estimates:
Rscript scripts/ploidetect_copynumber.R -i data/{sample_id}/segmented.RDS -m new_models.txt -p /path/to/cna_plots.pdf -o /path/to/cna.txt &> /path/to/logfile.txt"