A tool for large-scale CNV and CNA detection.
Authors: G. Demidov, S. Ossowski.
Thanks to the contributors!
Any issues should be reported to: german dot demidov at medizin dot uni-tuebingen dot de. The presentation (around 60 slides with the short description of ClinCNV and results) is available here.
This software is distributed under MIT licence.
ClinCNV is a part of MegSAP pipeline.
Where can I find Bioconda/Docker/Singularity/etc containerized version of ClinCNV? - you can find it nowhere, we run it via MegSAP pipeline and several other large medical genetic centers too. If you want to create a containerized version, I will say a big "thank you" and add a link to your repo here.
ClinCNV detects CNVs in germline and somatic context in NGS data (targeted and whole-genome). We work in cohorts, so it makes sense to try ClinCNV
if you have more than 10 samples (recommended amount - 40 since we estimate variances from the data). By "cohort" we mean samples sequenced with the same enrichment kit with approximately the same depth (ie 1x WGS and 30x WGS better be analysed in separate runs of ClinCNV). Of course it is better if your samples were sequenced within the same sequencing facility. ClinCNV
does not work with very small panels (tens of genes).
Currently we work with human genomes hg19
and hg38
only. For hg38
you need to turn on --hg38
flag, by default it is hg19! For mouse genome or any other diploid organism you have to replace cytobands.txt with the corresponding file. ClinCNV can work with small panels (hundreds of regions), but GC-correction can not be performed accurately for samples sequenced with such panels.
NOTE: Folder PCAWG
was used for CNVs detection in PanCancer Analysis of Whole Genomes cohort and is research only version. It is located here for historical reasons. Feel free to remove it.
bioRxiv for somatic CNA analysis: https://www.biorxiv.org/content/10.1101/837971v1 (calling of copy-number alterations in normal-tumor pairs).
For germline CNVs: https://www.biorxiv.org/content/10.1101/2022.06.10.495642v1 (calling of copy-number variants in germline).
For common CNPs: https://pubmed.ncbi.nlm.nih.gov/35597955/ (this is an advanced analysis and the tool is not working strictly out of the box, better contact me first)
The reason for the very high computation time is, most probably, writing out the visualization tracks for IGV. You can turn it off via --visulizationIGV
flag, if you are interested only in calls, not in the visual inspection.
We expect you to install ClinCNV on Linux or MacOS platforms. Copy all the files using git clone https://github.com/imgag/ClinCNV.git
. We expect you to install R
(as new version as possible, we used ClinCNV with R 3.2.3
, but you may experience problems installing libraries using the old version) and the following libraries:
install.packages("optparse")
install.packages("robustbase")
install.packages("MASS")
install.packages("data.table")
install.packages("foreach")
install.packages("doParallel")
install.packages("mclust")
install.packages("R.utils")
install.packages("RColorBrewer")
install.packages("party")
install.packages("dbscan")
install.packages("umap")
ClinCNV works faster with Rcpp
package installed, however, if you experience any problems with this package, you may run ClinCNV without it.
install.packages("Rcpp")
Test run:
fold=/folder/with/the/cloned/ClinCNV
mkdir $fold"/results"
Rscript $fold"/clinCNV.R" --bed $fold"/samples/bed_file.bed" --normal $fold"/samples/coverages_normal.cov" --out $fold"/results"
Attention: better use absolute paths.
If ClinCNV
fails with the test run, set chmod 755
to the output folder.
You will find result in $fold/result/
folder.
For now we do not provide our own tool for pre-processing of the data. We recommend you to use ngs-bits (https://github.com/imgag/ngs-bits), however, as soon as your data match the format expected by ClinCNV you may proceed with any tool of choice (eg, samtools).
You should also have .bed
file with the coordinates of targeted regions and reference genome in .fasta
format for annotation of .bed
file with ngs-bits.
More informative and updated manuals are located in the doc
folder. The order of reading: install.md, preliminary_steps.md, and then you may choose the type of analysis that is interesting for you - germline, somatic, or trios.
You can try to start ClinCNV as follows (after the preparation of files .cov
, .bed
):
Rscript clinCNV.R --normal normal.cov --out outputFolder --bed annotatedBedFile.bed --folderWithScript $PWD
for germline samples and for somatic as
Rscript clinCNV.R --normal normal.cov --tumor tumor.cov --out outputFolder --pair fileWithPairs.txt --bed annotatedBedFile.bed --folderWithScript $PWD
.cov
is a matrix of coverages (merged from many samples). .bed
file has to be annotated with GC-content (from 0 to 1, should be in 4th column).
ClinCNV now can work in 3 different contexts: germline calling, somatic calling (normal / tumor pairs) and trios. To run ClinCNV in germline context, you should minimally specify --bed
and --normal
. For trios, parameter --triosFile
has to be specified (sample names for a child, his/hers mother and father, divided by comma). For somatic --tumor
and --pairs
has to be specified (pairs should contain tumor and normal sample names, divided by comma). It is highly recommended to use B-allele frequencies in somatic context. A folder with .tsv
files has to be specified to switch to B-allele frequencies mode. Files need to have same sample names as column names in --normal
and --tumor
files. If ClinCNV does not find B-allele frequencies for a particular sample, it tries to detect CNVs using read depths only.
Tumor-only calling is under tests now. A flag --onlyTumor
has to be specified. Better be used with baf
files and off-target reads. 20 or more normal samples are recommended as a background. An example of tumor only running command (several tumors are already mixed with normals):
Rscript /path/to/clinCNV_test/ClinCNV/clinCNV.R --normal /path/to/merged_normal_coverage.cov --normalOfftarget /path/to/merged_offtarget_coverage.cov --bed /path/to/ontarget_gc_annotated_bed.bed --bedOfftarget /path/to/offtarget_gc_annotated_bed.bed --onlyTumor --bafFolder /path/to/folder_with_baf_files --folderWithScript /path/to/clinCNV_test/ClinCNV --out /path/to/results --normalSample DX190627_01 --minimumPurity 30 --purityStep 5 --scoreS 150
Current version of ClinCNV works with 3 possible types of data: on-target reads, off-target reads, B-allele frequencies. For WGS obviously we work with only 2 of them (on-target and B-allele). For shallow WGS only "on-target" coverage is informative.
To perform operations with these types of data, we specify several formats of files: annotated .bed
file, .cov
file, .tsv
file with information about BAF.
We expect .bed
file annotated with GC-content and (optionally) intersecting genes. Header should be removed or commented with # symbol.
chrI[char, "chr" is a prefix] \t startCoord[int] \t endCoord[int] \t gcContent[real, from 0 to 1] \t genesName[character comma delimited] \n
Example of .bed
(here and below we provide only one line, assuming that there are as many as needed):
chr1 12171 12245 0.4595
or, annotated with genes,
chr1 12171 12245 0.4595 DDX11L1
We expect AVERAGE coverage depths of samples to be written as (starting from header):
chr \t start \t end \t sampleName1 \t sampleName2 \n
chrI[char, "chr" is a prefix] \t startCoord[int] \t endCoord[int] \t averageCoverageDepth1[real] \t averageCoverageDepth2[real] \n
Example:
chr start end Sam1 Sam2
chr1 11166636 11166864 2374.32 1224.54
Note1: you may create such files for your samples separately and use the mergeFilesFromFolder.R
script to merge them together.
Note2: ngs-bits
calculates average coverage. If you use other tool and if you suffer a lot with calculating average coverage, but you have the raw coverage depths, you can change the function
gc_and_sample_size_normalise <- function(info, coverages, averageCoverage=T, allowedChroms=NULL)
to
gc_and_sample_size_normalise <- function(info, coverages, averageCoverage=F, allowedChroms=NULL)
in the file generalHelpers.R
.
Note3: on-target and off-target reads should be pre-processed in .cov
formats. If you do not have off-target reads for some samples - don't worry, ClinCNV will work with available data only, including off-target only for samples that have this data.
Note4: Please be sure that you do not round your coverage of shallow-sequenced samples too much (e.g., the average coverage of the region is 0.0005, and you round it to 0.00).
Note5: Names of columns (sample names) are meaningful and should match between normal.cov
, tumor.cov
, pairs.txt
files for somatic framework.
.tsv
)Without header:
chrI[char, "chr" is a prefix] \t startCoord[int] \t endCoord[int] \t uniqueID[char, can be used as chrom + coord merged] \t frequencyOfAltAllele[real from 0 to 1] \t coverageDepthOfThisPosition[int] \n
Example:
chr1 2488153 2488153 chr1_2488153 0.4913 289
Note1: despite the fact we have start and end coordinates, B-allele frequency are expected to be calculated only from SNVs, not from indels.
Note2: if you do not have B-allele frequencies for some samples - don't worry, ClinCNV will work with available data only.
We require presence of both normal and tumor input files to work in somatic context. To explain ClinCNV the connection between normals and tumors, you need to prepare file with the following format:
TumorSampleFromPatient1,NormalSampleFromPatient1
TumorSampleFromPatient2,NormalSampleFromPatient2
Please take care - sample names such as "TumorSampleFromPatient1" should match column name in .cov
files and file name in BAF TumorSampleFromPatient1.tsv
files (if you want to use B-allele frequencies for this sample). The file can have any extension, we use "pairs.txt" to name such files.
You will need a .bed
file with start and end of each chromsome. For hg19 lengths of chromosomes can be found at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes , just add 0s as a second column.
Segmentation of whole genome with ngs-bits:
BedChunk -in hg19.bed -n $sizeOfBin -out "preparedBedHg19.bin"$sizeOfBin".bed"
where $sizeOfBin means pre-specified size of the segment (see below).
Assume you have a .bed
file $bedFile. This is how you create offtarget .bed:
# Determine offtarget with offset of 400 to the left and to the right of the targeted region
BedExtend -in $bedFile -n 400 | BedMerge -out "extended_"$bedFile
# hg19.bed here is from the previous paragraph
BedSubtract -in hg19.bed -in2 "extended_"$bedFile -out offtarget.bed
# Chunk offtarget into pieces of 50kbps
BedChunk -in offtarget.bed -n 50000 -out offtarget_chunks.bed
# Remove regions <25k
BedShrink -in offtarget_chunks.bed -n 12500 | BedExtend -n 12500 -out "offtarget_chunks_"$bedFile
To use ClinCNV in WGS and off-target contexts you need to choose a size of the window you want to segment your genome with.
From our experience, if you have shallow coverage (0.5x on average) - the window size should be 25kb at least, for 30x 1kb windows are totally OK and you can probably go to smaller window sizes. But if you have something intermediate, the criteria to choose the window size of off-target reads is: 1) not a lot of zero coverage regions (then the distributions will become zero-inflated and ClinCNV's results will be inaccurate), 2) approximate normality of coverage. To check the 2nd assumption, we recommend you to choose your desired window size, calculate coverage for ~30 samples, choose like 10-20 regions on random from the autosomes and built a density plot (plot(density(coverages)
) in R
). If you will see something that does not even remind you a bell shape, but has a large tail - you should increase the window size.
BedAnnotateGC -in $bedFile -out "gcAnnotated."$bedFile
BedAnnotateGenes -in "gcAnnotated."$bedFile -out "annotated."$bedFile
rm "gcAnnotated.""$bedFile
With ngs-bits:
BedCoverage -bam $bamPath -in $bedPath -min_mapq 3 -out $sampleName".cov"
With samtools:
samtools bedcov $bedFilePath -Q 3 $bamPath > $sampleName".cov"
If you have a VCF file, you can use BAF extractor script (thanks to Timofei for refactoring). I can not guarantee if it will work fine with your VCF (if your VCF contains the required info), report me if it does not.
Using ngs-bits:
VariantAnnotateFrequency -in $nameOfNormalSample".vcf" -bam $nameOfSample".bam" -out $nameOfSample".tsv" -depth
grep "^[^#]" $nameOfSample".tsv" | nawk -F'\t' '(length($4) == 1) && (length($5) == 1) {print $1 "\t" $2 "\t" $3 "\t" $1 "_" $2 "\t" $(NF-1) "\t" $NF}' > $nameOfSample".tsv"
Note1: ClinCNV uses BAF files only in somatic context now so you have to perform this procedure for both somatic and normal samples.
--normal
- file with normal samples' coverages depths in .cov
format. On-target or WGS coverage.
--tumor
- file with tumor samples' coverages depths in .cov
format. On-target or WGS coverage.
--normalOfftarget
- file with normal samples' coverages depths in .cov
format. Off-target coverage.
--tumorOfftarget
- file with tumor samples' coverages depths in .cov
format. Off-target coverage.
--out
- output folder, default: ./result/
--pair
- comma separated table. Each row = "tumorSampleName,normalSampleName"
--bed
- .bed
file with coordinates of on-target regions, GC-annotated. Gene anotations in the last column can be used in output.
--bedOfftarget
- .bed
file with coordinates of off-target regions, GC-annotated. Gene anotations in the last column can be used in output.
--colNum
- number of column in .cov
file where coverage depth columns start. Normally equal to 4.
--folderWithScript
- obsolete. Basically it was $PWD
--reanalyseCohort
- if equal to "T", all the samples are analysed. If equal to "F", only samples which subfolders do not exist in output folder will be analysed
--scoreG
- germline CNV score threshold (loglikelihood difference). For additional information: https://en.wikipedia.org/wiki/Bayes_factor#Interpretation . As a guideline: 30 is usually a sensitive, but not specific option, 50 is balanced, 100 is quite strict and leads to high specificity, but (possibly) low sensitivity.
--lengthG
- minimum length of germline CNV (number of markers = on- and off-target regions). Depends on your desired purposes. Recomended to keep up not smaller than 2. Due to historical reasons, actual number of data points will be bigger by one (e.g., if you put --lengthG 0
, the actual number of datapoints will be 1).
--scoreS
- somatic score threshold. Since CNAs (copy-number changes in caner) are usually long and quite significant, but FFPE introduce huge noise into the tumor sequencing data, we do not recommend to keep it lower than 50. As a rule of thumb, we use threshold of 100 for panel of ~500 cancer genes and 200 for WES samples.
--lengthS
- minimum length of somatic CNA (# of markers, for WES it can be ~5-10 since CNAs are usually long).
--maxNumGermCNVs
- maximum number of allowed germline CNVs. If sample has too many variants, thresholds will be automatically increased and the sample will be reanalysed several times.
--maxNumSomCNAs
- maximum number of allowed somatic CNAs.
--maxNumIter
- maximum number of iterations "increase threshold - detect CNVs - check if the number of detected variants less than specified above numbers"
--bafFolder
- folder with .tsv
files containing information about B-allele frequencies in tumor and normal samples. Note: both tumor and normal sample have to be presented if you want to use it as a predictor.
--normalSample
- name of the sample (if only one germline sample is expected to be calculated). All the parameters are estimated for the whole cohort. Has to be presented in a header of the file with normal samples. Otherwise, the tool will fail with assert message.
--tumorSample
- name of the tumor sample (if only one sample is expected to be calculated). Has to be presented in 1) file pairs.txt in pair with the sample specified by --normalSample option, 2) in a header of the file with tumor samples. Otherwise, the tool will fail with assert message.
--numberOfThreads
- maximum number of threads allowed to use by the tool.
ClinCNV is not published for now so it is not possible to properly cite the paper. However, you can:
1) cite it as an unpublished tool in the text of your paper;
2) ask us to help you with the analysis for the co-authorship or acknowledgement.
To cite the preprints - the links to bioRxiv are above, please do not mix somatic and germline preprints!