Datafiles and scripts for training and running of the CNV caller.
Each file is a single comressed numpy array with the following dtype in the only 'values' key:
dtype=[('chromosome', 'i1'), ('bins_loess', '<f2')]
and each array has length of 154,794, since the used bin size is 20,000 (hg19 reference). The "columns" are:
Thus, the loess corrected read count for the whole chromosome 1 can be obtained as:
import numpy as np
sample_npy = np.load('/path/to/sample/values.npz')['values']
read_count_chr1 = sum(sample_npy[sample_npy['chromosome']==0]['bins_loess'])
You need to have installed the following python packages to perform training (tested under python 3.5.2):
pip install numpy
pip install pandas
pip install dill
pip install sklearn
Use following steps to do the training:
python create_pca.py
This command will create a file pca_c20.pck
in the data
directory, where the first 20 principal components are stored.
python add_pca.py
This command will create a file XXXX_pca.npz
for every XXXX.npz
data file. These new files contain one new column ('bins_PCA', '<f2')
, which contain PCA normalized bin counts similarly as bins_loess
.
python train_means.py
This command will create means.npy
and means.tsv
files, which contain means and variances of bin counts (both loess and PCA normalized). Both files contain the same information.
More info about the training is in the article.
You need to have Rscript
installed (tested with R version 3.2.3):
sudo apt-get install r-base
and you also need the DNAcopy
R package (installation in the R shell):
source("http://bioconductor.org/biocLite.R")
biocLite("DNAcopy")
Furthermore, you need to have installed the following packages to run cnv_caller.py
:
pip install numpy
pip install pandas
pip install rpy2
Python version 3.7 is recommended, 3.6 also works well and for version 3.5 you will need rpy2 package of version 2.8.6 (pip install rpy2==2.8.6
).
Then you can run the CNV caller with the following:
python cnv_caller.py -m {trained_means} -b {bins_with_PCA} -o {output_dir} -k -f {fetal_fraction}
where -m
points to the trained means (.npy), -b
to the bins with PCA columns created with the add_pca.py
script, -o
points to the directory where output should be stored, -f
specifies the fetal fraction of the sample and -k
switches on filtering of bins. You can find more options by running python cnv_caller.py --help
.
The CNV calling command will create two files in the {output_dir}
directory:
{name}_cbs.txt
- cnv calling result - TSV table with columns:
start length end chromosome level effective_length significant color expected_ff
0 820000 12419 249200000 0 0.0017 10713 0 green 0.0004
1 20000 12149 243000000 1 -0.0097 11442 0 green 0.0025
2 100000 9886 197820000 2 0.0047 9596 0 green 0.0012
3 80000 9535 190780000 3 0.0047 9158 0 green 0.0012
4 40000 9032 180680000 4 -0.0006 8611 0 green 0.0002
...
{name}_gaps.txt
- filtered bins, the format follows a similar pattern as the cbs file, an example of first few lines:
chromosome start length end full_start
0 0 41 820000 0
0 1400000 1 1420000 1400000
0 1440000 1 1460000 1440000
0 1560000 2 1600000 1560000
...