marcelTBI / CNV_data

Data for training of CNV caller

Data processing problem #6

Open Sondr11 opened 5 months ago

Sondr11 commented 5 months ago

Hi. Thank you for the CNV_data tool. I have read through it and understand that it is a very good tool for calling CNVs from NIPT data. However, I'm a bit confused about how to prepare the data before training the PCA and before running cnv_caller. I plan to use 1000 samples to train the PCA, but my input data is in fq.gz format. I see that the CNV_data tool uses modified npz files as input. Could you give me specific instructions on how to create an npz file and how to modify it according to the instructions in the README? Thank you so much!

marcelTBI commented 5 months ago

Hi, first of all, you need to map all the reads to a reference (we used hg19, but I think any other reference would work). Then bin the reads according to their read start into 20k bins (make sure that all are of equal length, with trailing bins of size 0). Optionally, use LOESS GC correction to modify the values and correct for GC bias; this is likely required to predict CNVs with precision. Finally, when you have the bins for each chromosome, store them in a .npz file as:

import numpy as np

# This snippet assumes that you have the read counts per bin stored in bins_loess
# and the chromosome number of each bin (counted from 0) in bins_chroms, for example:
# bins_loess = np.array([1.0, 1.0, 3.0, .... , 5.1, 4.0, 1.0])
# bins_chroms = np.array([0, 0, 0, ....., 23, 23, 23])

# Structured array with one record per bin: chromosome (int8) and corrected count (float16).
binarray = np.empty((len(bins_loess),), dtype=[('chromosome', 'i1'), ('bins_loess', 'f2')])
binarray['chromosome'] = bins_chroms
binarray['bins_loess'] = bins_loess
np.savez_compressed("path_for_output.npz", values=binarray)
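
To check that a file written this way has the expected layout, it can be loaded back and inspected. The following is only a small round-trip sketch: the toy values and the temporary output path are made up for illustration, and the only things taken from the snippet above are the `values` key and the two field names.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy data: three bins on chromosome 0 and two bins on chromosome 1.
bins_loess = np.array([1.0, 1.0, 3.0, 5.1, 4.0])
bins_chroms = np.array([0, 0, 0, 1, 1])

# Same structured layout as in the snippet above.
binarray = np.empty((len(bins_loess),), dtype=[('chromosome', 'i1'), ('bins_loess', 'f2')])
binarray['chromosome'] = bins_chroms
binarray['bins_loess'] = bins_loess

# Write to a temporary location (illustrative path only).
out_path = os.path.join(tempfile.mkdtemp(), "example.npz")
np.savez_compressed(out_path, values=binarray)

# Load the file back and look at the stored fields.
with np.load(out_path) as data:
    loaded = data["values"]

print(loaded.dtype.names)
print(loaded["chromosome"])
print(loaded["bins_loess"])
```

Note that `'f2'` is a 16-bit float, so the stored bin values are rounded to roughly three significant decimal digits.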

I hope this is sufficient for you, as the "vaguely" described steps (mapping, LOESS GC correction) are pretty flexible and can be done with a variety of tools. Binning is then just a simple grouping of reads according to their start positions.
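
The binning step can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact code used for CNV_data: it assumes that "20k bins" means bins 20,000 bp wide, that read start positions are 0-based and have already been extracted per chromosome (for example from a BAM file with a library such as pysam), and that chromosome lengths are known. The function name `bin_read_starts` and the toy inputs are hypothetical. GC correction and any padding with trailing zero-size bins are not handled here.

```python
import numpy as np

BIN_SIZE = 20_000  # assumption: "20k bins" means bins that are 20,000 bp wide


def bin_read_starts(read_starts, chrom_length, bin_size=BIN_SIZE):
    """Count reads per fixed-width bin from 0-based read start positions."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    bin_index = np.asarray(read_starts, dtype=np.int64) // bin_size
    # minlength keeps empty trailing bins; the slice drops starts past the chromosome end.
    return np.bincount(bin_index, minlength=n_bins)[:n_bins]


# Hypothetical toy input: read starts keyed by chromosome index, plus chromosome lengths.
read_starts_by_chrom = {0: [5, 19_999, 20_000, 45_000], 1: [100, 30_000]}
chrom_lengths = {0: 50_000, 1: 40_000}

counts, chroms = [], []
for chrom in sorted(read_starts_by_chrom):
    per_bin = bin_read_starts(read_starts_by_chrom[chrom], chrom_lengths[chrom])
    counts.append(per_bin)
    chroms.append(np.full(len(per_bin), chrom))

bin_counts = np.concatenate(counts)   # raw read counts per bin, all chromosomes
bins_chroms = np.concatenate(chroms)  # chromosome index of each bin

print(bin_counts)
print(bins_chroms)
```

The resulting `bin_counts` array (optionally after GC correction) and `bins_chroms` are per-bin arrays of the same length, which is the shape the saving snippet above expects.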

Sondr11 commented 5 months ago

Hi. I am using bowtie2 and samtools to create the BAM file. Which tools should I use next, and how? Could you guide me in detail through the tools you use, from the BAM file to creating the npz file? Thank you so much.