google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.17k stars 716 forks source link

Writing realigned reads causes storage issues #370

Closed nikostr closed 3 years ago

nikostr commented 3 years ago

I'm attempting to write a bam file of the realigned reads, as I'm seeing ADs in the vcf that do not line up with what is present in the input bam file. I'm mainly concerned with two specific locations in the genome. The full genome is approx 11 Gbp, the input bam file is about 190 GB, and the drive I'm attempting to output to has more than 4 TB available. When using -emit_realigned_reads and using -realigner_diagnostics to provide an output directory, the log file tells me that deepvariant attempts to write more than seven million bam files, making it impossible to access the directory before crashing due to running out of disk space. Is there some way of getting around this? I'm thinking either a more storage efficient way of getting the entire bam file, or a way of getting the bam file of the specific regions I'm interested in.

I'm using deepvariant 1.0 from docker on Ubuntu 18.04. The data is short read Illumina data.

The command I'm using to run this:

seq 0 $((60-1)) |\
parallel --halt 2 --line-buffer /opt/deepvariant/bin/make_examples \
--mode calling --emit_realigned_reads --realigner_diagnostics=results/sample/deepvariant/realigned \
--ref data/genome/reference.fasta --reads results/sample/aligned/sample.bam \
--examples results/sample/deepvariant/tmp/make_examples/make_examples.tfrecord@60.gz  \
--sample_name sample --task {} 2> results/sample/deepvariant/tmp/make_examples.log 
akolesnikov commented 3 years ago

HI,

Seven million realigned BAMs seems to be a right number. For your purposes you may use --regions parameter that would restrict make_examples to a specific region. For example --regions chr20:1000-1500 would generate BAM files for 500 bases. In addition, if you use --regions flag you may want to remove a shardining from output examples files name: make_examples.tfrecord.gz instead of make_examples.tfrecord@60.gz. And you don't need to run it with parallel, you just need to run one instance. --task parameter is also not needed when the output is not sharded. Something like this:

/opt/deepvariant/bin/make_examples \
--mode calling --emit_realigned_reads --realigner_diagnostics=results/sample/deepvariant/realigned \
--ref data/genome/reference.fasta --reads results/sample/aligned/sample.bam \
--examples results/sample/deepvariant/tmp/make_examples/make_examples.tfrecord.gz  \
--sample_name sample --regions chr20:1000-1500 2> results/sample/deepvariant/tmp/make_examples.log
gunjanbaid commented 3 years ago

@nikostr I'll close this issue, but feel free to reopen if you have any other questions.