nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com
Other
391 stars 73 forks source link

Empty vcfs when running medaka_haploid_variant command #491

Closed wojciech-galan closed 4 months ago

wojciech-galan commented 4 months ago

Command run:

medaka_haploid_variant -i fastq -r GCF_008632635.1_ASM863263v1_genomic.fna.gz

GCF_008632635.1_ASM863263v1_genomic.fna.gz is bgzipped Acinetobacter baumanii reference sequence from ncbi, fastq directory contains simmulated reads generated from this reference (I have double checked - the reads map to the reference with NCBI blastn)

result - vcf files that contain header, but no variants.

version: I've used docker image ontresearch/medaka:v1.11.3 (but tried several other tags and unofficial images with the same result), no GPU support

medaka_haploid_variant -i /keep/c9d4a8f0579887f2060ac1ea5673065a+386/fastq -r /var/spool/cwl/c9d4a8f0579887f2060ac1ea5673065a+386/GCF_008632635.1_ASM863263v1_genomic.fna.gz
Cannot import pyabpoa, some features may not be available.
Cannot import pyabpoa, some features may not be available.
Cannot import pyabpoa, some features may not be available.
[10:24:22 - MdlStrTF] Successfully removed temporary files from /tmp/tmp3l6dxx7m.
Cannot import pyabpoa, some features may not be available.
Cannot import pyabpoa, some features may not be available.
[10:24:24 - MdlStrTF] Successfully removed temporary files from /tmp/tmps2xs_fml.
Creating fai index file /var/spool/cwl/c9d4a8f0579887f2060ac1ea5673065a+386/GCF_008632635.1_ASM863263v1_genomic.fna.gz.fai
Creating mmi index file /var/spool/cwl/c9d4a8f0579887f2060ac1ea5673065a+386/GCF_008632635.1_ASM863263v1_genomic.fna.gz.map-ont.mmi
[M::mm_idx_gen::0.143*1.00] collected minimizers
[M::mm_idx_gen::0.203*0.99] sorted minimizers
[M::main::0.239*0.99] loaded/built the index for 2 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 2
[M::mm_idx_stat::0.248*0.99] distinct minimizers: 722157 (98.17% are singletons); average occurrences: 1.031; average spacing: 5.348; total length: 3980230
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -I 16G -x map-ont -d /var/spool/cwl/c9d4a8f0579887f2060ac1ea5673065a+386/GCF_008632635.1_ASM863263v1_genomic.fna.gz.map-ont.mmi /var/spool/cwl/c9d4a8f0579887f2060ac1ea5673065a+386/GCF_008632635.1_ASM863263v1_genomic.fna.gz
[M::main] Real time: 0.256 sec; CPU: 0.254 sec; Peak RSS: 0.041 GB
[M::main::0.046*0.95] loaded/built the index for 2 target sequence(s)
[M::mm_mapopt_update::0.057*0.96] mid_occ = 12
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 2
[M::mm_idx_stat::0.066*0.96] distinct minimizers: 722157 (98.17% are singletons); average occurrences: 1.031; average spacing: 5.348; total length: 3980230
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -x map-ont --secondary=no -L --MD -A 2 -B 4 -O 4,24 -E 2,1 -t 1 -a /var/spool/cwl/c9d4a8f0579887f2060ac1ea5673065a+386/GCF_008632635.1_ASM863263v1_genomic.fna.gz.map-ont.mmi /keep/c9d4a8f0579887f2060ac1ea5673065a+386/fastq
[M::main] Real time: 0.071 sec; CPU: 0.068 sec; Peak RSS: 0.028 GB
Cannot import pyabpoa, some features may not be available.
[10:24:34 - Predict] Setting tensorflow inter/intra-op threads to 1/1.
[10:24:34 - Predict] Processing region(s): NZ_CP043953.1:0-3972439 NZ_CP043954.1:0-7791
[10:24:34 - Predict] Using model: /home/epi2melabs/conda/lib/python3.8/site-packages/medaka/data/r1041_e82_400bps_sup_variant_v4.3.0_model.tar.gz.
[10:24:34 - BAMFile] Creating pool of 16 BAM file sets.
[10:24:34 - Predict] Processing 4 long region(s) with batching.
[10:24:35 - MdlStrTF] Model <keras.engine.sequential.Sequential object at 0x146f2b44ae50>
[10:24:35 - MdlStrTF] loading weights from /tmp/tmplkbiynkf/model/variables/variables (using expect partial)
[10:24:35 - Sampler] Initializing sampler for consensus of region NZ_CP043953.1:0-1000000.
[10:24:35 - Sampler] Initializing sampler for consensus of region NZ_CP043953.1:999000-1999000.
[10:24:35 - PWorker] Running inference for 4.0M draft bases.
[10:24:35 - Sampler] Took 0.14s to make features.
[10:24:35 - Sampler] Initializing sampler for consensus of region NZ_CP043953.1:1998000-2998000.
[10:24:35 - Sampler] Took 0.16s to make features.
[10:24:35 - Sampler] Initializing sampler for consensus of region NZ_CP043953.1:2997000-3972439.
[10:24:35 - Sampler] Took 0.20s to make features.
[10:24:35 - Sampler] Took 0.19s to make features.
[10:24:35 - PWorker] Processed 0 batches
[10:24:35 - PWorker] All done, 0 remainder regions.
[10:24:35 - Predict] Processing 1 short region(s).
[10:24:36 - MdlStrTF] Model <keras.engine.sequential.Sequential object at 0x146f2ae18ee0>
[10:24:36 - MdlStrTF] loading weights from /tmp/tmplkbiynkf/model/variables/variables (using expect partial)
[10:24:36 - Sampler] Initializing sampler for consensus of region NZ_CP043954.1:0-7791.
[10:24:36 - Sampler] Took 0.00s to make features.
[10:24:36 - PWorker] Running inference for 0.0M draft bases.
[10:24:36 - PWorker] Processed 0 batches
[10:24:36 - PWorker] All done, 0 remainder regions.
[10:24:36 - Predict] Finished processing all regions.
[10:24:36 - MdlStrTF] Successfully removed temporary files from /tmp/tmplkbiynkf.
Cannot import pyabpoa, some features may not be available.
[10:24:37 - DataIndx] Loaded 1/1 (100.00%) sample files.
Writing to /tmp/bcftools.lBEdxX
Merging 0 temporary files
Cleaning
Done
Cannot import pyabpoa, some features may not be available.
[10:24:38 - Annotate] Getting chrom coordinates

Thought that it may be caused by lacking pyabpoa package, so I created my own docker image that returned the same vcfs.

Additional context - I know that bam files are also empty:

-rw-r--r-- 1 crunchg dialout  418 Feb  8 10:24 calls_to_ref.bam
-rw-r--r-- 1 crunchg dialout   32 Feb  8 10:24 calls_to_ref.bam.bai
-rw-r--r-- 1 crunchg dialout 6.2K Feb  8 10:24 consensus_probs.hdf
-rw-r--r-- 1 crunchg dialout 1.4K Feb  8 10:24 medaka.annotated.vcf
-rw-r--r-- 1 crunchg dialout  302 Feb  8 10:24 medaka.sorted.vcf
-rw-r--r-- 1 crunchg dialout  250 Feb  8 10:24 medaka.vcf

medaka.annotated.vcf content:

##fileformat=VCFv4.1
##medaka_version=1.11.3
##FILTER=<ID=PASS,Description="All filters passed">
##medaka_version=1.11.3
##FORMAT=<ID=GT,Number=1,Type=String,Description="Medaka genotype.">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Medaka genotype quality score">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth of reads at position, calculated from read pileup, capped to ~8000.">
##INFO=<ID=DPS,Number=2,Type=Integer,Description="Depth of reads at position by strand (fwd, rev), calculated from read pileup, capped to ~8000 total.">
##INFO=<ID=DPSP,Number=1,Type=Integer,Description="Depth of reads spanning pos +-25. This is not capped as in the case of DP and DPS.">
##INFO=<ID=SR,Number=.,Type=Integer,Description="Depth of spanning reads by strand which best align to each allele (ref fwd, ref rev, alt1 fwd, alt1 rev, etc.). This is not capped as in the case of DP and DPS.">
##INFO=<ID=AR,Number=2,Type=Integer,Description="Depth of ambiguous spanning reads by strand which align equally well to all alleles (fwd, rev). This is not capped as in the case of DP and DPS.">
##INFO=<ID=SC,Number=.,Type=Integer,Description="Total alignment score to each allele of spanning reads by strand (ref fwd, ref rev, alt1 fwd, alt1 rev, etc.) aligned with parasail: match 5, mismatch -4, open 5, extend 3">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
cjw85 commented 4 months ago

The -i argument expects a single fastq file

wojciech-galan commented 4 months ago

@cjw85 Thanks for assistance! That was not clear for me when I was reading the documentation. Solved!