jts / nanopolish

Signal-level algorithms for MinION data
MIT License
557 stars 160 forks source link

Empty .vcf file #531

Closed gktahon closed 5 years ago

gktahon commented 5 years ago

Hey all,

I recently started using Nanopore sequencing and after the assembly steps I have a problem when running nanopolish.

First I assembled the genome using Canu. To do so, I used the basecalled reads (using Flappie, file is called basecalls.fq). First I corrected and trimmed these reads. This gave me a subset of reads in a fasta file (arcobacter.trimmedReads.fasta, containing 2930 sequences) with which I assembled the genome. It gave one contig of ~2.5 Mb in a .fasta file (arcobacter.contigs.fasta, header is >tig00000001 len=2355158 reads=2798 covStat=982.79 gappedBases=no class=contig suggestRepeat=no suggestCircular=yes). After aligning this sequence with the complete reference genome of the same bacterial strain I found that I am missing roughly 25.000 nucleotides. This is where I started using nanopolish (version 10.2).

For nanopolish, I used the tutorial given on this website: https://nanopolish.readthedocs.io/en/latest/quickstart_consensus.html

In a first step, I indexed the raw reads. Here is used the following command: nanopolish index -d path_to_fast5_files/ arcobacter.trimmedReads.fasta

This works like a charm (I think). The process completes saying the following: [readdb] num reads: 495191, num reads with path to fast5: 495191 It gives me the files arcobacter.trimmedReads.fasta.index.readdb, .gzi, .fai and arcobacter.trimmedReads.fasta.index.

Secondly, I run minimap (version 2.11) and samtools (version 1.8) as followed: minimap2 -ax map-ont -t 8 arcobacter.contigs.fasta arcobacter.trimmedReads.fasta | samtools sort -o reads.sorted.bam -T reads.tmp

This results in a file called reads.sorted.bam and gives me the following output: [M::mm_idx_gen::0.1310.96] collected minimizers [M::mm_idx_gen::0.1611.75] sorted minimizers [M::main::0.1611.75] loaded/built the index for 1 target sequence(s) [M::mm_mapopt_update::0.1721.70] mid_occ = 10 [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1 [M::mm_idx_stat::0.1791.67] distinct minimizers: 403007 (92.91% are singletons); average occurrences: 1.096; average spacing: 5.333 [M::worker_pipeline::5.0946.53] mapped 2930 sequences [M::main] Version: 2.11-r797 [M::main] CMD: minimap2 -ax map-ont -t 8 arcobacter.contigs.fasta arcobacter.trimmedReads.fasta [M::main] Real time: 5.115 sec; CPU: 33.303 sec

The next command is samtools index reads.sorted.bam. This gives me the file reads.sorted.bam.bai

After this I run the following command: nanopolish variants --consensus -o polished.vcf -w "tig00000001:1-2355158" -r arcobacter.trimmedReads.fasta -b reads.sorted.bam -g arcobacter.contigs.fasta

In the command line, the following message appears: total reads: 2916, unparseable: 0, qc fail: 0, could not calibrate: 0, no alignment: 2916, bad fast5: 0

The resulting vcf file looks like this:

fileformat=VCFv4.2

nanopolish_window=tig00000001:1-2355157

INFO=

INFO=

INFO=

INFO=

INFO=

FORMAT=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample

When i finally run the last command: nanopolish vcf2fasta --skip-checks -g arcobacter.contigs.fasta polished.vcf > polished_genome.fa

it gives me this: rewrote contig tig00000001 with 0 subs, 0 ins, 0 dels (0 skipped)

I also tried changing the -w parameter by changing the window and afterwards combining the different .vcf files, but these files look exactly the same and the vcf2fasta output is also the same.

Could someone help me to see what I am doing wrong? Also, another final question. I saw that nanopolish variants can also be used to detect SNPs and indels with respect to a reference genome. Since for the bacterial genome I sequenced with Oxford Nanopore I also have the closed reference genome, would it be possible to tell me if I can correct my draft assembly using this reference and how this is done?

Many thanks in advance! Guillaume

jts commented 5 years ago

Hi,

You should use the original reads from the basecaller in nanopolish, not the trimmed and corrected reads. Nanopolish expects the reads to be unmodified after basecalling (with a few exceptions).

Jared

gktahon commented 5 years ago

Dear jared,

Thank you for the swift reply!!

I have redone the analysis but the problem stays the same. I have done the initial indexing with the original fastq file and with the respective fasta file, but the result is the same. For the e coli test case, everything works perfect

I compared my data with the files from the e coli test case and I think I know where the problem is, but I have no idea on how to fix it or what may be causing it.

The indexing goes like it should, at least I think it does. I end with a line saying the following: [readdb] num reads: 149050, num reads with path to fast5: 149050

Subsequently I run the minimap and samtools command to create the reads.sorted.bam file. After checking the head of this file, I can see a difference with the file of the test case, i.e. my file is missing nucleotide data.

The output of the minimap command looks like this: [M::mm_idx_gen::0.1100.93] collected minimizers [M::mm_idx_gen::0.1381.34] sorted minimizers [M::main::0.1381.34] loaded/built the index for 1 target sequence(s) [M::mm_mapopt_update::0.1551.30] mid_occ = 10 [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1 [M::mm_idx_stat::0.1631.29] distinct minimizers: 402984 (92.97% are singletons); average occurrences: 1.095; average spacing: 5.331 [M::worker_pipeline::88.4302.99] mapped 85144 sequences [M::worker_pipeline::92.822*2.97] mapped 22772 sequences [M::main] Version: 2.11-r797 [M::main] CMD: minimap2 -ax map-ont arcobacter.contigs_AL_Corr.fasta out2.fasta [M::main] Real time: 92.845 sec; CPU: 275.808 sec [bam_sort_core] merging from 1 files and 1 in-memory blocks...

The header of my file looks like this: caad3990-978d-4259-aea1-4e0fa9032ebc 256 tig00000001 1 0 2887S13M1D14M1D4M1D36M1I16M2I68M1D9M1D9M5D1M4D1M1D11M1D7M1I6M1D2M1D10M4D36M1D4M2D12M1D21M1I13M1D12M1I39M1D5M4D17M1I37M1D15M1I17M2D20M2I27M1D143M1D30M1I62M1D15M1D91M1D5M2D41M2D16M1D23M3D101M2D54M1I13M1D7M2I13M1I20M1I106M1D4M2I34M3D10M2I2M1I13M1I34M9D47M1D35M1D19M1D8M1D34M4D12M1D17M11D24M1I18M1D18M2D16M1I1M1I53M2D2M2I1M2I5M1D4M2I14M3D26M1D19M1I3M2D31M1D10M2D9M8D22M6D8M1D44M1D12M1I19M10S 00 * NM:i:181 ms:i:2890 AS:i:2890 nn:i:0 tp:A:S cm:i:171 s1:i:1179 dv:f:0.0481 bf3883ac-29e9-48f4-b88c-529a9be24bd1 256 tig00000001 1 0

In the file from the test case, this is followed by nucleotide data, for example: 0a238451-b9ed-446d-a152-badd074006c4 0 tig00000001 15 60 6580S21M1I2M3I13M...1D7M1I25M1D37M2D23M2D7M1D4M3I11M2I41M1I15M1D3M1D1M3D17M2I5M1D18M1I1M3I8M1D17M3D47M2D14M3D3M1D32M1I33M1I16M2D45M5D10M1D32M1I61M1D23M1D40M38S * 0 0 TTAAAAACTCGGTATGCTTCGTTCGGTTACGTATTACTACTTGCCTGTCGGCTCTATCTTCGGCGTCTGCTTGGGTGTTTAGCTCAATGACCTTTGCAGGCGACACGCAACTGGCGAACGTCAAAGCGGTGGATGATATCTACCCGATGCTGGCGATCCTGCCAAGCTA....

This will obviously be why the rest of the process is not working. What may be causing this and how can I solve it? Sorry for the questions, I am fairly new to polishing with nanopore data.

Many thanks in advance! Guillaume

jts commented 5 years ago

Can you provide me with the complete list of commands that you ran?

gktahon commented 5 years ago

Dear Jared,

These are the commands that I used:

nanopolish index -d path_to_fast5/ basecalls.fastq minimap2 -ax map-ont -t 8 arcobacter.contigs.fasta basecalls.fastq | samtools sort -o reads.sorted.bam -T reads.tmp samtools index reads.sorted.bam samtools view reads.sorted.bam | head nanopolish variants --consensus -o polished.vcf -w "tig00000001:1-25000" -r basecalls.fastq -b reads.sorted.bam -g arcobacter.contigs.fasta nanopolish vcf2fasta --skip-checks -g arcobacter.contigs.fasta polished.vcf > polished_genome.fa

path_to_fast5/ is the path to a directory containing several subdirectories in which the raw fast5 files are. arcobacter.contigs.fasta is a fasta file containing a single entry after assembling the genome using canu. To make the assembly I first corrected basecalls.fastq, followed by trimming them. With that of reads output I assembled the genome and obtained the single sequence.

I also tried these commands with basecalls.fasta instead of basecalls.fastq. To make basecalls.fasta I just modified basecalls.fastq by using the following command: sed -n '1~4s/^@/>/p;2~4p' basecalls.fastq > basecalls.fasta

In the final step (nanopolish variants) I also tested different windows ranging from "tig00000001:1-2000" to "tig00000001:1-2355157", but the result is the same, being an output that says something like this: total reads: x, unparseable: 0, qc fail: 0, could not calibrate: 0, no alignment: x, bad fast5: 0

jts commented 5 years ago

Hmm. Is basecalled.fastq the exact file that came out of albacore (or guppy). If so can you send me a few reads, and the corresponding fast5s?

Jared

gktahon commented 5 years ago

Dear Jared,

basecalled.fastq is the original file I received from the sequencing group. Basecalling was done using albacore and the file is actually namesd fastq_albacore.fastq. So in the commands listed above is obviously used that file name when needed. Additionally I also received the basecalled data using Flappie in a file named basecalls.fq, but with this file the indexing already does not work.

As requested, I have attached a zip file containing the genome assembly (arcobacter_contigs_AL_Corr.fasta) after running canu on the subset of corrected and trimmed reads. The file also contains the first 93 reads of the fastq file (file fastq_albacore.fastq, I hope this is enough) and the corresponding fast5 files I received from the sequencing group.

Many thanks in advance Guillaume

Test Jared.zip

jts commented 5 years ago

Hi @gktahon,

I just looked into this and it appears to be a bug in ONT's software. They annotated the "experiment_type" in the fast5s as "rna" but it appears you used a cDNA kit. I've just pushed a workaround that should let you proceed, and let ONT know so they can fix this issue upstream.

Jared

gktahon commented 5 years ago

Hi Jared,

It works like a charm now! The parallel doesn't work, but that seems to be because of our machine. Anyhow, I just submitted a job to do them all and it went quite fast. Now I get the result that I could expect: [vcf2fasta] rewrote contig tig00000001 with 232 subs, 16898 ins, 440 dels (0 skipped)

Many thanks for your amazing help!