jts / nanopolish

Signal-level algorithms for MinION data
MIT License
561 stars 159 forks source link

bad fast5 files and regions not called #456

Closed DEsalasL closed 5 years ago

DEsalasL commented 6 years ago

Hi I have two questions, I'm indexing, mapping the reads, and running nanopolish (v 0.10.1) (all of the information for them is the same directory). However, during nanopolish variants I am getting two messages that are bit worrisome to me: 1)sevral messages that suggest that above 95% of reads that are flagged as bad. here two examples: [post-run summary] total reads: 1146, unparseable: 0, qc fail: 0, could not calibrate: 0, no alignment: 1, bad fast5: 1130 [post-run summary] total reads: 79, unparseable: 0, qc fail: 0, could not calibrate: 0, no alignment: 0, bad fast5: 78 2) I'm using default for haplotype (1000) but I'm getting this Warnings like this: 115 variants in span, region not called [992707 992943]

I'd like to understand why these files are considered bad or the regions are not called, and if it is possible to get suggestions that help me fix the problem. Please see the commands I'm running, as well as, the output from the mapping and excerpts from the variants output.

Many thanks for any suggestion,

Commands:

Dir1a=/path/to/fast5s/ Dir1b=/path/to/fast5s/ FqReads=Reads.fq Assembly=scaffolds.fas BamFiles=All.sorted.bam Nanopolish=/path/to/nanopolish ResultsDirName=Results-polishing Preffix=polished Makerange=/path/to/nanopolish_makerange.py

$Nanopolish index -d $Dir1a -d $Dir1b -f summary_files.fof $FqReads

minimap2 -ax map-ont -t 20 $Assembly $FqReads| samtools sort -@ 20 -o $BamFiles -T ${BamFiles::-11}.tmp - samtools index -@ 20 $BamFiles

python $Makerange $Assembly | parallel --gnu --results $ResultsDirName -P 10 \ $Nanopolish variants --consensus -o $Preffix_consensus.{1}.fa -w {1} \ -r $FqReads -b $BamFiles -g $Assembly -t 2 --min-candidate-frequency 0.1

--- this is the output of the mapping

   [readdb] num reads: 622326, num reads with path to fast5: 9237

[M::mm_idx_gen::3.2311.03] collected minimizers [M::mm_idx_gen::3.6431.57] sorted minimizers [M::main::3.6431.57] loaded/built the index for 184 target sequence(s) [M::mm_mapopt_update::3.8241.54] mid_occ = 123 [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 184 [M::mm_idx_stat::3.9371.53] distinct minimizers: 6425134 (85.71% are singletons); average occurrences: 1.555; average spacing: 5.217 [M::worker_pipeline::81.69816.52] mapped 84444 sequences [M::worker_pipeline::138.47718.00] mapped 83705 sequences [M::worker_pipeline::193.33918.46] mapped 83272 sequences [M::worker_pipeline::248.07318.79] mapped 84210 sequences [M::worker_pipeline::298.37419.02] mapped 84307 sequences [M::worker_pipeline::352.23519.17] mapped 84364 sequences [M::worker_pipeline::403.38419.28] mapped 83297 sequences [M::worker_pipeline::414.198*19.13] mapped 34727 sequences [M::main] Version: 2.10-r761

jts commented 6 years ago

Hi,

This line:

[readdb] num reads: 622326, num reads with path to fast5: 9237

indicates that very few of the reads have a path to the fast5 file. Can you double check the paths and the summary files are correct when running nanopolish index?

Jared

DEsalasL commented 6 years ago

Hi there, Sorry for the delay in this follow up: Yes, we double checked the paths, and they were correct (files in path exist and aren't empty). Since we were puzzled by this weird result, we tried to nanopolish on assemblies that we had polished with older versions of the program (previous to 0.10.1). In all cases the older versions of nanopolish reported None bad fast5 reads (0), while the 0.10.1 applied on the same read set is currently reporting above 95% of the reads as bad fast5. We also noticed that in one of the older runs the number of reads after indexing was 704620, with 704185 that had path to fast5 and the report for the same reads set with 0.10.1 is now reported as 580845 reads with only 1686 have path to fast5.

We ran out of ideas of what we could be doing wrong. would it be possible that your team checks if nanopolish indexing is working correctly? Many thanks D

jts commented 6 years ago

To help narrow it down, can you let me know what version works as expected?

Jared

DEsalasL commented 6 years ago

Hi, I'm afraid I cann't give you the version as we have it installed under conda environment and it is set to undergo daily updates. The 'last' version that worked without any problem that we could track was version that seem to be available around August 17 2018 (but have no idea if it was 0.8 or 0.9). I apologize for not being able to give you the version number.

jts commented 6 years ago

Can you run nanopolish index (for the latest version) with the -vvv option and see if it prints out any useful information?

DEsalasL commented 6 years ago

Hi Jared, After we used the -v option we noticed that nanopolish 0.10.1 was reporting a problem with the sequencing_summary.txt file (see below), so we ran nanopolish again but this time without using the sequencing_summary.txt and everything worked ok. What it is still puzzling is that our previous runs with older nanopore version and the same sequencing_summary.txt had worked just fine. I hope this information is useful for debugging:

Example output using the sequencing summary : Nanopolish=/scratch2/software/anaconda/envs/nanopolish/bin/nanopolish $Nanopolish index -v -d /scratch2/user/Thal_New/reads/ -s /scratch2/user/Thal_New/albacore_basecall_2.0.2/sequencing_summary.txt \ /scratch2/user/Thal_New/cut_reads/ThalNew_ALL_chop.fastq

Error found: Could not find read /scratch2/user/Thal_New/reads/20171006_1300_THALP2/fast5/0/MinION_HP_20171006_FAH26239_MN19285_mux_scan_THALP2_30911_read_47_ch_268_strand.fast5 in sequencing summary file Could not find read /scratch2/user/Thal_New/reads/20171006_1300_THALP2/fast5/0/MinION_HP_20171006_FAH26239_MN19285_mux_scan_THALP2_30911_read_2477_ch_240_strand.fast5 in sequencing summary file Could not find read /scratch2/user/Thal_New/reads/20171006_1300_THALP2/fast5/0/MinION_HP_20171006_FAH26239_MN19285_mux_scan_THALP2_30911_read_126_ch_324_strand.fast5 in sequencing summary file Could not find read /scratch2/user/Thal_New/reads/20171005_2222_THALP/fast5/70/MinION_HP_20171005_FAH26239_MN19285_sequencing_run_THALP_70871_read_13201_ch_8_strand.fast5 in sequencing summary file Could not find read /scratch2/user/Thal_New/reads/20171005_2222_THALP/fast5/70/MinION_HP_20171005_FAH26239_MN19285_sequencing_run_THALP_70871_read_2529_ch_431_strand.fast5 in sequencing summary file [readdb] num reads: 580845, num reads with path to fast5: 168

Example running nanopolish without the sequencing summary:

$Nanopolish index -v -d /scratch2/user/Thal_New/reads/ \ /scratch2/user/Thal_New/cut_reads/ThalNew_ALL_chop.fastq

[readdb] indexing /scratch2/user/Thal_New/reads/20171005_2222_THALP/fast5/94 [readdb] indexing /scratch2/user/Thal_New/reads/20171005_2222_THALP/fast5/70 [readdb] num reads: 704620, num reads with path to fast5: 704185

jts commented 6 years ago

Are you able to find entries for those reads in the sequencing summary? How many reads are in the summary file? Our parser for the sequencing summary file hasn't changed in quite awhile so I'm not sure if that is the problem.

DEsalasL commented 6 years ago

Hi Jared, I apologize for this post, I'm aware it sounds weird, but I just thought to give you an update from our side. Regarding to your question, not all the summary files contain the exact number of reads (which could point to an albacore problem). For that reason, we decided to revisit several assemblies from different species. So far, we are unable to use 0.10.1 to reproduce runs that worked before with versions 0.84 or 0.9.0. For example: 1) For paths to fast5s that are reported in the summary files that do exist, the report continues to be bad fast5. 2) Calling nanopore 0.10.1 without sequencing summaries seems to work as expected, except if there are no inner directories in the fast5 main directory (i mean if all fast5s files are in that directory instead of being in subdirectories). The indexing doesn't fail, but the core gets dumped after variants are called. Also, some contigs cann't be read after make_range is applied, but this is randomly (it is never the same contig): [fai_load] build FASTA index. error: faidx could not get contig length for contig c_000000000002 [fai_load] build FASTA index. HDF5-DIAG: Error detected in HDF5 (1.8.14) thread 23444009267200:

000: H5F.c line 591 in H5Fopen(): invalid file name

major: Invalid arguments to routine
minor: Bad value

terminate called after throwing an instance of 'hdf5_tools::Exception' what(): /Raw/Reads/Read_48503/Signal: : error in H5Fopen

As you can read, this doesn't make too much sense, and we are out of ideas. Thanks for you patience and input.

DEsalasL commented 5 years ago

Hi Jared, We found out that the error reported in this thread only occurs in the conda versions of Nanopolish. We decided to use the 0.10.1 under regular install. We'll soon upgrade to 0.10.2 DS

Genereneg commented 5 years ago

Howdy!

Similar issue, but with 0.11.0.

[agener@sphere xmas_nanopolish_trial]$ head slurm-198346.out slurm-198347.out slurm-198348.out ==> slurm-198346.out <== srun: fatal: No command given to execute. [readdb] indexing /home/agener/xmas_nanopolish_trial/Users/alejandrogener/Desktop/fast5_all/unsorted_all.fast5/ could not open fast5 file: /home/agener/xmas_nanopolish_trial/Users/alejandrogener/Desktop/fast5_all/unsorted_all.fast5//slurm-198213.out could not open fast5 file: /home/agener/xmas_nanopolish_trial/Users/alejandrogener/Desktop/fast5_all/unsorted_all.fast5//slurm-198214.out [readdb] num reads: 358041, num reads with path to fast5: 159818

==> slurm-198347.out <== srun: fatal: No command given to execute. [M::mm_idx_gen::0.0031.51] collected minimizers [M::mm_idx_gen::0.0072.31] sorted minimizers [M::main::0.0072.29] loaded/built the index for 1 target sequence(s) [M::mm_mapopt_update::0.0082.21] mid_occ = 2 [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1 [M::mm_idx_stat::0.0082.09] distinct minimizers: 3012 (100.00% are singletons); average occurrences: 1.000; average spacing: 5.412 [M::worker_pipeline::4.4922.10] mapped 198223 sequences [M::main] Version: 2.15-r913-dirty [M::main] CMD: /home/agener/minimap2/minimap2 -a -x map-ont /home/agener/xmas_nanopolish_trial/CM004184.1.fasta /home/agener/xmas_nanopolish_trial/Galaxy877-[Collapse_on_data_870].fasta

==> slurm-198348.out <== srun: fatal: No command given to execute. [post-run summary] total reads: 720, unparseable: 0, qc fail: 0, could not calibrate: 0, no alignment: 0, bad fast5: 708 [agener@sphere xmas_nanopolish_trial]$

Giving an empty methylation.tsv.

Ignoring the faulty srun and superfluous files in the fast5 folder.

Suggestions?

Thanks!