jts / nanopolish

Signal-level algorithms for MinION data
MIT License
559 stars 159 forks

nanopolish index #1072

Closed rezarahman12 closed 1 year ago

rezarahman12 commented 1 year ago

Dear Jared,

I recently acquired nanopore direct RNA-seq data from my experiment. The raw FAST5 file was vbz-compressed, so I converted it to gzip compression using ont_fast5_api (https://github.com/nanoporetech/ont_fast5_api). I am now trying to run nanopolish index on the FAST5 file; I have only a single FAST5 file per replicate. When I run nanopolish index, three files are created:

FAS03824_pass_d70ed67e_c4d2fdd2_0.fastq.index
FAS03824_pass_d70ed67e_c4d2fdd2_0.fastq.index.fai
FAS03824_pass_d70ed67e_c4d2fdd2_0.fastq.index.gzi

However, the job has now been running for more than 15 hours without any change in the size of these three files; it seems to just keep running. I think a fourth file, FAS03824_pass_d70ed67e_c4d2fdd2_0.fastq.index.readdb, should also be produced. Is this behaviour normal, and should I simply keep nanopolish index running? Alternatively, could I convert the single FAST5 file into multi-FAST5 files and run nanopolish on those?

Thank you so much for your kind consideration.

Best regards Reza
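For context, the workflow described above might look like the following sketch. Paths and thread counts are placeholders, and this assumes ont_fast5_api's compress_fast5 tool and a standard nanopolish index invocation:

```shell
# Re-compress vbz FAST5s to gzip with ont_fast5_api (hypothetical paths):
compress_fast5 --input_path fast5_vbz/ --save_path fast5_gzip/ \
    --compression gzip --threads 8

# Index the basecalled reads against the FAST5 directory.
# nanopolish writes reads.fastq.index, .index.fai, .index.gzi,
# and finally .index.readdb once the FAST5 scan completes.
nanopolish index -d fast5_gzip/ reads.fastq
```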

jts commented 1 year ago

Hi,

It can take a long time to index large runs. Did this ever finish?

Jared

rezarahman12 commented 1 year ago

Thanks, Jared. After 72 hours of running, the job was killed without any change in the generated files.

Do you think this is caused by the recent vbz compression? I converted vbz to gzip using ont_fast5_api.

Many thanks Reza

jts commented 1 year ago

Try providing the sequencing summary file using the -s option, it will make indexing run much faster.
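A sketch of that invocation (paths hypothetical); the summary file lets nanopolish map read IDs to FAST5 files without opening every FAST5, which is what makes indexing much faster:

```shell
# -s supplies the basecaller's sequencing summary to skip the slow FAST5 scan
nanopolish index \
    -d /path/to/fast5_pass/ \
    -s /path/to/sequencing_summary.txt \
    reads.fastq
```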

rezarahman12 commented 1 year ago

I tried that as well, but it gives a warning that the sequencing_summary file is invalid. In the past, I used publicly available nanopore direct RNA-seq data with nanopolish and did not get this error.

warning: detected invalid summary file entries for 1 of 1 fast5 files
These files will be indexed without using the summary file, which is slow.
[readdb] indexing /scratch/project_mnt/S0077/ONT_Reza/KD1_1/KD1_1_fast5_pass_gzip

hasindu2008 commented 1 year ago

@rezarahman12 You may try using f5c index with the -t and --iop options for potentially faster indexing. The indexes produced by f5c index and nanopolish index are compatible with each other.

rezarahman12 commented 1 year ago

Thank you so much @hasindu2008. I'm now running f5c with the command below:

module load f5c

f5c index -d /scratch/project_mnt/S0077/ONT_Reza/KD1_1/KD1_1_fast5_pass_gzip -s /scratch/project_mnt/S0077/ONT_Reza/KD1_1/fastq_dir/sequencing_summary.txt /scratch/project_mnt/S0077/ONT_Reza/KD1_1/fastq_dir/pass/basecalled.fastq -t 62 --iop 32

However, I'm getting the following warning:

[parse_index_options::WARNING] --iop is incompatible with sequencing summary files. Option --sequencing-summary-file will be ignored
[parse_index_options::INFO] Consider using --slow5 option for fast indexing, methylation calling and eventalignment. See f5c section under https://hasindu2008.github.io/slow5tools/workflows.html for an example.
[find_all_fast5] Looking for fast5 in /scratch/project_mnt/S0077/ONT_Reza/KD1_1/KD1_1_fast5_pass_gzip
[f5c_index_iop] 1 fast5 files found - took 0.005s
[f5c_index_iop] Spawning 32 I/O processes to circumvent HDF hell

I'll keep it running and update you on the outcome soon.

Kind regards
Reza

hasindu2008 commented 1 year ago

Yeah, you can ignore the warning. But wait, you have a single FAST5 file? How large is it?

rezarahman12 commented 1 year ago

@hasindu2008 Thanks for your great help. Yes, I have only a single FAST5 file, 43.1 GB in size. I also have another sample whose single FAST5 file is 85.1 GB in size. Many thanks

hasindu2008 commented 1 year ago

@rezarahman12 Oh, those are pretty big (unlike the typical ~4000 reads per FAST5), so it is no wonder it takes ages. In this case, --iop will not help. Is this a publicly available dataset?

Another option I can think of is using the slow5tools f2s command to convert this FAST5 to BLOW5. The conversion will take some time, but potentially less than the indexing, due to how the HDF5 iterator works. The BLOW5 file can then be provided as input to nanopolish, which is usually much faster.
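A rough sketch of that FAST5-to-BLOW5 route, assuming a recent f5c build with BLOW5 support (paths and process counts are placeholders):

```shell
# Convert the FAST5 directory to BLOW5 parts, then merge into one file
slow5tools f2s /path/to/fast5_dir -d blow5_parts/ -p 8
slow5tools merge blow5_parts/ -o reads.blow5

# Index against the BLOW5 file instead of the FAST5 directory
f5c index --slow5 reads.blow5 reads.fastq
```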

rezarahman12 commented 1 year ago

@hasindu2008 The datasets were generated by ourselves. If I split the single FAST5 into multi-FAST5 files using ont-fast5-api, would that help the index run in f5c?

hasindu2008 commented 1 year ago

It is likely to help f5c index. You could give it a try, but I presume the splitting itself will also take a lot of time.
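One hypothetical way to do that split: to my knowledge ont_fast5_api has no single-step "split multi-FAST5" tool, so one route is multi-to-single followed by single-to-multi with a batch size (tool names from ont_fast5_api; paths are placeholders):

```shell
# Explode the giant multi-FAST5 into one file per read...
multi_to_single_fast5 -i big_fast5_dir/ -s singles/ -t 16

# ...then re-batch into multi-FAST5 files of ~4000 reads each
single_to_multi_fast5 -i singles/ -s split_fast5/ -n 4000 -t 16
```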

rezarahman12 commented 1 year ago

@hasindu2008 I've used slow5tools and then tried to run f5c index. However, I'm getting an error, so I've posted my issue on the slow5tools GitHub. Thank you for your consideration.

jts commented 1 year ago

Thanks @hasindu2008 for the help with this issue.