hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
139 stars 27 forks source link

slow5 index create error during eventalign #161

Closed lyj95618 closed 3 months ago

lyj95618 commented 4 months ago

Hi,

I tried to run f5c eventalign using the merged blow5 file that I created from slow5tools and here is the error I got:

[slow5_idx_init::INFO] Index file not found. Creating an index at '/hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/debug_nanopolish/try_f5c/slow5_dir/Sample2_merged.blow5.idx'.
[slow5_idx_insert::ERROR] Read ID '00013e59-1009-4676-bda4-c97d4b7e7eda' is duplicated At src/slow5_idx.c:495
[slow5_idx_build::ERROR] Inserting '00013e59-1009-4676-bda4-c97d4b7e7eda' to index failed At src/slow5_idx.c:335
[init_core] Error in loading SLOW5 index for /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/debug_nanopolish/try_f5c/slow5_dir/Sample2_merged.blow5

when I created the blow5 file from the fast5 files, I didn't get any error, except the bad fast5 error which I think shouldn't be a big issue. Here is the log:

[list_all_items] Looking for '*.fast5' files in /hpf/projects/bkalish/MIA_m6A/nanopore/new_samples/raw_data/Sample2
[list_all_items] Looking for '*.fast5' files in /hpf/projects/bkalish/MIA_m6A/nanopore/new_samples/raw_data/Sample2/fast5
[f2s_main] 2316 fast5 files found - took 0.694s
[f2s_main] Just before forking, peak RAM = 0.000 GB
[f2s_iop] 8 proceses will be used.
[f2s_iop] Spawning 8 I/O processes to circumvent HDF hell.
[f2s_child_worker::ERROR] Bad fast5: Fast5 file '/hpf/projects/bkalish/MIA_m6A/nanopore/new_samples/raw_data/Sample2/fast5/PAW02356_2b285171_aba1e25c_3.68_5.fast5' could not be opened or is corrupted.
[f2s_child_worker::ERROR] Bad fast5: Fast5 file '/hpf/projects/bkalish/MIA_m6A/nanopore/new_samples/raw_data/Sample2/fast5/PAW02356_2b285171_aba1e25c_3.51_5.fast5' could not be opened or is corrupted.
[f2s_child_worker::ERROR] Bad fast5: Fast5 file '/hpf/projects/bkalish/MIA_m6A/nanopore/new_samples/raw_data/Sample2/fast5/PAW02356_2b285171_aba1e25c_3.71_5.fast5' could not be opened or is corrupted.
[f2s_child_worker::ERROR] Bad fast5: Fast5 file '/hpf/projects/bkalish/MIA_m6A/nanopore/new_samples/raw_data/Sample2/fast5/PAW02356_2b285171_aba1e25c_3.65_5.fast5' could not be opened or is corrupted.
[f2s_child_worker::ERROR] Bad fast5: Fast5 file '/hpf/projects/bkalish/MIA_m6A/nanopore/new_samples/raw_data/Sample2/fast5/PAW02356_2b285171_aba1e25c_3.64_5.fast5' could not be opened or is corrupted.
[f2s_child_worker::ERROR] Bad fast5: Fast5 file '/hpf/projects/bkalish/MIA_m6A/nanopore/new_samples/raw_data/Sample2/fast5/PAW02356_2b285171_aba1e25c_3.58_5.fast5' could not be opened or is corrupted.
[f2s_child_worker::ERROR] Bad fast5: Fast5 file '/hpf/projects/bkalish/MIA_m6A/nanopore/new_samples/raw_data/Sample2/fast5/PAW02356_2b285171_aba1e25c_3.60_5.fast5' could not be opened or is corrupted.
[f2s_iop] Child process 886809 exited with status=1.
[list_all_items] Looking for '*.slow5' files in /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/debug_nanopolish/try_f5c/slow5_dir/Sample2
[merge_main] 368 files found - took 0.049s

[merge_main] Allocating new read group numbers - took 7.294s

[main] cmd: /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/tools/slow5tools-v1.1.0/slow5tools merge /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/debug_nanopolish/try_f5c/slow5_dir/Sample2 -t 15 -o /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/debug_nanopolish/try_f5c/slow5_dir/Sample2_merged.blow5
[main] real time = 3288.504 sec | CPU time = 2402.129 sec | peak RAM = 2.147 GB

In the other issue that I created in the slow5tools github, the other sample failed at fast5toslow5. This sample didn't. So I thought it was fine to continuous with eventalign. Would you think it is again because something wrong happened between pod5 to fast5 conversion?

Thanks for all the help! Laur

hasindu2008 commented 4 months ago

@lyj95618

The Read ID '00013e59-1009-4676-bda4-c97d4b7e7eda' is duplicated means that somehow the same read got inserted multiple times, do you have duplicate copies of the same fast5 file?

The Bad fast5: Fast5 file '/hpf/projects/bkalish/MIA_m6A/nanopore/new_samples/raw_data/Sample2/fast5/PAW02356_2b285171_aba1e25c_3.68_5.fast5' could not be opened or is corrupted. should not be ignored.

Likely that something happened during the conversion. Did you have luck with direct pod5>blow5 conversion using blue-crab?

lyj95618 commented 4 months ago

Thanks for the quick reply!

I checked the fast5 folder and there aren't duplicated fast5 files (at least all the file names are unique). In terms of the bad fast5 error, since I have ~2300 fast5 files, I thought having a couple of bad fast5 shouldn't be a big issue?

I am still waiting for the person to transfer me the pod5 files :( At least the command that he used seems very normal pod5 convert to_fast5 -o /fast5 -t 20 /pod5

I will definitely try the pod5->blow5 after I receive the files. Thanks a lot again for all the help and quick responses! It provides me with the direction to solve the issue.

Laur

hasindu2008 commented 4 months ago

The thing is, when slwo5tools sees an error it immediately exists with an error, rather than continuing any further. Would it possible that you share a file that slow5toosl complains is unreadable, for example,/hpf/projects/bkalish/MIA_m6A/nanopore/new_samples/raw_data/Sample2/fast5/PAW02356_2b285171_aba1e25c_3.68_5.fast5? I can have a look on my side.

lyj95618 commented 4 months ago

Thank you for the quick reply!

I think the file is larger than the github attachment limit size, so I have uploaded that fast5 to the following link. https://drive.google.com/drive/folders/15F7OYnAAllWhchMzLzqGCMIZykar4ik0?usp=sharing

hasindu2008 commented 4 months ago

OK, I downloaded and tried to use h5dump to see if this is a valid hdf5 file. Seems even h5dump fails meaning that it is completely bad.

h5dump PAW02356_2b285171_aba1e25c_3.68_5.fast5
h5dump error: unable to open file "PAW02356_2b285171_aba1e25c_3.68_5.fast5"

So seems like the pod5 convert to_fast5 caused the issue.

Let us see how blue-crab on pod5 would do!

lyj95618 commented 4 months ago

An update from trying pod5 -> slow5 directly.

blue-crab works great! No error from pod5 -> slow5 for both of the samples. And merging the blow5s into one blow5 was fine too!

21-May-24 20:46:33 - blue-crab - [INFO]: Creating directory: /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/blow5_dir/Sample1
21-May-24 20:46:33 - blue-crab - [INFO]: many2many: 43 pod5 files detected as input. Writing 1:1 pod5->s/blow5 to dir: /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/blow5_dir/Sample1
21-May-24 20:46:33 - blue-crab - [INFO]: writing s/blow5 to dir: /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/blow5_dir/Sample1
21-May-24 21:34:47 - blue-crab - [INFO]: pod5 -> s/blow5 complete

f5c index and eventalign was fine I think! And its way way way faster than using the fast5. I am still seeing something like this in the eventalign process. copying some of the lines in the following

[slow5_idx_get::ERROR] Read ID '8dae582d-a0d5-4331-935b-d6e57be50b9a' was not found. At src/slow5_idx.c:528
[read_slow5_single::WARNING] Slow5 record for read [8dae582d-a0d5-4331-935b-d6e57be50b9a] is unavailable/unreadable and will be skipped
[slow5_idx_get::ERROR] Read ID '4d20ee2b-881f-4f10-afd3-2b5bd6c626cb' was not found. At src/slow5_idx.c:528
[read_slow5_single::WARNING] Slow5 record for read [4d20ee2b-881f-4f10-afd3-2b5bd6c626cb] is unavailable/unreadable and will be skipped
[slow5_idx_get::ERROR] Read ID '74bb3ee8-5ee9-4071-acda-957e9dadafbb' was not found. At src/slow5_idx.c:528
[read_slow5_single::WARNING] Slow5 record for read [74bb3ee8-5ee9-4071-acda-957e9dadafbb] is unavailable/unreadable and will be skipped
[slow5_idx_get::ERROR] Read ID 'f6849a2c-493d-4327-b3b5-c19911b70496' was not found. At src/slow5_idx.c:528
[read_slow5_single::WARNING] Slow5 record for read [f6849a2c-493d-4327-b3b5-c19911b70496] is unavailable/unreadable and will be skipped
[meth_main::28.467*0.97] 512 Entries (0.6M bases) loaded
[pthread_processor::32.191*1.58] 512 Entries (0.6M bases) processed

The end of the eventalign log looks like this. it seems like a lot of reads got skipped because of MAPQ < 20? Should I make the cutoff lower?

[meth_main::16421.942*9.50] 25 Entries (0.0M bases) loaded
[pthread_processor::16422.021*9.50] 25 Entries (0.0M bases) processed
[meth_main] skipped unmapped: 0, skipped secondary: 0, skipped low_mapq: 5397672
[meth_main] total entries: 3977753, qc fail: 15346, could not calibrate: 151019, no alignment: 57694, bad reads: 29915
[meth_main] total bases: 3614.1 Mbases
[meth_main] Data loading time: 15376.382 sec
[meth_main]     - bam load time: 278.861 sec
[meth_main]     - fasta load time: 7537.501 sec
[meth_main]     - slow5 load time: 7538.241 sec
[meth_main] Data processing time: 10409.472 sec
[meth_main] Data output time: 1503.388 sec
[meth_main::INFO] Performance bounded by file I/O. File I/O took 4966.910 sec than processing
[meth_main::WARNING] Skipped 5397672 reads with MAPQ < 20. Use --min-mapq to change the threshold.
[main] Version: 1.4
[main] CMD: /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/tools/f5c-v1.4/f5c_x86_64_linux eventalign --reads /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/debug_nanopolish/try_f5c/Sample1_merged.fastq.gz --bam /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/cdna_bam_for_xpore/Sample1/Sample1.bam --genome /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/annotationfile/Mus_musculus.GRCm38.cdna.all.fa --signal-index --slow5 /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/debug_nanopolish/try_f5c/slow5_dir/Sample1_merged.blow5 --scale-events --rna --summary /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/debug_nanopolish/try_f5c/Sample1.nanopolish_summary.txt --threads 32
[main] Real time: 16422.711 sec; CPU time: 155969.454 sec; Peak RAM: 6.117 GB

Another question, I previously had some other samples that were processed by using nanopolish eventalign. If I want to combine the 2 new samples with the old samples for downstream analysis, should I re-process those samples with f5c? Or there shouldn't be a difference in the eventalign results between nanopolish and f5c?

Thanks a lot for the help! Laur

hasindu2008 commented 4 months ago

Hi @lyj95618

Bad reads are 29915 (Read ID xx was not found are these), which is an insignificant number compared to the total, so do not worry about it. These happen due to split reads (reads which are split into two by the basecaller, which then assign a different read ID and thus there is no associated signal record for it).

Is your data RNA or DNA? For direct-RNA, I have seen that many reads are below mapping quality 20 (this is determined by the aligner based on how probable the mapping is right). If you want to increase the sensitivity, yes, you can even bring this down to 0. But if you want a very high precision, filtering would be better.

Is your data R9? If so nanopolish and f5c results should be very similar (but nanopolish have a --mapq 0 by default in the most recent versions I think), so you might want to run f5c with --mapq 0 I guess. If you data is R10 or RNA004, nanopolish does not have models to support it I think.

lyj95618 commented 4 months ago

Thanks for the detailed reply!

Both my old and new samples are direct RNA. The reason why I asked this is that my old samples are R9 (I believe) and I used the default nanopolish eventalign to process them before and my new samples are RNA004 (f5c autodetect).

As you are saying there are differences in the mapping quality cutoff, I think maybe I should re-process the old samples using f5c eventalign (I think it should be okay to just the nanopolish index output to run f5c eventalign?).

Thanks a lot for the help!

hasindu2008 commented 4 months ago

Yeh, the Nanopolish index is compatible with f5c. But before that can you check if the stuff is actually R9? You can convert one FAST5 file to BLOW5, Then do slow5tools skim --hdr reads.blow5 | grep ""sequencing_kit This can help to identify whether rna002 (r9) or the newer rna004.

lyj95618 commented 4 months ago
[main] cmd: /hpf/largeprojects/ccmbio/acelik_files/kalish/nanopore/nanopore/lauren_test/tools/slow5tools-v1.1.0/slow5tools skim --hdr PAM78633_pass_f83266c0_6a9f5a6e_99.slow5
[main] real time = 0.087 sec | CPU time = 0.089 sec | peak RAM = 0.004 GB
@sequencing_kit sqk-rna002

So old samples are indeed R9

hasindu2008 commented 3 months ago

Closing the issue for now. Feel free to open a new issue / reopen this issue if you have nay more things to clarify.