hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
140 stars 27 forks source link

Eventalign fails for most reads fail with "no alignment" or "not calibrated" #179

Open mzdravkov opened 3 weeks ago

mzdravkov commented 3 weeks ago

Hello! I'm trying to run eventalign for a dataset of small oligonucleotides of size 100nt, but most of the reads fail with either "no alignment" or "not calibrated".

This is the command that I execute along with the output of f5c:

> f5c eventalign -t 8 --slow5 oligo1_sample.blow5 -r oligo1_sample.fastq -b oligo1_RNA004_basecalled_dorado_aligned_u4_sample.bam -g ref_u2t.fa --rna --secondary=yes --min-mapq 0 --print-read-names --scale-events --samples -o oligo1_sample_eventalign.tsv --pore rna004
[init_core::INFO] builtin RNA004 nucleotide model loaded
[meth_main::0.188*1.27] 512 Entries (0.1M bases) loaded
[pthread_processor::0.284*3.58] 512 Entries (0.1M bases) processed
[meth_main::0.375*3.06] 512 Entries (0.1M bases) loaded
[pthread_processor::0.449*3.95] 512 Entries (0.1M bases) processed
[meth_main::0.557*3.40] 512 Entries (0.0M bases) loaded
[pthread_processor::0.630*3.97] 512 Entries (0.0M bases) processed
[meth_main::0.738*3.55] 512 Entries (0.0M bases) loaded
[pthread_processor::0.803*3.89] 512 Entries (0.0M bases) processed
[meth_main::0.916*3.57] 512 Entries (0.0M bases) loaded
[pthread_processor::0.977*3.87] 512 Entries (0.0M bases) processed
[meth_main::1.101*3.55] 512 Entries (0.0M bases) loaded
[pthread_processor::1.164*3.77] 512 Entries (0.0M bases) processed
[meth_main::1.278*3.53] 512 Entries (0.0M bases) loaded
[pthread_processor::1.341*3.75] 512 Entries (0.0M bases) processed
[meth_main::1.455*3.54] 512 Entries (0.0M bases) loaded
[pthread_processor::1.518*3.74] 512 Entries (0.0M bases) processed
[meth_main::1.634*3.55] 512 Entries (0.0M bases) loaded
[pthread_processor::1.696*3.74] 512 Entries (0.0M bases) processed
[meth_main::1.806*3.59] 512 Entries (0.0M bases) loaded
[pthread_processor::1.878*3.74] 512 Entries (0.0M bases) processed
[meth_main::1.986*3.60] 512 Entries (0.0M bases) loaded
[pthread_processor::2.051*3.75] 512 Entries (0.0M bases) processed
[meth_main::2.161*3.61] 512 Entries (0.0M bases) loaded
[pthread_processor::2.224*3.72] 512 Entries (0.0M bases) processed
[meth_main::2.331*3.60] 512 Entries (0.0M bases) loaded
[pthread_processor::2.393*3.71] 512 Entries (0.0M bases) processed
[meth_main::2.511*3.58] 512 Entries (0.0M bases) loaded
[pthread_processor::2.579*3.71] 512 Entries (0.0M bases) processed
[meth_main::2.686*3.61] 512 Entries (0.0M bases) loaded
[pthread_processor::2.751*3.72] 512 Entries (0.0M bases) processed
[meth_main::2.864*3.61] 512 Entries (0.0M bases) loaded
[pthread_processor::2.934*3.71] 512 Entries (0.0M bases) processed
[meth_main::3.040*3.62] 512 Entries (0.0M bases) loaded
[pthread_processor::3.106*3.71] 512 Entries (0.0M bases) processed
[meth_main::3.218*3.62] 512 Entries (0.0M bases) loaded
[pthread_processor::3.281*3.70] 512 Entries (0.0M bases) processed
[meth_main::3.394*3.62] 512 Entries (0.0M bases) loaded
[pthread_processor::3.462*3.72] 512 Entries (0.0M bases) processed
[meth_main::3.474*3.71] 264 Entries (0.0M bases) loaded
[pthread_processor::3.509*3.75] 264 Entries (0.0M bases) processed
[meth_main] skipped unmapped: 0, skipped secondary: 0, skipped low_mapq: 0
[meth_main] total entries: 9992, qc fail: 0, could not calibrate: 3029, no alignment: 6824, bad reads: 0
[meth_main] total bases: 0.9 Mbases
[meth_main] Data loading time: 3.461 sec
[meth_main]     - bam load time: 0.063 sec
[meth_main]     - fasta load time: 3.267 sec
[meth_main]     - slow5 load time: 0.125 sec
[meth_main] Data processing time: 1.315 sec
[meth_main] Data output time: 0.006 sec
[meth_main::INFO] Performance bounded by file I/O. File I/O took 2.146 sec than processing
[meth_main::WARNING] 9853 out of 9992 reads failed. Double check --pore and --rna options.
[main] Version: 1.5
[main] CMD: f5c eventalign -t 8 --slow5 oligo1_sample.blow5 -r oligo1_sample.fastq -b oligo1_RNA004_basecalled_dorado_aligned_u4_sample.bam -g ref_u2t.fa --rna --secondary=yes --min-mapq 0 --print-read-names --scale-events --samples -o oligo1_sample_eventalign.tsv --pore rna004
[main] Real time: 3.510 sec; CPU time: 13.146 sec; Peak RAM: 0.066 GB

The version of f5c is 1.5. I just downloaded the latest release.

All reads are aligned and since I'm setting --min-mapq 0 the quality shouldn't be an issue. I was suspecting that maybe because the reads are quite short maybe they're hitting some limit with a minimum length, but I didn't find anything like that in the code. I did find there's a hardcoded max_gap limit of 50, but I see that reads with long stretches of ideal matches are also missing from the output.

I'm testing this with four different samples, three of them have 3 mods interspersed within the 100nt RNA molecule and they have almost all reads filtered out (the one in the logs above and other two with similar results). The fourth sample is without any mods and it's slightly better: only 7867 out of 9992 reads failed (qc fail: 44, could not calibrate: 2449, no alignment: 5374, bad reads: 0).

Do you have any idea what may be the issue?

Best regards, Mihail

hasindu2008 commented 2 weeks ago

Hey, could you lower the following parameter and try?

--min-recalib-events INT: Minimum number of events to recalbrate (decrease if your reads are very short and could not calibrate) [default value: 200]. Introduced in f5c v0.8.
mzdravkov commented 2 weeks ago

Thank you! I must have missed this parameter in the documentation, sorry This resolved almost all no calibration errors, but had no effect on the no alignment ones. Are the "no alignment" errors indicating an issue with the alignment of the read sequence to the reference or with the alignment of the signal events?

hasindu2008 commented 2 weeks ago

Would you be able to share a little dataset containing a couple of reads? The reason could be both, if you give me a tiny dataset containing a few reads, I can have a look.

mzdravkov commented 2 weeks ago

Thank you very much for your very prompt support, I appreciate it! Here's an archive with a the blow5, sequences fasta, bam and reference fasta for 50 of the failed reads. eventalign_debug_dataset.tar.gz

Let me know if I can assist with anything else.

hasindu2008 commented 2 weeks ago

Hello

I just looks at the BLOW5 header and the information there is pretty strange. Is this a direct-RNA dataset or a genomic DNA dataset? The reason I am asking is, sequencing_kit sqk-rna002 which is likely to be direct-rna, but the experiment type is genomic_dna. Also the basecall_config_filename dna_r9.4.1_450bps_fast.cfg does not match with what it should be.

Was your original data in FAST5 format or POD5 format?

mzdravkov commented 2 weeks ago

Yes, sorry I failed to mention that. The experiment should be direct-RNA, but was done by ONT before the tools officially support RNA004, so some of the headers are wrong. I hoped that by setting the --pore flag, I could overwrite any automatic detection performed by f5c which may be confused by the unreliable headers. The original data we were given was POD5, which I converted to BLOW5 with blue-crab. The basecalling was done with dorado v0.7.1, using the rna004_130bps_sup@v5.0.0 model.

I just tried overwriting the kmer-model parameter too, using the rna004 model found in this repository, but the result was worse (no alignment reads went up from ~5400 to ~6500). But this also means that apparently f5c is picking a different model (I guess an r9 dna one) automatically. I'm not sure how I can find out which model is chosen, I tried increasing the verbosity level, but to no avail.

On a side note, my understanding is that a 9mer model is used for RNA004 (e.g. this model from ONT), but I noticed that the rna004 model for f5c is using 5mers. Why is that?

hasindu2008 commented 2 weeks ago

Ahh, I belive these were the times when ONT had beta stage rna004 flowcells. Metadata in pod5 were wrong at that time. So here it is necessary to provide --pore rna004. The inbuilt rna004 in f5c is 9-mer and is derived from that ONT's model you mentioned. The rna004 model in the repository is something we generated ourselves before ONT released the official rna004 model. We did not know the correct k-mer size at this time, so defaulted to 5-mer as was for r9. 5-mer model's alignment rate is less compared to the 9-mer model.

f5c should print something like below when the inbuilt 9-mer rna004 model is loaded: [init_core::INFO] builtin RNA004 nucleotide model loaded Hope it is clear.

I tried to align the data you gave and it fails for me too. I will carefully look at the data over the weekend and see what is the cause behind failing.

mzdravkov commented 2 weeks ago

Ok, thank you so much!

hasindu2008 commented 2 weeks ago

For debugging purposes, I will be using f5c resquiggle that will align the signals to the base-called reads. This is anyway the first step done in f5c eventalign and it is where the alignment fails.

I tried on the basecalls you gave and everything failed:

f5c resquiggle reads.fasta reads.blow5 --pore rna004 > /dev/null
[resquiggle_main::ERROR] all 50 out of 50 reads failed. Check the input files.

Then I basecalled with Dorado 0.7.2 with rna004_130bps_sup@v5.0.0 and very similar 49 of 50 failed:

/install/dorado-0.7.2/bin/dorado basecaller //install/dorado-0.7.2/models/rna004_130bps_sup@v5.0.0 reads.pod5 --emit-fastq > reads_dorado072.fastq
f5c resquiggle reads_dorado072.fastq reads.blow5 --pore rna004 > /dev/null

Then just tried using the older 3.0.1 sup model and like 17 aligned.

/install/dorado-0.7.2/bin/dorado basecaller //install/dorado-0.7.2/models/rna004_130bps_sup@v3.0.1 reads.pod5 --emit-fastq > reads_dorado072.fastq
[resquiggle_main::WARNING] 33 out of 50 reads failed. Double check --pore and --rna options.

I wonder if you happen to have any data generated from the latest RNA004 flowcells on these oligos or the basecalls for those older data generated form the basecalling software and model that was available at that time?

mzdravkov commented 2 weeks ago

Unfortunately I don't have data for this sample with the newer flowcells. I'll try to find out what was the state of the art model at the time they were sequenced and redo the basecalling with that older model. Thank you for your help! I'll let you know how that works out.