hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
144 stars 26 forks source link

Error during eventalign: Read ID 'X' was not found. Slow5 record will be skipped. #167

Closed tjprins closed 5 months ago

tjprins commented 5 months ago

Hello,

I am using m6ANet to detect m6A modifications in RNA and recently swapped over to the new nanopore RNA004 chemistry/kit. Because of this, my pipeline uses ONT's new basecaller, dorado, and I am using f5c instead of nanopolish, per the updated m6anet documentation. Everything seems to go smoothly with the new pipeline until I get to the f5c eventalign step, in which I get a lot of these messages:

[slow5_idx_get::ERROR] Read ID '261cc311-3a2c-40bc-ab69-df48a048aa2c' was not found. At src/slow5_idx.c:528
[read_slow5_single::WARNING] Slow5 record for read [261cc311-3a2c-40bc-ab69-df48a048aa2c] is unavailable/unreadable and will be skipped

At the end of the eventalign execution, it spits out the following:

[slow5_idx_get::ERROR] Read ID '261cc311-3a2c-40bc-ab69-df48a048aa2c' was not found. At src/slow5_idx.c:528
[read_slow5_single::WARNING] Slow5 record for read [261cc311-3a2c-40bc-ab69-df48a048aa2c] is unavailable/unreadable and will be skipped
[meth_main::772.016*7.95] 434 Entries (0.1M bases) loaded
[pthread_processor::772.050*7.95] 434 Entries (0.1M bases) processed
[meth_main] skipped unmapped: 0, skipped secondary: 0, skipped low_mapq: 1019994
[meth_main] total entries: 650162, qc fail: 115, could not calibrate: 16766, no alignment: 39878, bad reads: 23954
[meth_main] total bases: 753.6 Mbases
[meth_main] Data loading time: 387.301 sec
[meth_main]     - bam load time: 12.892 sec
[meth_main]     - fasta load time: 209.299 sec
[meth_main]     - slow5 load time: 164.516 sec
[meth_main] Data processing time: 724.617 sec
[meth_main] Data output time: 72.186 sec
[meth_main::WARNING] Skipped 1019994 reads with MAPQ < 20. Use --min-mapq to change the threshold.
[main] Version: 1.4
[main] CMD: f5c eventalign --rna -b /home/tj/Research/Data/Aligned/transcriptome/GRCh38_gv45_transcriptome/FA68LB4766WT624_D0.7.0/FA68LB4766WT624_aligned.bam -r FA68LB4766WT624_master.fastq -g /home/tj/Research/Data/Ref/GRCh38.gv45.transcriptome_wM6A_RCS_Controls.fa -o FA68LB4766WT624_eventalign.tsv --kmer-model /home/tj/Research/m6anet_models/rna004.nucleotide.5mer.model --slow5 /home/tj/Research/Data/Raw/FA68LB4766WT624/blow5/FA68LB4766WT624_master.blow5 --signal-index --scale-events
[main] Real time: 772.118 sec; CPU time: 6141.177 sec; Peak RAM: 1.554 GB

Not sure what is going on, but it looks like it is having trouble finding a lot of the reads after basecalling or pod5 -> blow5 conversion. My pipeline is the following: basecalling (dorado, pod5 as input) > alignment (minimap2) > pod5 to blow5 conversion (blue-crab) > f5c (index) > f5c (eventalign). For inclusion, I have included the commands and output of each of these steps from the console below.

Any insight or help is very much appreciated. Thank you!

Dorado:

$DORADO/dorado ${DORA}             $NRAW/${SAMPLEID}/pod5 > $NBASE/$SBPATH/${SAMPLEID}_master.fastq 
[2024-06-21 11:49:42.723] [info] Running: "basecaller" "--emit-fastq" "rna004_130bps_fast@v5.0.0" "/home/tj/Research/Data/Raw/FA68LB4766WT624/pod5"
[2024-06-21 11:49:42.728] [info]  - Note: FASTQ output is not recommended as not all data can be preserved.
[2024-06-21 11:49:42.731] [info] > Creating basecall pipeline
[2024-06-21 11:49:42.742] [info]  - BAM format does not support `U`, so RNA output files will include `T` instead of `U` for all file types.
[2024-06-21 11:49:53.246] [info] cuda:0 using chunk size 10000, batch size 5952
[2024-06-21 11:49:53.491] [info] cuda:0 using chunk size 5000, batch size 5952
[2024-06-21 11:55:23.991] [info] > Simplex reads basecalled: 1570565
[2024-06-21 11:55:23.991] [info] > Simplex reads filtered: 3640
[2024-06-21 11:55:23.991] [info] > Basecalled @ Samples/s: 1.589797e+08
[2024-06-21 11:55:24.377] [info] > Finished

Minimap2

minimap2 ${MM2T}         $GREF/${TREFERENCEID}         $NBASE/$SBPATH/${SAMPLEID}_master.fastq > $NALIT/$SBPATH/${SAMPLEID}_aligned_unsorted.sam
[M::mm_idx_gen::3.664*1.47] collected minimizers
[M::mm_idx_gen::4.535*1.76] sorted minimizers
[M::main::4.562*1.76] loaded/built the index for 252933 target sequence(s)
[M::mm_mapopt_update::4.805*1.72] mid_occ = 125
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 252933
[M::mm_idx_stat::4.952*1.70] distinct minimizers: 20853516 (42.14% are singletons); average occurrences: 3.899; average spacing: 5.397; total length: 438834164
[M::worker_pipeline::77.894*2.93] mapped 491939 sequences
[M::worker_pipeline::149.465*2.97] mapped 519978 sequences
[M::worker_pipeline::220.838*2.99] mapped 492663 sequences
[M::worker_pipeline::236.651*2.99] mapped 123995 sequences
[M::main] Version: 2.28-r1209
[M::main] CMD: minimap2 -ax map-ont -uf --secondary=no /home/tj/Research/Data/Ref/GRCh38.gv45.transcriptome_wM6A_RCS_Controls.fa /home/tj/Research/Data/Basecalled/FA68LB4766WT624_D0.7.0/FA68LB4766WT624_master.fastq
[M::main] Real time: 236.693 sec; CPU: 707.104 sec; Peak RSS: 5.263 GB

Blue-crab

blue-crab p2s $NRAW/${SAMPLEID}/pod5 -o $NRAW/${SAMPLEID}/blow5/${SAMPLEID}_master.blow5
21-Jun-24 12:33:27 - blue-crab - [INFO]: many2single: 23 pod5 files detected as input. Writing many pod5 to one s/blow5 file: /home/tj/Research/Data/Raw/FA68LB4766WT624/blow5/FA68LB4766WT624_master.blow5
21-Jun-24 12:33:27 - blue-crab - [INFO]: opening slow5 file: /home/tj/Research/Data/Raw/FA68LB4766WT624/blow5/FA68LB4766WT624_master.blow5
21-Jun-24 12:33:27 - blue-crab - [INFO]: m2s: Reading 23 pod5 file/s
21-Jun-24 12:33:27 - blue-crab - [INFO]: Writing header - limited to 1 read group for now, split your pod5 if it's a merged file
21-Jun-24 12:33:27 - blue-crab - [INFO]: Slow5 header written
21-Jun-24 12:38:51 - blue-crab - [INFO]: pod5 -> s/blow5 complete

f5c (index)

f5c index --slow5 $NRAW/${SAMPLEID}/blow5/${SAMPLEID}_master.blow5 $NBASE/$SBPATH/${SAMPLEID}_master.fastq
[index_main] Slow5 index built - took 11.506s
[index_main] Fasta index built - took 54.195s
[main] Version: 1.4
[main] CMD: f5c index --slow5 /home/tj/Research/Data/Raw/FA68LB4766WT624/blow5/FA68LB4766WT624_master.blow5 /home/tj/Research/Data/Basecalled/FA68LB4766WT624_D0.7.0/FA68LB4766WT624_master.fastq
[main] Real time: 65.895 sec; CPU time: 209.262 sec; Peak RAM: 0.509 GB
hasindu2008 commented 5 months ago

Hi @tjprins

Would you mind grepping this "261cc311-3a2c-40bc-ab69-df48a048aa2c" from the FASTQ? I reckon these are "split reads" where new Dorado splits a given read into two reads when basecalling. See if the FASTQ header has something like a different parent read ID?

Which version of Dorado are you using? I can try on my end too.

By the way, I see that a lot of reads have been skipped as < mapq 20. You can ask f5c to include in the analysis through --min-mapq 0. M6anet authors have tested this and have confirmed mapq 0 threshold is all right https://github.com/hasindu2008/f5c/issues/154#issuecomment-1960747253.

tjprins commented 5 months ago

Hi hasindu2008, thanks for such a quick response.

Would you mind grepping this "261cc311-3a2c-40bc-ab69-df48a048aa2c" from the FASTQ? I reckon these are "split reads" where new Dorado splits a given read into two reads when basecalling. See if the FASTQ header has something like a different parent read ID?

I saw a thread about this here when I was searching my issue. I looked into dorado's documentation as best I could to try to find a way to not split the reads, but I didn't see anything. If I run this data through guppy -> nanopolish I don't get this error, so you could be right, but I need dorado since guppy doesn't have a model for the new chemistry.

Below is the output of grep '261cc311-3a2c-40bc-ab69-df48a048aa2c' FA68LB4766WT624_master.fastq

@261cc311-3a2c-40bc-ab69-df48a048aa2c   st:Z:2024-06-21T00:16:42.545+00:00  RG:Z:4f29a5c9f8af0ea181004166b475ce6b83155cfc_rna004_130bps_fast@v5.0.0 DS:Z:gpu:NVIDIA GeForce RTX 3090 Ti

Which version of Dorado are you using? I can try on my end too.

I am using dorado 0.7.0. If you want me to make these files available to you so you can try on your end I can certainly do that.

By the way, I see that a lot of reads have been skipped as < mapq 20. You can ask f5c to include in the analysis through --min-mapq 0. M6anet authors have tested this and have confirmed mapq 0 threshold is all right https://github.com/hasindu2008/f5c/issues/154#issuecomment-1960747253.

Good to know, I will add that into my pipeline. For what it's worth, when I tried this, I got the same result.

Thanks again for all your help. I know it's gotta be rough to help us all troubleshoot our issues, but it makes a huge difference!!

hasindu2008 commented 5 months ago

Hello @tjprins

While it is time consuming to troubleshoot issues, it is also a pleasure to help the community in the way we can for the software we developed and continue to maintain.

I tested on my rna004 dataset using latest dorado and it seems read splitting is enabled by default (as we guessed).

I called:

 /install/dorado-0.7.2-linux-x64/bin/dorado basecaller /data/install/dorado-0.7.2-linux-x64/model/rna004_130bps_fast@v5.0.0/ pod5_shit/ > tmp.sam
 samtools view tmp.sam  | cut -f 1  > tmp.list
slow5tools get PNXRXX240010_reads_20k.blow5 --list tmp.list > /dev/null

and it gave like:

[slow5_idx_get::ERROR] Read ID 'f64dbcee-68ae-4cc4-8a86-2eef57d3dc14' was not found. At src/slow5_idx.c:539

Now grepping that read ID in sam file:

 samtools view tmp.sam | grep "f64dbcee-68ae-4cc4-8a86-2eef57d3dc14"
f64dbcee-68ae-4cc4-8a86-2eef57d3dc14    4       *       0       0       *       *       0       0       GGCGGCGGCGGCGGCCA       """""""""""""""""       qs:f:1  du:f:2.07425    ns:i:8297   ts:i:0  mx:i:4  ch:i:1177       st:Z:2024-01-11T07:50:55.557+00:00      rn:i:-1 fn:Z:PNXRXX240010_reads_20k.pod5        sm:f:806.16     sd:f:118.119    sv:Z:pa dx:i:0  RG:Z:844f1e7a1a1f3f41ac89a9971f789f21b3e161d2_rna004_130bps_fast@v5.0.0     pi:Z:ff6b33b0-f24f-4af8-9edb-0bae0b7f0d10       sp:i:0

That pi:Z:ff6b33b0-f24f-4af8-9edb-0bae0b7f0d10 indicates that this missing read is indeed a split read whose parent is ff6b33b0-f24f-4af8-9edb-0bae0b7f0d10.

This parent read id available in the slow5 file as indicated by:

slow5tools get PNXRXX240010_reads_20k.blow5 ff6b33b0-f24f-4af8-9edb-0bae0b7f0d10 > /dev/null

[main] cmd: slow5tools get PNXRXX240010_reads_20k.blow5 ff6b33b0-f24f-4af8-9edb-0bae0b7f0d10
[main] real time = 0.009 sec | CPU time = 0.010 sec | peak RAM = 0.006 GB

Now given that I confirmed that this is due to split reads, you can continue to simply ignore these missing reads in f5c. From your report it seems like only 23954 bad reads (which are these missing reads) out of 650162 which is like 4% of reads being lost.

[meth_main] total entries: 650162, qc fail: 115, could not calibrate: 16766, no alignment: 39878, bad reads: 23954

In theory, I could handle these parent reads ids in f5c, but this is not practically possible because this parent read id information is missing in the fastq format. Even if it was there, nanopore will keep on changing the names and conventions, making it a time consuming to maintain.

It appears that Dorado does not have an option to disable read splitting when I skimmed through the options. The easiest option is simply ignoring the lost 4% of reads. I know those missing reads spams the terminal with errors and warnings, but in the next release I am going to suppress these warnings for split reads and simply print a single warning at the end for the number of skipped reads. Alternatively, you use the buttery-eel which uses dorado server to directly basecall blow5 files. I tested with Dorado server 7.2.13 as:

 /install/buttery-eel-0.4.2+dorado7.2.13/scripts/eel --config  rna_rp4_130bps_fast.cfg -i PNXRXX240010_reads_20k.blow5 -o  tmp.fastq -x cuda:all

and the read splitting option is disabled by default. rna_rp4_130bps_fast.cfg is for rna004, but the model version is rna004_130bps_fast@v3.0.1 which is behind the rna004_130bps_fast@v5.0.0 the latest Dorado has. I haven't got to test buttery-eel on latest dorado-server to confirm what the read splitting behaviour is, but @Psy-Fer should be able to comment on.

A lot of information here, hope I was clear.

tjprins commented 5 months ago

Thanks, Hasindu. I followed your example and gave buttery-eel a shot with the dorado-server and can confirm that I am no longer having the issue and that all of my reads are now being faithfully reported. Thank you so much for your assistance, this has been immensely helpful. I will probably use dorado-server going forward (especially because I can use it directly on blow5 files with buttery-eel).

hasindu2008 commented 5 months ago

Great! I will close the issue. If you run into another issue or any other question, feel free to open a new issue.