Closed JeremyQuo closed 5 months ago
Hi @JeremyQuo,
Not all the reads in basecall.bam are present in basecall.blow5.
samtools view basecall.bam | cut -f 1 | sort -k1,1 > bam_read_ids
slow5tools skim --rid basecall.blow5 | cut -f 1 | sort -k1,1 > slow5_read_ids
diff bam_read_ids slow5_read_ids
23d22
< 4eef7089-523a-439c-805c-a35fdf7d3fc0
68a68
> cc069a0c-e231-48ee-843b-0d28d605b91f
80d79
< f2670cde-ba33-44fc-b65f-f19d260ed564
You can either add those records to blow5 or delete them from basecall.bam.
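The same check can be done in Python with plain set differences. This is only a minimal sketch: it assumes the two ID lists have already been loaded into Python lists, the function name is made up for illustration, and `shared-read-1` is a placeholder. Only the three differing IDs come from the diff output above.

```python
def compare_read_ids(bam_ids, slow5_ids):
    """Return (only_in_bam, only_in_slow5), each as a sorted list."""
    bam_set, slow5_set = set(bam_ids), set(slow5_ids)
    return sorted(bam_set - slow5_set), sorted(slow5_set - bam_set)


# Shortened, illustrative ID lists; "shared-read-1" is a placeholder.
bam_ids = [
    "4eef7089-523a-439c-805c-a35fdf7d3fc0",
    "f2670cde-ba33-44fc-b65f-f19d260ed564",
    "shared-read-1",
]
slow5_ids = [
    "cc069a0c-e231-48ee-843b-0d28d605b91f",
    "shared-read-1",
]

only_in_bam, only_in_slow5 = compare_read_ids(bam_ids, slow5_ids)
print("only in BAM:", only_in_bam)
print("only in BLOW5:", only_in_slow5)
```

Reads listed under "only in BAM" are the ones to drop from the BAM (or to add to the BLOW5) before running reform.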
I deleted them from the basecall.bam and got these values.
running reform...
kmer_length: 1
sig_move_offset: 0
input bam: basecall.bam
output format: paf
output file: reform.paf
Info: Default stride: 5
processed_sam_record_count: 82
calculating offsets...
kmer_length: 6
sequence file: basecall.fastq
paf file: reform.paf
signal file: basecall.blow5
recommended kmer_length:5 recommended sig_move_offset:4
It is recommended to rerun reform with the recommended values. (-k 5 -m 4)
What was the command you used to do basecalling?
Thanks for the information. Here is my command:
dorado basecaller dna_r9.4.1_e8_sup@v3.6/ basecall.pod5 --emit-moves >basecall.bam
I used -k 5 -m 4 with reform, but the value of len_raw_signal is still not equal to the second column in the PAF file.
For example, for the read 0347e4bc-1f7d-4f04-8d89-3af903358f0f, ['len_raw_signal'] is 59312, but in the PAF the total signal length is 59310.
Additionally, I would like to ask the reason for a k-mer shift on the move table. While f5c/Nanopolish eventalign makes this assumption, it seems that the move table assumes a one-to-one correspondence between each base and its corresponding current signal index. Therefore, one would expect a read with 50 bases to have 50 indices. However, when applying a k-mer size of 5 with squigualiser reform -k 5, there are only 46 indices in the resulting table. Is this reasonable? Does this approach lead to greater accuracy?
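The count of 46 is consistent with simple k-mer counting: a sequence of n bases contains n - k + 1 overlapping k-mers, so a 50-base read with k = 5 yields 46 entries. Below is a minimal sketch of that arithmetic. The function names are hypothetical and this is not squigualiser's implementation.

```python
def kmer_count(num_bases: int, kmer_length: int) -> int:
    """Number of overlapping k-mers in a sequence of num_bases bases."""
    if kmer_length < 1:
        raise ValueError("kmer_length must be at least 1")
    return max(num_bases - kmer_length + 1, 0)


def kmers(seq: str, kmer_length: int) -> list:
    """List every overlapping k-mer of seq, left to right."""
    return [seq[i:i + kmer_length] for i in range(kmer_count(len(seq), kmer_length))]


print(kmer_count(50, 1))  # one entry per base when k = 1
print(kmer_count(50, 5))  # fewer entries when k = 5
print(kmers("ACGTACGT", 5))
```

With k = 1 the table keeps one entry per base, which matches the one-to-one expectation described above.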
@JeremyQuo Dorado is likely doing read-splitting by default, which can produce new read IDs that are not present in the original signal files. See if there is an option to disable it. Otherwise, use Dorado through the https://github.com/Psy-Fer/buttery-eel wrapper, which by default does not perform read splitting, if I remember right.
@hiruna72 As per the spec, the len_raw_signal in PAF should match the len_raw_signal in BLOW5. Otherwise it is a bug. Could you please investigate?
Thanks for your answer.
The read-splitting problem does not significantly impact the accuracy and skipping a few reads is acceptable.
The index issue caused by the difference in len_raw_signal is more important.
In the example I provided, there is a small difference in the len_raw_signal between the BLOW5 file and the PAF file.
But for the read with ID 07932ae5-951f-42d6-9354-33536c41b0f2 in another dataset of mine (attached), the len_raw_signal is 19964 in the BLOW5 file but 19820 in the PAF file. This discrepancy between the two files can affect the accuracy of indexing.
another_sample.zip
But when I work with RNA data, this issue does not occur.
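A quick way to list every affected read is to compare the second PAF column against the signal lengths taken from the BLOW5 file. The sketch below is only illustrative: it assumes the BLOW5 lengths have already been collected into a dict, the function name is hypothetical, and the PAF lines are truncated to their first two columns. The lengths are the ones quoted in this thread.

```python
def find_length_mismatches(paf_lines, blow5_lengths):
    """Return (read_id, blow5_len, paf_len) for each read whose lengths differ."""
    mismatches = []
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        read_id, paf_len = fields[0], int(fields[1])
        blow5_len = blow5_lengths.get(read_id)
        if blow5_len is not None and blow5_len != paf_len:
            mismatches.append((read_id, blow5_len, paf_len))
    return mismatches


# Truncated PAF lines: only the read ID and the signal length columns are shown.
paf_lines = [
    "0347e4bc-1f7d-4f04-8d89-3af903358f0f\t59310",
    "07932ae5-951f-42d6-9354-33536c41b0f2\t19820",
]
blow5_lengths = {
    "0347e4bc-1f7d-4f04-8d89-3af903358f0f": 59312,
    "07932ae5-951f-42d6-9354-33536c41b0f2": 19964,
}

for read_id, blow5_len, paf_len in find_length_mismatches(paf_lines, blow5_lengths):
    print(read_id, blow5_len, paf_len, blow5_len - paf_len)
```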
Hello @JeremyQuo,
Thank you very much for finding this.
This is a serious issue you have found. But the issue is not with squigualiser reform.
Consider read_id 0347e4bc-1f7d-4f04-8d89-3af903358f0f.
The basecall.bam auxiliary tag ns reports 59310 (this is the value extracted by reform and printed to the output).
The basecall.blow5 len_raw_signal reports 59312. I also counted the number of signal points to be sure. There are 59312 signal points.
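For anyone who wants to check the ns tag without extra tooling, the optional fields of a SAM text line can be parsed directly. This is a minimal sketch: the helper name is hypothetical, and the record is a made-up unaligned entry with a placeholder sequence. Only the read ID and the ns value come from this thread.

```python
def get_int_tag(sam_line, tag):
    """Return the integer value of an optional SAM tag, or None if it is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional fields follow the 11 mandatory SAM columns as TAG:TYPE:VALUE.
    for field in fields[11:]:
        parts = field.split(":", 2)
        if len(parts) == 3 and parts[0] == tag and parts[1] == "i":
            return int(parts[2])
    return None


# Made-up unaligned record; the sequence and qualities are placeholders.
sam_line = "\t".join([
    "0347e4bc-1f7d-4f04-8d89-3af903358f0f", "4", "*", "0", "0",
    "*", "*", "0", "0", "ACGT", "IIII", "ns:i:59310",
])

ns = get_int_tag(sam_line, "ns")
blow5_len_raw_signal = 59312  # value reported for this read in basecall.blow5
print(ns, blow5_len_raw_signal, ns == blow5_len_raw_signal)
```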
What is the original format of the raw data (FAST5 or POD5)?
Can you check the original file for the signal length and the actual signal itself to make sure slow5tools f2s did not make an error?
Thanks again.
@JeremyQuo could you attach the original fast5/pod5 so hiruna could try? This could be a bug in Dorado too.
OK
For the read 0347e4bc-1f7d-4f04-8d89-3af903358f0f, I attached the fast5 and pod5.
0347e4bc-1f7d-4f04-8d89-3af903358f0f_fast5.zip sample_pod5.zip
Many thanks for your help.
Is this the original fast5?
If so, slow5tools f2s has created the blow5 without error.
This must be a bug in dorado when generating the ns tag values for the basecall.bam output. Could you open an issue on the dorado repository regarding this?
Thank you.
Thanks for your help.
This is the single-read fast5 that I converted from the multi-read format using ont-fast5-api. In theory, it should be the same as the original.
I believe ONT may have neglected the maintenance of R9 DNA, considering that most people now use R10.4.1. I will use the R10 data for plotting the move table and check this problem in R10 soon.
@JeremyQuo
I basecalled your 0347e4bc-1f7d-4f04-8d89-3af903358f0f.fast5 file using Guppy as follows:
/install/ont-guppy-6.5.7/bin/guppy_basecaller -i fast5/ -s out_guppy/ --device cuda:all --config dna_r9.4.1_450bps_sup.cfg --moves_out --bam_out
which gives ns:i:59312.
Then I converted your 0347e4bc-1f7d-4f04-8d89-3af903358f0f.fast5 to a blow5 using slow5tools f2s and used buttery-eel+dorado-server to directly basecall the BLOW5.
/install/buttery-eel-0.4.2+dorado7.2.13/scripts/eel -i a.blow5 -o a.sam --device cuda:all --config dna_r9.4.1_450bps_sup.cfg --moves_out
This gives ns:i:59312.
However, when I try to basecall the single-fast5 using dorado standalone it errors out.
/install/dorado-0.3.4/bin/dorado basecaller /install/dorado-0.3.4/models/dna_r9.4.1_e8_sup@v3.6 fast5/ --device cuda:all --emit-moves --emit-sam > b.sam
[2024-02-03 09:03:06.614] [info] > Creating basecall pipeline
HDF5-DIAG: Error detected in HDF5 (1.8.12) thread 0:
#000: ../../src/H5G.c line 463 in H5Gopen2(): unable to open group
major: Symbol table
minor: Can't open object
#001: ../../src/H5Gint.c line 320 in H5G__open_name(): group not found
major: Symbol table
minor: Object not found
#002: ../../src/H5Gloc.c line 430 in H5G_loc_find(): can't find object
major: Symbol table
minor: Object not found
#003: ../../src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
major: Symbol table
minor: Object not found
#004: ../../src/H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
major: Symbol table
minor: Callback failed
#005: ../../src/H5Gloc.c line 385 in H5G_loc_find_cb(): object 'channel_id' doesn't exist
major: Symbol table
minor: Object not found
[2024-02-03 09:03:06.666] [error] Unable to open the group "channel_id": (Symbol table) Object not found
Seems like ONT cannot handle their own FAST5 mess they created (gonna happen for pod5 in the future the way it is going, actually already happened).
Then I converted my BLOW5 file to POD5 using blue-crab and gave that POD5 to Dorado.
blue-crab s2p a.blow5 -o a.pod5
/install/dorado-0.3.4/bin/dorado basecaller /install/dorado-0.3.4/models/dna_r9.4.1_e8_sup@v3.6 pod5/ --device cuda:all --emit-moves --emit-sam > b.sam
That gives ns:i:59312 too.
Can you basecall the attached pod5 file using your dorado and see what the ns tag is?
@hasindu2008
Many thanks for your attention. I think we have found the real reason.
I reran the basecaller on the pod5 you gave me, and the ns tag is still 59310, not 59312.
Then I checked my dorado version: it is 0.5.2. When I switched to the older version 0.4.3, the value became 59312.
In conclusion, this is a bug in dorado v0.5.2.
Oh no Dorado. This is gonna be an annoying bug for everyone like us. Could you please open an issue in dorado to report this?
OK. I will do it.
Apart from that, I would like to raise an optimization issue. Since squigualiser has a script to determine sig_move_offset and kmer_length, is it possible to determine them automatically in squigualiser reform and then generate the PAF file? This would help reduce additional steps and associated costs.
Hi @JeremyQuo,
Precomputed values are available for some models. https://github.com/hiruna72/squigualiser/blob/dev/docs/reform.md#precomputed-kmer-lengths-and-signal-move-offsets
You can directly use them with the -k and -m parameters or pass --profile.
But as you have found, I don't recommend relying on them that much as ONT is actively making changes.
We will try to update and maintain precomputed values.
My suggestion is to create a subset of about 100 reads of your dataset and run calculate_offsets to determine the best -k and -m values.
You can even make a density plot for a particular read to confirm the results.
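One way to pick such a subset is to sample read IDs at random and then extract those reads with whatever tool you normally use. The sketch below covers only the sampling step. It assumes the read IDs are already in a list, the function name is hypothetical, and the ID strings are placeholders.

```python
import random


def sample_read_ids(read_ids, n=100, seed=0):
    """Pick up to n distinct read IDs, reproducibly for a given seed."""
    unique_ids = sorted(set(read_ids))  # sort so the result does not depend on input order
    rng = random.Random(seed)
    return rng.sample(unique_ids, min(n, len(unique_ids)))


# Placeholder IDs standing in for the read IDs of a real dataset.
all_ids = [f"read-{i}" for i in range(1000)]
subset = sample_read_ids(all_ids, n=100)
print(len(subset))
```

Fixing the seed keeps the subset stable between runs, so the recommended -k and -m values can be reproduced later.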
Basic pipelines are available here with small datasets. https://github.com/hiruna72/squigualiser/tree/main/test/data/raw/pipelines
If you mean to embed the calculate_offsets part inside reform, yes, that's possible.
I will implement that in the future. Thanks for the suggestion.
@hasindu2008 ,
Single read fast5s are not supported by dorado which is why you're seeing this error [2024-02-03 09:03:06.666] [error] Unable to open the group "channel_id": (Symbol table) Object not found
https://github.com/hiruna72/squigualiser/issues/53#issuecomment-1924780570
We'll improve the error message in future.
Regards
Hello! Yes, it's me again, asking for help. Recently, I've been working on improving my nanoCEM software, which is designed to showcase the differences in the current level of modification sites.
I want to add support for the move table in the BAM file from the ONT basecaller. The author of f5c has recommended using the squigualiser reform method, which can help me convert the BAM to a PAF file that records the index of the basecalled sequence. I think this method is well-suited for my approach. I plan to use the CIGAR string from the mapped BAM file to align it with the reference sequence.
However, I'm currently facing some issues. For my DNA (r941 minion sup) data, the second column in the PAF record represents the total length of the signal, which is different from the 'len_raw_signal' in the 'blow5'. I suspect that it is likely related to the '--sig_move_offset' and '--kmer_length' options. My RNA data (r941 rnaseq 002) worked well, and for it I used --sig_move_offset 0 --kmer_length 1.
Although you provided some documentation, I need help because I'm not aware of the support for Dorado, and I tried to run calculate_offsets.py but encountered an error. ![image](https://github.com/hiruna72/squigualiser/assets/76717431/fa026844-4e4e-4a6a-9b2a-6f3c42283f1f)
Can you provide me with the accurate parameters or help me take a look and check my sample data (attached)? sample.zip
I would appreciate your assistance.
Best regards, GUO Zhihao