Question about squigualiser reform

JeremyQuo commented 5 months ago

Hello! Yes, me, again for asking for help. Recently, I've been working on improving my nanoCEM software, which is designed to showcase the differences in the current level of modification sites.

I want to add the support for the move table in the bam file from the ONT basecaller. The author of f5c has recommended using the squigualiser reform method, which can help me convert bam to a PAF file, which records the index of basecalled sequence. I think this method is well-suited for my approach. I plan to use the CIGAR string from the mapped BAM file to align it with the reference sequence.

However, I'm currently facing some issues. For my DNA(r941 minion sup) data, the second column in the PAF record represents the total length of the signal, which is different from the 'len_raw_signal' mentioned in the 'blow5'. I suspect that it is likely related to the '--sig_move_offset' and '--kmer_length' options. And my RNA data (r941 rnaseq 002) worked well and I used --sig_move_offset 0 --kmer_length 1

Although you provided some documentation, I need help because I'm not aware of the support for Dorado and I tried to run calculate_offsets.py but encountered an error.

Can you provide me with the accurate parameters or help me take a look and check my sample data(attached)? sample.zip

I would appreciate your assistance.

Best regards, GUO Zhihao

hiruna72 commented 5 months ago

Hi @JeremyQuo,

Not all the reads available on basecall.bam are present in the basecall.blow5.

samtools view basecall.bam | cut -f 1 | sort -k1,1 > bam_read_ids
slow5tools skim --rid basecall.blow5 | cut -f 1 | sort -k1,1 > slow5_read_ids

diff bam_read_ids slow5_read_ids 
23d22
< 4eef7089-523a-439c-805c-a35fdf7d3fc0
68a68
> cc069a0c-e231-48ee-843b-0d28d605b91f
80d79
< f2670cde-ba33-44fc-b65f-f19d260ed564

You can either add those records to blow5 or delete them from basecall.bam.

I deleted them from the basecall.bam and got these values.

running reform...
kmer_length: 1
sig_move_offset: 0
input bam: basecall.bam
output format: paf
output file: reform.paf
Info: Default stride: 5
processed_sam_record_count: 82

calculating offsets...
kmer_length: 6
sequence file: basecall.fastq
paf file: reform.paf
signal file: basecall.blow5
recommended kmer_length:5 recommended sig_move_offset:4
It is recommended to rerun reform with the recommended values. (-k 5 -m 4)

What was the command you used to do basecalling?

JeremyQuo commented 5 months ago

Thanks for your information Here is my command.

dorado basecaller dna_r9.4.1_e8_sup@v3.6/ basecall.pod5 --emit-moves >basecall.bam

I used (-k 5 -m 4) to reform, but the value of len_raw_signal is still not fully equal to the second column in paf file. Like the read 0347e4bc-1f7d-4f04-8d89-3af903358f0f , ['len_raw_signal'] is 59312, but in paf the total signal length is 59310
Additionally, I would like to ask the reason for a K-mer shift on the move table. While f5c/Nanopolish eventalign makes this assumption,it seems that the assumption of the move table is a one-to-one correspondence between each base and its corresponding current signal index. Therefore, you expect a read with 50 bases to have 50 indices. However, when applying a k-mer size of 5 by squigualiser reform -k 5, there are only 46 indices in the resulting table. Is this reasonable? Does this approach lead to greater accuracy?

hasindu2008 commented 5 months ago

@JeremyQuo Dorado is likely doing read-splitting by default which can contribute to new readIDs that are not present in original signal files. See if there is an option to disable? Or else, use Dorado through https://github.com/Psy-Fer/buttery-eel wrapper which by default would not perform read splitting if I remember right.

@hiruna72 As per spec here, the len_raw_signal in PAF should match the len_raw_signal in BLOW5. Otherwise it is a bug. Could you please investigate?

JeremyQuo commented 5 months ago

Thanks for your answer.

The read-splitting problem does not significantly impact the accuracy and skipping a few reads is acceptable. The index issue caused by the difference in len_raw_signal is more important.

In the example I provided, there is a small difference in the len_raw_signal between the BLOW5 file and the PAF file. But for the read with ID 07932ae5-951f-42d6-9354-33536c41b0f2 in my another dataset (attached). The len_raw_signal is 19964 in the BLOW5 file, but it is 19820 in the PAF file. This indicates that there might be an issue or discrepancy between the two files. This issue can affect the accuracy of indexing. another_sample.zip

But when I work with RNA data, this issue does not occur.

hiruna72 commented 5 months ago

Hello @JeremyQuo,

Thank you very much for finding this.

This is a serious issue you have figured. But the issue is not with squigualiser reform.

Consider read_id 0347e4bc-1f7d-4f04-8d89-3af903358f0f. basecall.bam auxiliary tag ns reports 59310 (this is the value extracted by reform and printed to the output)

basecall.blow5 len_raw_signal reports 59312. And I counted the number of signal points to be sure. There are 59312 signal points.

What is the original format of raw data (Fast5 or Pod5?) Can you check the original file for the signal length and the actual signal itself to make sure slow5tools f2s did not make an error?

Thanks again.

hasindu2008 commented 5 months ago

@JeremyQuo could you attach the original fast5/pod5 so hiruna could try. This could be a bug in doraodo too.

JeremyQuo commented 5 months ago

OK For the read 0347e4bc-1f7d-4f04-8d89-3af903358f0f, I attached the fast5 and pod5.

0347e4bc-1f7d-4f04-8d89-3af903358f0f_fast5.zip sample_pod5.zip

Many thanks for your help.

hiruna72 commented 5 months ago

Is this the original fast5?

If so slow5tools f2s has created the blow5 without error.

This must be a bug in a dorado when generating the ns tag values for the basecall.bam output. Could you open an issue on dorado repository regarding this?

Thank you.

JeremyQuo commented 5 months ago

Thanks for your help. This is the single format into which I converted multiple formats using the ont-fast5-api. In theory, it should be the same as the original format.

I believe ONT may have neglected the maintenance of R9 DNA, considering that most people now use R10.4.1. I will use the R10 data for plotting the move table and check this problem in R10 soon.

hasindu2008 commented 5 months ago

@JeremyQuo I basecalled your 0347e4bc-1f7d-4f04-8d89-3af903358f0f.fast5 fast5 file using Guppy as follows:

/install/ont-guppy-6.5.7/bin/guppy_basecaller -i fast5/ -s out_guppy/ --device cuda:all --config dna_r9.4.1_450bps_sup.cfg --moves_out --bam_out

which gives ns:i:59312

Then I converted your 0347e4bc-1f7d-4f04-8d89-3af903358f0f.fast5 to a blow5 using slow5tools f2s and used buttery-eel+dorado-server to directly basecall the BLOW5.

/install/buttery-eel-0.4.2+dorado7.2.13/scripts/eel -i a.blow5 -o a.sam --device cuda:all --config dna_r9.4.1_450bps_sup.cfg --moves_out

This gives ns:i:59312.

However, when I try to basecall the single-fast5 using dorado standalone it errors out.

/install/dorado-0.3.4/bin/dorado basecaller /install/dorado-0.3.4/models/dna_r9.4.1_e8_sup@v3.6 fast5/ --device cuda:all --emit-moves --emit-sam > b.sam
[2024-02-03 09:03:06.614] [info] > Creating basecall pipeline
HDF5-DIAG: Error detected in HDF5 (1.8.12) thread 0:
  #000: ../../src/H5G.c line 463 in H5Gopen2(): unable to open group
    major: Symbol table
    minor: Can't open object
  #001: ../../src/H5Gint.c line 320 in H5G__open_name(): group not found
    major: Symbol table
    minor: Object not found
  #002: ../../src/H5Gloc.c line 430 in H5G_loc_find(): can't find object
    major: Symbol table
    minor: Object not found
  #003: ../../src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #004: ../../src/H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #005: ../../src/H5Gloc.c line 385 in H5G_loc_find_cb(): object 'channel_id' doesn't exist
    major: Symbol table
    minor: Object not found
[2024-02-03 09:03:06.666] [error] Unable to open the group "channel_id": (Symbol table) Object not found

Seems like ONT cannot handle their own FAST5 mess they created (gonna happen for pod5 in the future the way it is going, actually already happened).

Then I converted my BLOW5 file to POD5 using blue-crab and gave that POD5 to Dorado.

 blue-crab s2p a.blow5 -o a.pod5
 /install/dorado-0.3.4/bin/dorado basecaller /install/dorado-0.3.4/models/dna_r9.4.1_e8_sup@v3.6 pod5/ --device cuda:all --emit-moves --emit-sam > b.sam

That gives ns:i:59312 too.

Can you basecall the attached pod5 file using your dorado and see what the ns tag is?

a.zip

JeremyQuo commented 5 months ago

@hasindu2008

Many thanks for your attention, I think we find the real reason.

I rerun the basecaller on the pod5 you gived me, the ns tag is still 59310 not 59312

Then I checked my dorado version is 0.5.2, when I convert it to the old versio 0.4.3, it become 59312.

In conclusion, this is a bug in dorado v0.5.2

hasindu2008 commented 5 months ago

Oh no Dorado. This is gonna be an annoying bug for everyone like us. Could you please open an issue in dorado to report this?

JeremyQuo commented 5 months ago

OK. I will do it. Apart from that, I would like to raise an optimization issue. Since the squigualiser has script to determine sig_move_offset and kmer_length, is it possible to automatically determine it in the squigualiser reform and then generate the PAF file? This would help reduce additional steps and associated costs.

hiruna72 commented 5 months ago

Hi @JeremyQuo,

Precomputed values are available for some models. https://github.com/hiruna72/squigualiser/blob/dev/docs/reform.md#precomputed-kmer-lengths-and-signal-move-offsets You can directly use them with -k and -m parameters or pass --profile.

But as you have figured I don't recommend relying on them that much as ONT is actively making changes.

We will try to update and maintain precomputed values.

My suggestion is to create a subset of about 100 reads of your dataset and run calculate_offsets to determine the best -k and -m values.

And even make a density plot for a particular read to confirm the results.

Basic pipelines are available here with small datasets. https://github.com/hiruna72/squigualiser/tree/main/test/data/raw/pipelines

If you mean to embed calculate_offset part inside reform, yes, that's possible. I will implement that in the future. Thanks for the suggestion.

HalfPhoton commented 5 months ago

@hasindu2008 ,

Single read fast5s are not supported by dorado which is why you're seeing this error [2024-02-03 09:03:06.666] [error] Unable to open the group "channel_id": (Symbol table) Object not found https://github.com/hiruna72/squigualiser/issues/53#issuecomment-1924780570

We'll improve the error message in future.

Regards

hiruna72 / squigualiser

Question about squigualiser reform #53