Problem of caculate_base_base_shift

JeremyQuo commented 3 months ago

Hello. Here is Jeremy Quo again. I wanna ask the question of how to calculate the best base shift. I read carefully the function below,

https://github.com/hiruna72/squigualiser/blob/2389379fa8898bf78fd695b3bddac982213ea951/src/plot_utils.py#L194-L220 And now the function is as below, https://github.com/hiruna72/squigualiser/blob/2f24ff877f12c70c17cfca476201aa921e150c04/src/calculate_offsets.py#L112-L183 Here are my questions,

Is the number of best base shifts different in each read? Some are -3 and others are 0,-1,-2...
kmer_length is fixed in each sequencing data (such as DNA R941 is 6), so why there's a loop to calculate every kmer_length?
What is the input param y, is it all signals or just signals slice covered base-calling slice (signals[start_index:end_index])

hiruna72 commented 3 months ago

Hello @JeremyQuo,

You are referring to an outdated code. The latest is here on dev branch. https://github.com/hiruna72/squigualiser/blob/2f24ff877f12c70c17cfca476201aa921e150c04/src/calculate_offsets.py#L260 Could you please go throgh it an update the questions. Sorry for any inconvenience.

Generally I can answer,

Depending on the probabilistic behaviour of the basecaller, we can expect different base shifts for different reads. However, the majority of the reads of a given dataset should have the same base shift.
For better accuracy I go though all possible kmers. It is not a requirement. But it serves a purpose. When you increment the kmer length by 1 and do the calculations again to find the base shift; the newly caculated baseshift should be equal to the previous base shift from the previous iteration + 1. I validate this argument at this end https://github.com/hiruna72/squigualiser/blob/2f24ff877f12c70c17cfca476201aa921e150c04/src/calculate_offsets.py#L271

JeremyQuo commented 3 months ago

I just updated the function, you know that I am still improving my tool called nanoCEM to adapt f5c. My reviewer recommended finding the most contributing base in this link, and then I found the calculate_base_shift function.

Now I realize that the function is updated. So, forget my previous question.

Is there any function available to help me find the best base shift and the most contributing base per read?

During the development of nanoCEM, you and your group provided invaluable assistance and answered numerous questions. May I include your group in the acknowledgments section?

hiruna72 commented 3 months ago

Hello @JeremyQuo,

Thank you for acknowledging us!

Any event alignment that used a kmer model will have the same most contributing base as the kmer model.
The move tables will have different most contributing bases that depend on the basecalling model. And because of the probabilistic nature the most seen contributing base index will change slightly for some reads. In my case per read base correction was not necessary.
We have listed some such most contributing bases here. https://github.com/hiruna72/squigualiser/blob/main/docs/profiles.md
You can get different alignments outputs from different alignment tools and use squigualiser plot tracks tool to see if the bases align to each other across the alignments. Again note that I have not implemented per read base correction for move table based alignments.* dna_r10.4.1_e8.2_400bps_sup.cfg_evligned_forward_and_realign_forward.zip
To further read and incorporate any code to nanoCEM I suggest you to read our preprint's Supplementary Note 3. https://www.biorxiv.org/content/10.1101/2024.02.19.581111v2
Also Uncalled4 has a different way to represent the most contributing base using heat maps. https://www.biorxiv.org/content/10.1101/2024.03.05.583511v1

*Actually the move table based alignment in the example provided can be seen going left and right at some points for some reads. But the reads get aligned to each other after some time, then they go unaligned again and so on. Hence, per read base correction will not work in my opinion.

hiruna72 / squigualiser

Problem of caculate_base_base_shift #61