Observation: Reads get split at certain intervals more often.

bpanda-dev commented 1 month ago

Hi, I have a question about the following observation in HERRO-corrected reads:

Reads get split at certain intervals more frequently.

The following plots compare read length distributions: (A) raw reads vs. preprocessed reads (the output of HERRO's preprocess.sh script, included to separate the effect of porechop and duplex-tools from HERRO inference); (B) raw reads vs. hifiasm-ec reads; (C) raw reads vs. HERRO-corrected reads; (D) the HERRO-corrected read length distribution, with the bin size chosen so that the spikes in the distribution are visible.

[Image: Screenshot 2024-05-20 at 11 45 25 AM. Please open the image in a new tab to zoom.]

In panel (D) of the figure above, there are spikes at certain bins (meaning more reads of those particular lengths), and they appear at approximately regular intervals. This pattern is not present in the raw reads; it appears only in the HERRO-corrected reads.

Is this more frequent splitting of reads at certain intervals caused by the GPU or by the model?

Experiment Information:

Tools:

NanoPlot (https://github.com/wdecoster/NanoPlot): Used to get the read statistics.
Minimap2: Used to create the small dataset of raw reads mapping to hg002 chr19 (both haplotypes).
Hifiasm: Used to get the hifiasm-ec reads and compare the HERRO reads against them.

Data:

We created a small dataset of all reads mapping to hg002 chr19 (both haplotypes) from the ONT read data described at https://labs.epi2me.io/giab-2023.05/, using the file listed by aws --no-sign-request s3 ls s3://ont-open-data/giab_2023.05/analysis/hg002/hac/PAO89685.pass.cram. We refer to this small dataset as the raw reads here.

Edit: Removed read statistics table due to errors.

Thank You, Bikram Kumar Panda CDS, IISc

sivico26 commented 1 month ago

That is an interesting observation. I will just point out that HERRO's window size (for inference) is 4096 by default. I wonder whether the regular spikes you see fall at multiples of that number. If they do, that would be the avenue to explore.
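
One way to check this numerically, rather than by eye from a histogram, is to measure how far each corrected read length is from the nearest multiple of the window size. The sketch below is only an illustration: the function names, the tolerance of 50 bases, and the assumption that read lengths are already available as a list of integers are all hypothetical and not part of HERRO or NanoPlot.

```python
WINDOW = 4096  # default HERRO inference window size, as stated in the comment above


def distance_to_multiple(length, window=WINDOW):
    """Distance from `length` to the nearest positive multiple of `window`."""
    # Nearest multiple index, forced to be at least 1 so that very short
    # reads are not counted as being "near" the multiple 0.
    k = max(1, (length + window // 2) // window)
    return abs(length - k * window)


def fraction_near_multiples(lengths, window=WINDOW, tol=50):
    """Fraction of reads whose length is within `tol` bases of a multiple of `window`."""
    if not lengths:
        return 0.0
    near = sum(1 for n in lengths if distance_to_multiple(n, window) <= tol)
    return near / len(lengths)


def uniform_baseline(window=WINDOW, tol=50):
    """Fraction expected if lengths were spread uniformly modulo `window`.

    Assumption: this ignores reads shorter than half a window.
    """
    return (2 * tol + 1) / window


if __name__ == "__main__":
    # Hypothetical lengths, for illustration only (not real data).
    lengths = [4096, 8200, 5000, 12280]
    print(fraction_near_multiples(lengths), uniform_baseline())
```

If the fraction for the corrected reads is well above the uniform baseline while the fraction for the raw reads is not, that would support the window-size explanation.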

bpanda-dev commented 1 month ago

@sivico26, yes, it seems to be an artefact of the model window size, since the spikes are at or around multiples of 4096. (Please refer to the attached figure, which uses a different dataset from the plot in my comment above.) [Image: read_length_distribution_around_4096_multiples]

I had some more questions:

How does this splitting pattern affect the output assembly, now that we have shorter reads?
Why are only some raw reads split this way and not all? Does this happen when the read error rate is high at the edges of windows (length 4096)?

Thank You.

pgonzale60 commented 1 week ago

Hi @bpanda-dev

This is likely explained by the following excerpt from the consensus subsection of the methods section in the preprint (which was posted one day after your question):

If a window contains fewer than two alignments, the window is discarded, and the read is split.
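
The rule quoted above can be illustrated with a toy model. This is a sketch under stated assumptions, not HERRO's actual implementation: it assumes fixed, non-overlapping windows of 4096 bases laid out along the read and a hypothetical list giving the number of alignments covering each window. The function name `split_read` and its parameters are made up for illustration.

```python
WINDOW = 4096  # assumed window size (HERRO default mentioned earlier in the thread)


def split_read(read_len, alignments_per_window, window=WINDOW, min_alignments=2):
    """Return (start, end) intervals of the read pieces that survive.

    Windows with fewer than `min_alignments` alignments are discarded,
    and the read is split at each discarded window.
    """
    pieces = []
    start = None  # start of the piece currently being extended, if any
    end = None
    for i, n_aln in enumerate(alignments_per_window):
        w_start = i * window
        w_end = min((i + 1) * window, read_len)
        if n_aln >= min_alignments:
            if start is None:
                start = w_start
            end = w_end
        elif start is not None:
            # Discarded window: close the current piece.
            pieces.append((start, end))
            start = None
    if start is not None:
        pieces.append((start, end))
    return pieces


if __name__ == "__main__":
    # Hypothetical 20,000-base read whose third window has a single alignment.
    print(split_read(20000, [5, 4, 1, 6, 3]))
```

In this toy model, a piece that lies between the read start and a discarded window, or between two discarded windows, always has a length that is an exact multiple of 4096, which would be consistent with spikes near those lengths. Corrected sequences need not match the original window lengths exactly, which may be why the observed spikes sit around the multiples and not precisely on them.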

lbcb-sci / herro

Observation: Reads get split at certain intervals more often. #28
