Open bpanda-dev opened 1 month ago
That is an interesting observation. I will just point out that the HERRO window size (for inference) is 4096 by default. I wonder if the regular spikes you see fall at multiples of that number. If they do, that would be the avenue to explore.
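One quick way to test this idea is to measure how far each spike position (or each read length) sits from the nearest multiple of 4096. The snippet below is only a sketch: the function names, the 100-base tolerance, and the example spike positions are illustrative assumptions, not values from the dataset in this issue.

```python
WINDOW = 4096  # HERRO's default inference window size, as stated in this thread


def offset_to_nearest_multiple(length, window=WINDOW):
    """Signed distance from `length` to the nearest multiple of `window`."""
    remainder = length % window
    return remainder if remainder <= window // 2 else remainder - window


def fraction_near_multiples(lengths, window=WINDOW, tolerance=100):
    """Fraction of lengths within `tolerance` bases of a multiple of `window`."""
    if not lengths:
        return 0.0
    near = sum(
        1 for n in lengths
        if abs(offset_to_nearest_multiple(n, window)) <= tolerance
    )
    return near / len(lengths)


# Hypothetical spike positions read off a histogram (illustrative values only).
spikes = [4090, 8200, 12280, 16390]
print([offset_to_nearest_multiple(s) for s in spikes])
print(fraction_near_multiples(spikes))
```

As a baseline, if read lengths were spread evenly with respect to the window size, only about 201/4096 (roughly 5%) of them would land within 100 bases of a multiple of 4096, so a much larger fraction would point to the window size.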
@sivico26,
Yes, it seems to be an artefact of the model window size, since these spikes occur at or near multiples of 4096 (please refer to the attached figure, which uses a different dataset from the plot in my comment above).
I had some more questions:
Thank You.
Hi @bpanda-dev
This is likely explained by the following excerpt from the consensus subsection of the methods section in the preprint (which was posted one day after your question).
If a window contains fewer than two alignments, the window is discarded, and the read is split.
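To see why this rule would produce fragment lengths clustered around multiples of the window size, here is a toy model of it. This is not HERRO's actual implementation: `split_read`, its parameters, and the per-window alignment counts are hypothetical. It assumes fixed 4096-base windows laid end to end along the read, and it ignores the partial final window and any length changes introduced by the correction itself.

```python
def split_read(window_alignment_counts, window=4096, min_alignments=2):
    """Return (start, end) intervals of the fragments that survive.

    Toy model: window i covers read coordinates [i*window, (i+1)*window).
    A window with fewer than `min_alignments` alignments is discarded,
    and the read is split at that point.
    """
    fragments = []
    start = None  # start coordinate of the fragment currently being built
    for i, count in enumerate(window_alignment_counts):
        if count >= min_alignments:
            if start is None:
                start = i * window
        elif start is not None:
            # Discarded window: close the fragment that ended just before it.
            fragments.append((start, i * window))
            start = None
    if start is not None:
        fragments.append((start, len(window_alignment_counts) * window))
    return fragments


# Hypothetical alignment counts for eight consecutive windows of one read.
counts = [5, 4, 1, 6, 7, 3, 0, 2]
frags = split_read(counts)
print(frags)
print([end - begin for begin, end in frags])
```

In this simplified model every fragment length is an exact multiple of 4096. In real data the corrected fragments would be expected to land near, rather than exactly on, those multiples.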
Hi, I have a query regarding the following observation I made with HERRO-corrected reads:
Reads get split at certain intervals more frequently.
In the following plots of read length distributions, we compare: (A) raw reads vs. preprocessed reads (the output of HERRO's preprocess.sh script, included to separate the effect of porechop and duplex-tools from HERRO inference); (B) raw reads vs. hifiasm-ec reads; (C) raw reads vs. HERRO-corrected reads; (D) the HERRO-corrected read length distribution, with the bin size chosen so that the spikes are visible.
In panel (D) of the above figure, there are spikes at certain bins (meaning more reads of that length fall in those bins), and they appear at approximately regular intervals. This was not seen in the raw reads, only in the HERRO-corrected reads.
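For anyone who wants to reproduce this kind of check, a minimal sketch of the binning and spike detection is given below. The 256-base bin width, the threshold of three times the neighbouring bins, and the helper names are assumptions for illustration; they are not the settings used for the figure.

```python
from collections import Counter


def length_histogram(lengths, bin_width=256):
    """Count reads per bin; each key is the lower edge of its bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)


def spike_bins(hist, bin_width=256, factor=3.0):
    """Bins whose count exceeds `factor` times the mean of their two neighbours."""
    spikes = []
    for edge in sorted(hist):
        left = hist.get(edge - bin_width, 0)
        right = hist.get(edge + bin_width, 0)
        neighbours = (left + right) / 2
        # Use a floor of 1 so that isolated, nearly empty bins are not flagged.
        if hist[edge] > factor * max(neighbours, 1):
            spikes.append(edge)
    return spikes


# Tiny made-up example: most reads fall in the 256-511 bin.
example_lengths = [100] * 2 + [300] * 10 + [600] * 2
hist = length_histogram(example_lengths)
print(spike_bins(hist))
```

The bin width matters here: bins that are too wide would smooth the spikes away, which may be why they are only visible in panel (D).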
Is the more frequent splitting of reads at certain intervals caused by the GPU or by the model?
Experiment Information: Tools :
Data:
aws --no-sign-request s3 ls s3://ont-open-data/giab_2023.05/analysis/hg002/hac/PAO89685.pass.cram
We refer to this small read dataset as the raw reads here.
Edit: Removed read statistics table due to errors.
Thank You, Bikram Kumar Panda CDS, IISc