New sequencing artefacts in latest basecallers?

santiago-es commented 1 year ago

Hello,

I was wondering if new sequencing artefacts beyond what was described in the Genome Biology paper have been discovered now that Guppy has undergone several iterations since the version used in the publication.

For example, in many reads that are well anchored to the subtelomere, I often see degenerate telomere sequence that looks like this:

Often it will be immediately adjacent to canonical telomere repeats. In this case, this is the sequence I see on the C-rich strand.

This data was basecalled with the Guppy 6.3 HAC PromethION basecaller (R9.4.1 chemistry, 400bps).

Would there be a way to test whether this is truly an artefactual telomere using your basecaller?

ktan8 commented 1 year ago

Hi Santiago,

That's a very interesting observation you have made here with Guppy 6.3 HAC. I did not test this model in our initial paper, and therefore have not observed this myself.

Given that these sequences are well-anchored to the subtelomere, my initial suspicion is that they are repeat calling errors as well.

One way to check if this is the case is to plot out the theoretical current profiles of (CCTGG)n repeats and (CCCTAA)n repeats, similar to what we have done in Supplementary Figure S15 of our Genome Biology paper. If you see that these sequences share similar current profiles, it would be a good indication that they were miscalled. Also, you would want to double check if these (CCTGG)n sequences are only observed on a single strand of the nanopore sequencing data. This would give you a sense of whether they are repeat artefacts as well.

Hope this helps.

KT

santiago-es commented 1 year ago

Thanks for replying! I noticed that in the S15 figure you used simulated current values that are known for the k-mers, but how were those obtained? I'm trying to do as you suggest and compare the theoretical current profiles.

Also, I noticed this sequence since it was only one base away from one of the "canonical" miscalls you discovered (CCCTAA -> CCCTGG).

ktan8 commented 1 year ago

Hi Santiago,

You can get the theoretical current values from the Nanopore repository:

https://github.com/nanoporetech/kmer_models/blob/master/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model
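For anyone following along, here is a minimal sketch of reading such a table into a lookup from 6-mer to mean current. It assumes the file is whitespace-delimited, with the k-mer in the first column and the mean level in the second, possibly under a header row that starts with `kmer`; please check this against the actual file before relying on it. `load_kmer_model` is a hypothetical helper name, and the toy rows below use placeholder numbers, not real pore-model values.

```python
import io

def load_kmer_model(handle):
    """Return {kmer: mean_level} from a whitespace-delimited model table.

    Assumption: column 1 is the k-mer and column 2 is its mean current.
    """
    model = {}
    for line in handle:
        fields = line.split()
        # Skip blank lines, comment lines, and a header row such as "kmer level_mean ...".
        if len(fields) < 2 or fields[0].startswith("#") or fields[0] == "kmer":
            continue
        model[fields[0]] = float(fields[1])
    return model

# Toy input with placeholder numbers (NOT real pore-model values).
toy_table = io.StringIO(
    "kmer\tlevel_mean\tlevel_stdv\n"
    "AAAAAA\t1.0\t0.1\n"
    "AAAAAC\t2.5\t0.1\n"
)
toy_model = load_kmer_model(toy_table)
```

With a downloaded copy of the file, the same function would be called as `load_kmer_model(open(path))`, where `path` is wherever you saved it.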

Regards,

KT

santiago-es commented 1 year ago

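Thanks! [image: image] https://user-images.githubusercontent.com/10642529/239393701-2c65c758-2f38-474f-bfb2-7f6b4107eabb.png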

I made two theoretical molecules, each about 120 bp long, of CCTAAA or CCTGG repeats, then scanned them in 6 bp windows and plotted the theoretical mean current for each window on the y axis. Not exactly what you have done in S15, but it looks like they might have a fairly similar structure. Perhaps at the higher translocation speeds the overlap is very close? If the speed is a factor, it may explain why it's so close to CCCTGG.

Indeed, they are very similar, as might be expected from a 5-mer vs. a 6-mer of a nearly identical sequence.
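A rough sketch of the sliding-window scan described above, in case it helps others reproduce it. It tiles a repeat unit to about 120 bp and looks up the model mean for every 6-mer window. The function names are hypothetical, the repeat units are the (CCTGG)n and (CCCTAA)n suggested earlier in the thread, and `toy_model` is filled with arbitrary placeholder numbers only so the block runs on its own; real values would come from a pore-model table.

```python
from itertools import product

def repeat_sequence(unit, length=120):
    """Tile `unit` and trim the result to `length` bases."""
    return (unit * (length // len(unit) + 1))[:length]

def theoretical_profile(seq, model, k=6):
    """Mean model current for every k-mer window along `seq`."""
    return [model[seq[i:i + k]] for i in range(len(seq) - k + 1)]

# Placeholder model: arbitrary numbers so the sketch runs standalone.
# Replace with real 6-mer -> mean-current values from a pore-model table.
toy_model = {"".join(p): float(i) for i, p in enumerate(product("ACGT", repeat=6))}

profile_5mer = theoretical_profile(repeat_sequence("CCTGG"), toy_model)
profile_6mer = theoretical_profile(repeat_sequence("CCCTAA"), toy_model)
```

Because the input is a pure tandem repeat, each profile is periodic with the length of its repeat unit, so a single period already contains everything there is to compare.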

ktan8 commented 1 year ago

Hi Santiago,

Yes. They look quite similar to me. If you were to apply a vertical shift in the current profile, and a slight horizontal scaling, the current profiles will likely overlap almost perfectly with each other.
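One crude way to put a number on that, as a sketch rather than the method used in the paper: stretch both profiles onto a common grid by linear interpolation (the horizontal scaling), subtract each profile's mean (the vertical shift), and take the root-mean-square difference. All names here are hypothetical, and the sketch assumes the two inputs start at the same phase of the repeat; it does not search over phase offsets.

```python
def resample(values, n):
    """Linearly interpolate `values` onto n evenly spaced points."""
    if n == 1 or len(values) == 1:
        return [values[0]] * n
    step = (len(values) - 1) / (n - 1)
    out = []
    for j in range(n):
        x = j * step
        i = min(int(x), len(values) - 2)  # index of the left neighbour
        frac = x - i
        out.append(values[i] * (1 - frac) + values[i + 1] * frac)
    return out

def centered(values):
    """Remove the mean, i.e. apply the best constant vertical shift."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def rms_after_alignment(a, b, n=100):
    """RMS difference after horizontal rescaling and vertical shifting."""
    ra = centered(resample(a, n))
    rb = centered(resample(b, n))
    return (sum((x - y) ** 2 for x, y in zip(ra, rb)) / n) ** 0.5

# Illustrative inputs only: a ramp, and the same ramp stretched and shifted up.
ramp = [1.0, 2.0, 3.0, 4.0, 5.0]
stretched_shifted = [10.0 + v for v in resample(ramp, 6)]
same_shape = rms_after_alignment(ramp, stretched_shifted)
opposite_shape = rms_after_alignment(ramp, ramp[::-1])
```

A value near zero means the two shapes coincide once shifted and scaled; what counts as "small" for real current profiles would have to be judged against the noise in the model.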

Also, as mentioned, if you do not observe the reverse-complementary sequence (i.e. CCAGG) on reads originating from the other strand of DNA, then the (CCTGG)n repeats are likely artefacts of the basecalling process.
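A minimal sketch of that strand check, assuming the reads are plain sequence strings in their original basecalled orientation (for example straight from FASTQ; note that aligners store reverse-strand reads reverse-complemented in SAM/BAM). The helper names and the `min_copies` threshold are hypothetical choices for illustration.

```python
import re
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def count_repeat_reads(reads, motif="CCTGG", min_copies=4):
    """Count reads holding >= min_copies tandem copies of motif or its reverse complement."""
    rc = reverse_complement(motif)
    fwd = re.compile("(?:%s){%d,}" % (re.escape(motif), min_copies))
    rev = re.compile("(?:%s){%d,}" % (re.escape(rc), min_copies))
    counts = Counter()
    for seq in reads:
        if fwd.search(seq):
            counts[motif] += 1
        if rev.search(seq):
            counts[rc] += 1
    return counts

# Made-up reads purely to exercise the function.
example_reads = [
    "TTAGGG" + "CCTGG" * 5 + "ACGT",  # tandem CCTGG
    "CCAGG" * 4,                      # tandem CCAGG
    "CCTGGCCTGG",                     # too few copies to count
]
counts = count_repeat_reads(example_reads)
```

If real data give many reads with tandem CCTGG and essentially none with tandem CCAGG, that asymmetry is the single-strand signature described above.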

Hope this helps.

santiago-es commented 1 year ago

That makes sense, thanks so much for your help

santiago-es commented 1 year ago

Hey again, just thought I'd ask here if there are any plans to release a version of the telomere basecaller for the new R10 chemistry that is replacing the R9 this coming year.
