nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Plasmid polyA calling #608

Closed chatla01 closed 4 months ago

chatla01 commented 8 months ago

Hi @tijyojwad,

I am using the SQK-RBK114.24 kit with the Dorado basecaller for plasmid sequencing. Some of my plasmids contain a polyA tail of 80 to 120 bp. The Dorado basecaller is not calling the full polyA homopolymers. I have attached a screenshot in which the alignment is missing 50 to 75 bp of the polyA tails.

[Image: Screenshot 2023-11-21 at 12 09 33 PM]

tijyojwad commented 8 months ago

Thank you for posting this! I've marked this as a feature request in dorado and we're planning to start on this work soon.

palakpsheth commented 8 months ago

@tijyojwad we also see very similar alignments. Happy to share data privately.

tijyojwad commented 8 months ago

Hi @chatla01 @palakpsheth - currently, the polyA tail estimation support in dorado outputs the estimated polyA tail length as a BAM tag. This is what I had in mind for the plasmid use case too. But in this approach we don't update the actual basecalled sequence, so alignments will still show the gap you see in the screenshot above, and each record in the BAM will have a pt:i tag with the polyA length.

Would this approach work for you?

Having the basecaller output the correct long homopolymer basecall is a much more involved effort that we won't be able to address in a release or two (although we're constantly working on improving basecalling of homopolymers).
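For readers who want to pull the pt:i value out of the records, here is a minimal sketch that reads SAM-formatted text and returns the tag per read. It assumes the BAM has already been converted to SAM text (for example with `samtools view`); the function names and the example records are made up for illustration and are not part of dorado.

```python
from typing import Iterable, Iterator, Optional, Tuple


def poly_a_length(sam_line: str) -> Optional[int]:
    """Return the pt:i value of one SAM record, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are the mandatory SAM fields; optional tags follow.
    for tag in fields[11:]:
        if tag.startswith("pt:i:"):
            return int(tag[len("pt:i:"):])
    return None


def iter_poly_a(lines: Iterable[str]) -> Iterator[Tuple[str, Optional[int]]]:
    """Yield (read name, polyA length or None) for every alignment record."""
    for line in lines:
        # Skip blank lines and SAM header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        read_name = line.split("\t", 1)[0]
        yield read_name, poly_a_length(line)


# Made-up records for illustration only; real reads will look different.
example = [
    "@HD\tVN:1.6\tSO:unknown",
    "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tpt:i:97",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t!!!!\tNM:i:0",
]
results = list(iter_poly_a(example))
print(results)
```

To run this on real data, the same `iter_poly_a` function could be fed `sys.stdin` with the output of `samtools view` piped in. Reads without a pt:i tag come back as `None` rather than 0, so "no estimate" stays distinguishable from a zero-length tail.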

chatla01 commented 8 months ago

Hi @tijyojwad. Thank you for working on it. For my use case I would prefer the correct basecalled sequence in the fastq file. It is good to know I can check the BAM file. Is there any way we can incorporate this pt:i BAM tag information in wf-clone-validation? If it is too complicated, I will wait for the next one or two iterations, when I can see the polyA in the fastq files.

tijyojwad commented 8 months ago

Hi @chatla01 - we unfortunately don't have a timeline for fixing the basecalls in the fastq - that's still a research problem.

is there any way we can incorporate this pt:i BAM tag information in wf-clone-validation?

I can talk to our workflows team to see if this is possible. I would imagine so.

palakpsheth commented 8 months ago

@tijyojwad, the pt:i tag approach you describe above will work for our immediate needs, thank you. Would it also be possible to add the start position on the read where the polyA was detected?

VBHerrenC commented 8 months ago

I'd love to see poly(A) support for plasmids too! Ultimately having the basecall would be most helpful but having the pt:i BAM tag for now would also be very helpful. Thanks for all the work you're doing on this!

yul96 commented 8 months ago

Indication of polyA in the pt:i tag would also be helpful. This is a repost of my previous post #100. Thank you Dorado team!

tijyojwad commented 7 months ago

Hi @yul96

Indication of polyA in the pt:i tag would also be helpful

What do you mean by this? The pt:i tag currently outputs the estimated length of the polyA.

yul96 commented 7 months ago

Hi @tijyojwad, Dorado currently does not estimate polyA in plasmid DNA. What I mean is that, even if there are no plans to call the polyA in the basecaller itself, it would still be very useful to add the polyA information in the pt:i tag for plasmid DNA. I hope this is clear. Thanks.

VBHerrenC commented 6 months ago

Hi @tijyojwad, dorado 0.6.0 looks awesome, thanks for all of your hard work. Some great features! Is the poly(A) estimation for plasmids included in this update? It looks like there are some additional files, but I don't see it mentioned in the changelog. Thanks!

tijyojwad commented 6 months ago

Hi @VBHerrenC - we're working on the feature but it's not ready for primetime, so we haven't exposed a way to use it yet.

VBHerrenC commented 6 months ago

Totally understood, thanks for the reply! @tijyojwad

tijyojwad commented 4 months ago

Hi all - we have released initial support for plasmid polyA calling with dorado 0.7.0. Please refer to this document to set up the config file correctly for plasmid polyA estimation.

We're looking forward to your feedback on how well this is working. I'm closing this ticket since it's an old one but please feel free to create new ones with your results/questions!

dweemx commented 4 months ago

Hi @tijyojwad,

Great that plasmid poly(A) estimation is implemented in dorado! Would the method work if there is no plasmid_rear_flank defined (i.e. in the case where the plasmid is linearised just after the poly(A))? In that case, should this parameter be left empty (i.e. plasmid_rear_flank = "")?

tijyojwad commented 4 months ago

Hi @dweemx - good question. Right now it's designed to require both front and rear flanks, so I would just put several bases from whatever would follow the polyA if it was not linearized. But dorado should handle the case where linearization happens just after the polyA.
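As a rough illustration of picking "several bases" on either side of the polyA, the sketch below takes the expected plasmid sequence, finds the longest run of A, and reports the bases immediately before and after it. Everything here is an assumption for illustration: the helper name, the default flank length of 12, and the toy sequence are invented; only `plasmid_rear_flank` appears in this thread, and `plasmid_front_flank` is assumed by analogy. The sketch treats the sequence as linear (no wrap-around on a circular plasmid) and looks only at the given strand. Check the dorado documentation for the actual config file layout.

```python
import re
from typing import Tuple


def flanks_around_longest_a_run(plasmid: str, flank_len: int = 12) -> Tuple[str, str]:
    """Return (front, rear) flanks of the longest run of A in a linear sequence."""
    seq = plasmid.upper()
    runs = list(re.finditer(r"A+", seq))
    if not runs:
        raise ValueError("no A found in sequence")
    # Pick the longest A run; on ties, the first one in the sequence wins.
    longest = max(runs, key=lambda m: m.end() - m.start())
    front = seq[max(0, longest.start() - flank_len):longest.start()]
    rear = seq[longest.end():longest.end() + flank_len]
    return front, rear


# Toy sequence for illustration only: 13 bases, a 90 bp polyA, then 12 bases.
toy_plasmid = "GGCTAGCTTGACC" + "A" * 90 + "GTCGACTTCGGA"
front, rear = flanks_around_longest_a_run(toy_plasmid, flank_len=8)

# Key names as discussed in this thread (front flank name is an assumption).
print(f'plasmid_front_flank = "{front}"')
print(f'plasmid_rear_flank = "{rear}"')
```

If the plasmid is linearised right after the polyA, the rear flank would be taken from the unlinearised reference, as suggested in the reply above.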

chatla01 commented 4 months ago

Hi @tijyojwad, I am assuming this doesn't work for plasmids prepped with the Rapid Barcoding Kit? The cut site could be anywhere.

tijyojwad commented 4 months ago

Hi @chatla01 - it should; you'll need to specify the plasmid sequence you expect to flank the polyA. Is that information you'll have?