nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

trimming confusion #808

Open JWDebler opened 4 months ago

JWDebler commented 4 months ago

Hi all,

I'm a little confused about what gets trimmed when and what doesn't.

As far as I understand, dorado basecaller trims adapters and barcodes if demultiplexing is turned on.

My workflow works like this:

  1. Simplex calling with demultiplexing.
  2. Extracting the barcoded reads from the bam file and using their read ids to demultiplex the raw pod5s into barcode-specific pod5s.
  3. Duplex calling the individual barcode pod5s and running dorado trim on the resulting bam file.

My question now is, does dorado trim only trim the adapter, or will it also trim the barcodes?

Cheers, Johannes

malton-ont commented 4 months ago

Hi @JWDebler,

The trim command has no concept of barcodes - it will only trim the adapters (and primers, unless --no-trim-primers is specified).

JWDebler commented 4 months ago

Any chance that can be added? Or maybe an option that does something like 'if Adapter trimmed, also trim the next X bases'. Cheers

JWDebler commented 4 months ago

Alternatively, I could just crop 60 bp (NA top + barcode) from each end of an untrimmed read with something like chopper before assembly and skip dorado trim I suppose.
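
The fixed-length crop described above can be sketched in a few lines of Python. This is only an illustration: `crop_ends` is a hypothetical helper name, the 60 bp default is the figure from this comment, and it touches the sequence and quality string only, so any per-base tags (for example modification calls) would not be adjusted.

```python
def crop_ends(seq, qual, n=60):
    """Remove n bases from each end of a read; return None if nothing is left."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality string differ in length")
    if len(seq) <= 2 * n:
        return None  # read too short to survive the crop
    end = len(seq) - n
    return seq[n:end], qual[n:end]


# Toy read: 5 "adapter/barcode" bases on each side of an 8-base insert.
read_seq = "AAAAA" + "CGTACGTA" + "TTTTT"
read_qual = "!" * 5 + "I" * 8 + "!" * 5
print(crop_ends(read_seq, read_qual, n=5))
```

A crop like this removes the same number of bases from every read regardless of whether an adapter or barcode is actually present, which is why it is a heuristic rather than real trimming.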

tijyojwad commented 4 months ago

Hi @JWDebler - that heuristic might work reasonably well (I'd maybe go up to 75).

Alternatively, you can adjust your pipeline to be -

  1. Run dorado basecaller with demux and trimming enabled --> this will call all simplex reads with adapters/barcodes trimmed.
  2. Extract the per-barcode read ids, and then run duplex for each set.
  3. From the duplex output, simply keep the dx:1 reads and merge them with the output from step 1.

This will keep all simplex reads + duplex reads. You can also extract the read ids for dx:0, filter the output of step 1 down to those reads, and merge that with the output of step 3. It's a bit more effort, but it will handle all trimming, etc. correctly. You don't need to run trim on duplex reads because, by virtue of how duplex overlapping is determined, all barcodes/adapters will get trimmed anyway.
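
The merge in step 3 could look roughly like the sketch below. It assumes each read has already been loaded into a plain dict with an `id` and, for the duplex output, a `dx` value (1 for a duplex read, 0 for a simplex read that was not part of a duplex pair). The function name, the record layout and the toy read ids are hypothetical, and reading or writing the BAM files themselves is left out.

```python
def merge_reads(trimmed_simplex, duplex_output, drop_paired_simplex=False):
    """Combine trimmed simplex reads (step 1) with duplex reads (step 3).

    trimmed_simplex: list of dicts from the trimmed simplex run
    duplex_output:   list of dicts from the duplex run, each with a 'dx' value
    drop_paired_simplex: if True, keep only the simplex reads that the duplex
        run marked dx == 0, i.e. reads that did not end up in a duplex pair
    """
    duplex_reads = [r for r in duplex_output if r["dx"] == 1]
    if drop_paired_simplex:
        unpaired_ids = {r["id"] for r in duplex_output if r["dx"] == 0}
        simplex_reads = [r for r in trimmed_simplex if r["id"] in unpaired_ids]
    else:
        simplex_reads = list(trimmed_simplex)
    return simplex_reads + duplex_reads


# Toy data. dx == -1 is assumed here to mark simplex reads that did
# contribute to a duplex read; the ids are made up for illustration.
simplex = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
duplex = [
    {"id": "a;b", "dx": 1},
    {"id": "a", "dx": -1},
    {"id": "b", "dx": -1},
    {"id": "c", "dx": 0},
]
print([r["id"] for r in merge_reads(simplex, duplex)])
print([r["id"] for r in merge_reads(simplex, duplex, drop_paired_simplex=True)])
```

The default keeps every simplex read alongside the duplex reads; setting `drop_paired_simplex=True` corresponds to the dx:0 variant, where simplex parents of duplex reads are left out.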

JWDebler commented 4 months ago

Hmm, good idea. I'm gonna give that a go. I keep my simplex and duplex fastq files separate anyways so I can extract them from separate bams. Any progress on integrating all that into 'dorado duplex'? 😊

tijyojwad commented 4 months ago

Barcoding and trimming in duplex is still planned, but it has been a bit lower priority compared to some other stuff in the pipeline. So it won't make it into the upcoming release, but I'll raise priority on this for the one after that.

JWDebler commented 4 months ago

I used your suggestion above, extracting the trimmed simplex reads from the initial bam. Thanks, this works fine. There is still the odd barcode in there, but overall it looks much better. However, I just had a closer look at my duplex reads, and even though I keep hearing that duplex reads should be free of adapters and barcodes due to the way they are generated, I still have lots of adapters and barcodes left on mine.

tijyojwad commented 4 months ago

> I still have lots of adapters and barcodes left on mine

what barcode kit are you using?

JWDebler commented 4 months ago

SQK-NBD114-24

tijyojwad commented 4 months ago

Thanks! Going through the actual structure -

For the NBD barcode, a typical read would look like this for the template strand

5' - ADAPTER1 - FRONT_FLANK1 - BC - REAR_FLANK1 - DNA - RC(REAR_FLANK2) - RC(BC) - RC(FRONT_FLANK2) - RC(ADAPTER2) - 3'

and the complement strand

5' - ADAPTER2 - FRONT_FLANK2 - BC - REAR_FLANK2 - RC(DNA) - RC(REAR_FLANK1) - RC(BC) - RC(FRONT_FLANK1) - RC(ADAPTER1) - 3'

In this case, when we determine the duplex pair overlaps, the reverse complement of the complement strand would align with the template strand along its full length, flanks, barcodes and adapters included. So here I would actually expect barcodes/adapters to be retained (at least on one end). So theoretically running demux/trim on the duplex output should also work!
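
The NBD argument can be checked symbolically. The sketch below treats each read as a list of named elements taken from the two NBD layouts given here; `rc` and `rc_read` are hypothetical helpers that work on the labels only, not on real sequences.

```python
def rc(element):
    """Reverse-complement one named element: X <-> RC(X)."""
    if element.startswith("RC(") and element.endswith(")"):
        return element[3:-1]
    return f"RC({element})"


def rc_read(layout):
    """Reverse-complement a whole read: reverse the order, flip each element."""
    return [rc(e) for e in reversed(layout)]


template = ["ADAPTER1", "FRONT_FLANK1", "BC", "REAR_FLANK1", "DNA",
            "RC(REAR_FLANK2)", "RC(BC)", "RC(FRONT_FLANK2)", "RC(ADAPTER2)"]
complement = ["ADAPTER2", "FRONT_FLANK2", "BC", "REAR_FLANK2", "RC(DNA)",
              "RC(REAR_FLANK1)", "RC(BC)", "RC(FRONT_FLANK1)", "RC(ADAPTER1)"]

# True if the reverse-complemented complement has the same layout as the template.
print(rc_read(complement) == template)
```

Because the two layouts match element for element, the overlap between the strands is not limited to the DNA insert, which is consistent with barcodes and adapters surviving in NBD duplex reads.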

However, with a different kit such as RBK, both the adapter and the barcode will be trimmed in the duplex pair overlap.

template => 5' - ADAPTER - FRONT_FLANK - BC - REAR_FLANK - DNA - 3'
complement => 5' - FRONT_FLANK - BC - REAR_FLANK - RC(DNA) - RC(ADAPTER) - 3'

So I apologize for the confusion earlier - whether or not barcodes/adapters get trimmed is kit dependent.

Would you be open to sharing a few reads from your duplexed barcoded dataset?

simondrue commented 4 months ago

> Barcoding and trimming in duplex is still planned, but it has been a bit lower priority compared to some other stuff in the pipeline. So it won't make it into the upcoming release, but I'll raise priority on this for the one after that.

@tijyojwad Great to hear this is planned! I really need this as well. An alternative would be to simply add a feature to dorado trim that trims a specific number of bases from the start or end. The problem for me is that other tools only remove the bases from the sequence and quality string, but do not keep the methylation information in sync.

tijyojwad commented 4 months ago

Hi @simondrue - thanks for the feedback! We're working on this now to get it out by the next release.

jonkristoffersen commented 3 months ago

Is trimming barcodes only possible during basecalling? We have untrimmed, already basecalled data (which took a month!). Barcode trimming in dorado trim would be greatly appreciated.

malton-ont commented 3 months ago

Hi @jonkristoffersen

dorado demux will trim barcodes if it is classifying, but not if using --no-classify. It is possible to re-barcode untrimmed barcoded basecall data in order to apply trimming - just ensure that you use v0.7.1 or later so that the BC tag is updated rather than a second one being created.
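
As a rough sketch of the re-barcoding step, the snippet below assembles a `dorado demux` command line from Python. The file name `calls.bam`, the output directory `demuxed` and the kit name are placeholders (the kit is simply the one mentioned earlier in this thread), and `demux_command` is a hypothetical helper; check `dorado demux --help` for the options available in your version.

```python
import shlex


def demux_command(bam_path, kit_name, output_dir):
    """Build the argument list for a classifying (and therefore trimming) demux run."""
    return ["dorado", "demux",
            "--kit-name", kit_name,
            "--output-dir", output_dir,
            bam_path]


cmd = demux_command("calls.bam", "SQK-NBD114-24", "demuxed")
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Note that `--no-classify` is deliberately left out, since barcodes are only trimmed when demux is classifying.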

jonkristoffersen commented 3 months ago

> Hi @jonkristoffersen
>
> dorado demux will trim barcodes if it is classifying, but not if using --no-classify. It is possible to re-barcode untrimmed barcoded basecall data in order to apply trimming - just ensure that you use v0.7.1 or later so that the BC tag is updated rather than a second one being created.

Thanks, that worked!