nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/
Other
531 stars 63 forks source link

Using --trim does not remove any bases #607

Open femiliani opened 9 months ago

femiliani commented 9 months ago

I am wondering if I am using the trim function incorrectly. The adaptors remain in the reads regardless of whether I basecall with --trim all or pass the basecalled reads through the trim module. When search for the adaptor sequences i find it in most reads regardless of whether they are untrimmed, basecalled+trimmed, or post-trimmed. additionally, if we trim with dorado, does the MN tag and associated methylation information remain intact?

My code: dorado basecaller --trim all /path/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 ./pod5 --modified-bases 5mCG_5hmCG -l ./readids/readnames.txt -b 350 --min-qscore 10 --emit-moves > basecalled.bam and dorado trim basecalled.sam > trim.bam

My version of dorado: 0.5.2+7969fab

When i try to trim using Porechop: Dorado without trimming: IMG_3206

Dorado with trimming: IMG_3454

Thanks a bunch!

tijyojwad commented 9 months ago

Hi @femiliani - I don't see any issues with your command line and dorado should be trimming bases in this case.

Would you be open to sharing a subset of reads from your dataset for us to take a look? Can you also share what kind of library prep you're using and the read length distribution?

femiliani commented 9 months ago

Of course @tijyojwad, here are a couple of the files: https://drive.google.com/drive/folders/1lKNmQNAM2g0q2Qu3HpDjOn7gXJGUY3Jz?usp=sharing These are shorter reads (~1kb), prepped with SQK-NBD114-24 LengthvsQualityScatterPlot_dot

We've since tested it on another data set (sample was very similar but the preps were with the non-barcoded kits) which we subset and analyzed in batches and have had inconsistent trimming results, despite running the same analysis on all of the batches. Some, when run through porechop, had 90% adapters still present (similar to my original post), others have had 40% or 10% left behind, and in one case none were found. I suspect this last case was because the first few reads porechop examined for adapters were trimmed and so it never determined which adapters to look for, unfortunately we had to take the computer offline for some hardware updates so i couldn't do further analysis to verify that.

Thanks so much for your help, curious as to what might be going on!

tijyojwad commented 9 months ago

Hi @femiliani thank you so much for sharing your data and distribution. we'll take a closer look and get back to you

tijyojwad commented 9 months ago

Hi @femiliani - can you also share some of the non-barcoded data?

jkh00 commented 9 months ago

Hi,

Jumping on this, I also noticed the same issue with using dorado trim, where it doesnt seems to be trimming adapters. When I do:

mod basecalling > dorado trim > validate with porechop

Presence of adapters and barcodes were recorded from porechop

SQK-NSK007_Y_Top: AATGTACTTCGTTCAGTTACGTATTGCT SQK-NSK007_Y_Bottom: GCAATACGTAACTGAACGAAGT BC50: ATGGACTTTGGTAACTTCCTGCGT BC50_rev: ACGCAGGAAGTTACCAAAGTCCAT BC51: GTTGAATGAGCCTACTGGGTCCTC BC51_rev: GAGGACCCAGTAGGCTCATTCAAC BC52: TGAGAGACAAGATTGTTCGTGGAC BC52_rev: GTCCACGAACAATCTTGTCTCTCA

On top of that, I also noticed that while the demultiplexing was successful (using dorado demux), the barcode sequences were not removed from the output and the adapters seems to be trimmed with dorado demux rather than in 'dorado trim'.

When I do:

mod basecalling > dorado trim > dorado demux > validate with porechop

Presence of barcode still persist but not the adapters anymore when running porechop

BC50: ATGGACTTTGGTAACTTCCTGCGT BC50_rev: ACGCAGGAAGTTACCAAAGTCCAT

My dorado version is 0.5.2+7969fab. This is my code for modification basecalling:

dorado basecaller sup@v4.2,6mA,5mC_5hmC --device cuda:2,3 --verbose pod5/barcode50 > barcode50_6mA_5mC_5hmC.bam

From dorado documentation, it mentioned that by default the barcodes and adapters will be remove during the basecalling step, but it seems like they still persist in my modBam, I wonder if there is a bug or did I do the steps wrong somehow. Hope my observations also contribute to this issue!

wchengt commented 9 months ago

Hello! I have also noticed a good portion of adapters not being removed, they seem to be at ends where there are homopolymer artifacts. One thought I had was to allow for dorado to accept a user input of how many bases in front and end of a read to search for adapters. This would be similar to porechops "--end_size" parameter.

Please correct me if im wrong, but it looks like line 21 in AdapterDetector.cpp is setting this threshold to 75bp

const int ADAPTER_TRIM_LENGTH = 75;

I wonder if this is why these adapter sequences are being left untrimmed.

adapter_trimming

tijyojwad commented 9 months ago

Hi @femiliani - my colleague has been looking at your data and will get back to you with findings shortly!

tijyojwad commented 9 months ago

@jkh00 - thanks for the detailed analysis! One pipeline suggestion is to run basecaller cmd with --no-trim if you want to do post basecalling demuxing (since trimming can sometimes affect demux results downstream). Alternatively you can run demux during basecalling itself by adding --kit-name to the basecaller command. Can you try this out?

Presence of barcode still persist but not the adapters anymore when running porechop

This may be possible because dorado only trims barcodes when the confidence threshold is above a certain value. Those thresholds are a bit different than the actual classification thresholds. It's possible that dorado determined there's a barcode to classify, but the thresholds don't meet the trimming requirements and so those get left in. We can look into harmonizing these for sure. I suspect you'll only find a few barcodes left behind, but not all of them.

tijyojwad commented 9 months ago

@wchengt - this is a really great find! Our adapter windows are fixed now, but we can try to make our search more dynamic so it looks at a bigger window progressively. That way you don't have to worry about specifying yet another parameter :) .

kdolan1973 commented 9 months ago

@femiliani I have run some of the data you linked to through dorado version 0.5.2, and I am seeing it trim adapters from approximately 2/3rds of the reads. This is both using the basecaller with trimming enabled, and using dorado trim on data which has been basecalled but not trimmed. One question I have is, when you ran dorado without trimming enabled, does that mean you ran it without using the --trim all option, or that you ran it with the --no-trim option? I ask because --trim all is the default behavior, so to get it to not trim anything you must use --no-trim. If this is the case, then I think what is happening is that porechop is finding the barcodes, which dorado will not have trimmed since you did not use that option. Note also that for the roughly 1/3rd of the reads that dorado did not trim, it appears to me that the issue is that the basecalls for the adapter portion of the read is not very accurate. Our threshold for trimming adapters and primers is quite high (edit distance must be no more than 20% of the adapter sequence length), so it may be that porechop will more aggressively find and trim adapters or primers than we do.