Open femiliani opened 9 months ago
Hi @femiliani - I don't see any issues with your command line and dorado should be trimming bases in this case.
Would you be open to sharing a subset of reads from your dataset for us to take a look? Can you also share what kind of library prep you're using and the read length distribution?
Of course @tijyojwad, here are a couple of the files: https://drive.google.com/drive/folders/1lKNmQNAM2g0q2Qu3HpDjOn7gXJGUY3Jz?usp=sharing These are shorter reads (~1kb), prepped with SQK-NBD114-24
We've since tested it on another data set (sample was very similar but the preps were with the non-barcoded kits) which we subset and analyzed in batches and have had inconsistent trimming results, despite running the same analysis on all of the batches. Some, when run through porechop, had 90% adapters still present (similar to my original post), others have had 40% or 10% left behind, and in one case none were found. I suspect this last case was because the first few reads porechop examined for adapters were trimmed and so it never determined which adapters to look for, unfortunately we had to take the computer offline for some hardware updates so i couldn't do further analysis to verify that.
Thanks so much for your help, curious as to what might be going on!
Hi @femiliani thank you so much for sharing your data and distribution. we'll take a closer look and get back to you
Hi @femiliani - can you also share some of the non-barcoded data?
Hi,
Jumping on this, I also noticed the same issue with using dorado trim
, where it doesnt seems to be trimming adapters.
When I do:
mod basecalling > dorado trim > validate with porechop
Presence of adapters and barcodes were recorded from porechop
SQK-NSK007_Y_Top: [31mAATGTACTTCGTTCAGTTACGTATTGCT[0m SQK-NSK007_Y_Bottom: [31mGCAATACGTAACTGAACGAAGT[0m BC50: [31mATGGACTTTGGTAACTTCCTGCGT[0m BC50_rev: [31mACGCAGGAAGTTACCAAAGTCCAT[0m BC51: [31mGTTGAATGAGCCTACTGGGTCCTC[0m BC51_rev: [31mGAGGACCCAGTAGGCTCATTCAAC[0m BC52: [31mTGAGAGACAAGATTGTTCGTGGAC[0m BC52_rev: [31mGTCCACGAACAATCTTGTCTCTCA[0m
On top of that, I also noticed that while the demultiplexing was successful (using dorado demux
), the barcode sequences were not removed from the output and the adapters seems to be trimmed with dorado demux
rather than in 'dorado trim'.
When I do:
mod basecalling > dorado trim > dorado demux > validate with porechop
Presence of barcode still persist but not the adapters anymore when running porechop
BC50: [31mATGGACTTTGGTAACTTCCTGCGT[0m BC50_rev: [31mACGCAGGAAGTTACCAAAGTCCAT[0m
My dorado version is 0.5.2+7969fab
. This is my code for modification basecalling:
dorado basecaller sup@v4.2,6mA,5mC_5hmC --device cuda:2,3 --verbose pod5/barcode50 > barcode50_6mA_5mC_5hmC.bam
From dorado documentation, it mentioned that by default the barcodes and adapters will be remove during the basecalling step, but it seems like they still persist in my modBam, I wonder if there is a bug or did I do the steps wrong somehow. Hope my observations also contribute to this issue!
Hello! I have also noticed a good portion of adapters not being removed, they seem to be at ends where there are homopolymer artifacts. One thought I had was to allow for dorado to accept a user input of how many bases in front and end of a read to search for adapters. This would be similar to porechops "--end_size" parameter.
Please correct me if im wrong, but it looks like line 21 in AdapterDetector.cpp is setting this threshold to 75bp
const int ADAPTER_TRIM_LENGTH = 75;
I wonder if this is why these adapter sequences are being left untrimmed.
Hi @femiliani - my colleague has been looking at your data and will get back to you with findings shortly!
@jkh00 - thanks for the detailed analysis! One pipeline suggestion is to run basecaller
cmd with --no-trim
if you want to do post basecalling demuxing (since trimming can sometimes affect demux results downstream). Alternatively you can run demux during basecalling itself by adding --kit-name
to the basecaller
command. Can you try this out?
Presence of barcode still persist but not the adapters anymore when running porechop
This may be possible because dorado only trims barcodes when the confidence threshold is above a certain value. Those thresholds are a bit different than the actual classification thresholds. It's possible that dorado determined there's a barcode to classify, but the thresholds don't meet the trimming requirements and so those get left in. We can look into harmonizing these for sure. I suspect you'll only find a few barcodes left behind, but not all of them.
@wchengt - this is a really great find! Our adapter windows are fixed now, but we can try to make our search more dynamic so it looks at a bigger window progressively. That way you don't have to worry about specifying yet another parameter :) .
@femiliani I have run some of the data you linked to through dorado version 0.5.2, and I am seeing it trim adapters from approximately 2/3rds of the reads. This is both using the basecaller with trimming enabled, and using dorado trim on data which has been basecalled but not trimmed.
One question I have is, when you ran dorado without trimming enabled, does that mean you ran it without using the --trim all
option, or that you ran it with the --no-trim
option? I ask because --trim all
is the default behavior, so to get it to not trim anything you must use --no-trim
.
If this is the case, then I think what is happening is that porechop is finding the barcodes, which dorado will not have trimmed since you did not use that option.
Note also that for the roughly 1/3rd of the reads that dorado did not trim, it appears to me that the issue is that the basecalls for the adapter portion of the read is not very accurate. Our threshold for trimming adapters and primers is quite high (edit distance must be no more than 20% of the adapter sequence length), so it may be that porechop will more aggressively find and trim adapters or primers than we do.
I am wondering if I am using the trim function incorrectly. The adaptors remain in the reads regardless of whether I basecall with
--trim all
or pass the basecalled reads through the trim module. When search for the adaptor sequences i find it in most reads regardless of whether they are untrimmed, basecalled+trimmed, or post-trimmed. additionally, if we trim with dorado, does the MN tag and associated methylation information remain intact?My code:
dorado basecaller --trim all /path/dna_r10.4.1_e8.2_400bps_sup@v4.3.0 ./pod5 --modified-bases 5mCG_5hmCG -l ./readids/readnames.txt -b 350 --min-qscore 10 --emit-moves > basecalled.bam
anddorado trim basecalled.sam > trim.bam
My version of dorado:
0.5.2+7969fab
When i try to trim using Porechop: Dorado without trimming:
Dorado with trimming:
Thanks a bunch!