nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Large number of unclassified reads after demuxing #979

Closed Ishrektd closed 2 months ago

Ishrektd commented 3 months ago

I am attempting to basecall using dorado, and I have a question about which kit I should provide to the --kit-name flag when using dorado basecaller.

In my library preparation, I used SQK-LSK110 chemistry with the PBC-096 Barcoding Kit to sequence a pooled library of 16S amplicons. Sequencing was done on a MinION Flow Cell R9.4.1 and the resulting output was in .pod5 format.

When attempting to basecall, I used the dna_r9.4.1_e8_sup@v3.6 model and for my kit, I provided EXP-PBC096 as that's the barcoding kit I used in my experiment. However, although this did work, I'm unsure if this is correct because when setting the parameters to trim all adapters/primers/barcodes by default and then demultiplexing, the resulting unclassified.bam file was 7.7G (total output was 7.8G).
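For scale, 7.7G out of 7.8G means roughly 99% of the output by size ended up unclassified. A quick way to see this split per file is sketched below. It is only an illustration: the function name `demux_size_report` is hypothetical, it assumes the demultiplexed files end in `.bam`, and file size is only a rough proxy for read count.

```shell
# Hypothetical helper: print each BAM's share of the total bytes in a demux directory.
demux_size_report() {
    local dir total size f
    dir="$1"
    total=0
    # First pass: total size of all BAM files in the directory.
    for f in "$dir"/*.bam; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    [ "$total" -gt 0 ] || { echo "no BAM data in $dir" >&2; return 1; }
    # Second pass: report each file's percentage of the total (integer division).
    for f in "$dir"/*.bam; do
        [ -e "$f" ] || continue
        size=$(wc -c < "$f")
        printf '%s\t%d%%\n' "$(basename "$f")" $((100 * size / total))
    done
}
```

It could be called as `demux_size_report "$demuxdir"` once demultiplexing has finished.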

To my knowledge, the kit-name should refer to the barcoding kit, and the chemistry should be indicated by the basecalling model selected (in this case, e8 refers to SQK-LSK110 for flow cell r9.4.1).

In this case, would my process be correct?

For reference, here is my full code:

### Step 1.) Source config:
echo "Step 1: set up config"
config="/path/to/config"

### Step 2.) Dorado to basecall data 
echo "Step 2: Set up variables"
model=dna_r9.4.1_e8_sup@v3.6  
kitname=EXP-PBC096 
reference_index=FASTQ 
demuxdir="/path/to/demux/directory"
data="/path/to/pod5"
sample="/path/to/sample/sheet"
modir="/path/to/model"

## NOTE: Dorado will automatically try to detect and remove adapter sequences

### Step 3: Run dorado basecaller
echo "Step 3: Run dorado basecaller"
cd "$modir"
dorado basecaller "$model" "$data" --kit-name "$kitname" > 16S.bam

### Step 4: Run dorado demux
echo "Step 4: Run dorado demux"
cd "$modir"
dorado demux --kit-name "$kitname" --sample-sheet "$sample" --output-dir "$demuxdir" 16S.bam

echo "Done"

malton-ont commented 2 months ago

Hi @Ishrektd,

As you have already classified the barcodes during basecalling, you should use the --no-classify option with dorado demux instead of specifying --kit-name. The barcodes were trimmed during the first step, so there are no barcodes left to identify in the second step.
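Applied to the script above, Step 4 might look like the following sketch. It assumes the `16S.bam` produced in Step 3 and the `$demuxdir` variable from the original post; the wrapper function `demux_preclassified` is a hypothetical name used only for illustration.

```shell
# Hypothetical wrapper: split an already-classified BAM by barcode
# without re-running barcode classification.
demux_preclassified() {
    local bam outdir
    bam="$1"
    outdir="$2"
    [ -f "$bam" ] || { echo "input BAM not found: $bam" >&2; return 1; }
    # --no-classify reuses the barcode assignments written by dorado basecaller.
    dorado demux --no-classify --output-dir "$outdir" "$bam"
}
```

Usage would be `demux_preclassified 16S.bam "$demuxdir"`, run from the directory containing the basecalled BAM.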

Ishrektd commented 2 months ago

I see, thank you! I ended up redoing it with simple basecalling before your comment: dorado basecaller sup pod5s --no-trim > calls.bam

After this, I demultiplexed using the following flags (argument values omitted here): dorado demux --kit-name --barcode-both-ends --sample-sheet --emit-fastq --output-dir calls.bam

In this case, the barcodes, adapters, and primers would not have been trimmed during basecalling, but the demultiplexing step should trim adapters/primers/barcodes... would this process be correct?

malton-ont commented 2 months ago

@Ishrektd,

dorado demux will trim barcodes, but it does not perform additional trimming of adapters/primers. So if no barcode is found (or if the barcodes are outboard of the adapters!), the adapters/primers will not be trimmed. If this is an issue for you, follow the suggestion I gave above:

dorado basecaller sup pod5s --kit-name $kitname --barcode-both-ends --sample-sheet $sample > calls.bam
dorado demux --no-classify --emit-fastq --output-dir $demuxdir calls.bam

Ishrektd commented 2 months ago

Thank you for your help! I will try this out 🙂

malton-ont commented 2 months ago

Another option would be to do as you had, but add a call to dorado trim for each demuxed file.
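A minimal sketch of that per-file trimming step follows. It assumes the demultiplexed files are BAMs (that is, demux was run without --emit-fastq) and that `dorado trim` writes its result to standard output like the other subcommands in this thread. The function name `trim_demuxed` and the `trimmed` output subdirectory are hypothetical, and depending on the dorado version further options may be needed to make it trim the right adapters and primers.

```shell
# Hypothetical helper: run `dorado trim` on every demultiplexed BAM in a directory.
trim_demuxed() {
    local indir outdir f
    indir="$1"
    outdir="$2"
    mkdir -p "$outdir" || return 1
    for f in "$indir"/*.bam; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        # Assumes dorado trim writes the trimmed reads to stdout.
        dorado trim "$f" > "$outdir/$(basename "$f")" || return 1
    done
}
```

For example, `trim_demuxed "$demuxdir" "$demuxdir/trimmed"` would leave the original per-barcode files untouched and write trimmed copies alongside them.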