nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/
Other
458 stars 56 forks source link

capture dulpex rate to file #927

Closed hagen-wende closed 2 weeks ago

hagen-wende commented 3 weeks ago

For duplex basecalling of barcoded data I'm looping over the barcode directories to find the reads in the simplex fastqs. Then I duplex basecall these with the following command:

`/home/vc/ont/dorado_0.4.2/bin/dorado \ duplex \ /home/vc/ont/rerio/dorado_models/res_dna_r10.4.1_e8.2_400bps_sup@2023-09-22_bacterial-methylation \ --read-ids $INPUT_DIR/$barcode/all_reads.txt \ $POD5_DIR \ -r \ --min-qscore 10 \ -x "cuda:0" \

./$DUPLEX_CALLS_DIR/$barcode/duplex_calls.bam`

As a QC step I would like to extract the duplex rate from all barcodes and save them to a file. So I'm trying to capture the stdout output by appending | tee output.txt to the above scipt and would then process the file further. Unfortunately the output is empty. Is there any other way to get the duplex rate?

HalfPhoton commented 3 weeks ago

Hi @hagen-wende, The output bam file contains the dx tag which can be used to calculate the duplex rate. README#duplex

malton-ont commented 3 weeks ago

In your command, stdout is being redirected to the bam file (this is where dorado outputs the actual reads). The log messages you see are being printed to stderr, so you would need to redirect that to a separate file as well.

hagen-wende commented 2 weeks ago

Thanks for the input, I managed to get both methods to work, however they yield slightly different results (minor enough for me to ignore it)

based on @HalfPhoton suggestion I implemented this script:

#!/bin/bash

eval "$(conda shell.bash hook)"
conda activate samtools

### dx:1 bam file
duplex_length="$(samtools view ./fastq_pass_duplex_only/barcode21/duplex_reads.bam | awk '{sum += length($10)} END {print sum}')"
### dx:0 bam file
simplex_length="$(samtools view ./fastq_pass_duplex_with_simplex/barcode21/simplex_reads.bam | awk '{sum += length($10)} END {print sum}')"

eval "$(conda shell.bash hook)"
conda activate base

duplex_rate=$(awk "BEGIN {print ($duplex_length*2)/($duplex_length*2+$simplex_length)*100}")

echo "Duplex rate: $duplex_rate %"

and based on @malton-ont comment I used "2>" to redirect stderr and extract the duplex from there. I will stick to the latter method, as it is more straightforward:

    /home/vc/ont/dorado_0.4.2/bin/dorado \
    duplex \
    /home/vc/ont/rerio/dorado_models/res_dna_r10.4.1_e8.2_400bps_sup@2023-09-22_bacterial-methylation \
    --read-ids $INPUT_DIR/$barcode/all_reads.txt \
    $POD5_DIR \
    -r \
    --min-qscore 10 \
    -x "cuda:0" \
    > ./$DUPLEX_CALLS_DIR/$barcode/duplex_calls.bam \
    2> $temp_log

    # Extract the duplex rate using awk
    duplex_rate=$(awk -F ': ' '/Duplex rate/ {gsub("%",""); print $2}' $temp_log)

    # Append the formatted duplex rate information to the output file
    echo "Duplex rate for $barcode is: $duplex_rate %" >> $output_file

    # remove the temporary log file
    rm $temp_log

    echo "Duplex rate is: $duplex_rate"

Thanks for your help!