Inquiry regarding demultiplexing of FAST5/pod5 files using Dorado

nanoporetech / dorado

Oxford Nanopore's Basecaller

https://nanoporetech.com/

Inquiry regarding demultiplexing of FAST5/pod5 files using Dorado #834

Open mdzohorulislam opened 1 month ago

mdzohorulislam commented 1 month ago

Hi there, I recently conducted direct cDNA sequencing using a barcoded library, which generated FAST5 files (later converted to pod5). These files, produced from a single sequencing run, are stored in one directory. I've successfully used Dorado to basecall the reads and demultiplex them into FASTQ files with the following command:

dorado basecaller -x "cuda:all" --min-qscore 7 --no-trim --emit-fastq "$dorado_model" "$input_data" | dorado demux --kit-name EXP-NBD114 --emit-fastq --output-dir "$output_dir"

However, to proceed with a nanopolish polyA pipeline, I also require demultiplexed FAST5 files. This will enable me to utilize both the FAST5 and FASTQ files within the nanopolish polyA pipeline. Could anyone provide insight into whether it's feasible to demultiplex the FAST5 files using Dorado? Any assistance or guidance on this matter would be greatly appreciated. Thank you in advance for your assistance.

HalfPhoton commented 1 month ago

Hi @mdzohorulislam - if your POD5 files are already demultiplexed, you could try converting them to FAST5 with pod5 convert to_fast5.

If not, you could get the read_ids from each of your output FASTQ files, use pod5 subset with the read ids and barcode mapping, and then convert to FAST5.

Something like this - completely unchecked but to demonstrate the workflow.

echo "read_id,barcode" > mapping.csv
for FASTQ in $(find . -iname "*.fastq");
do
  # Barcode label = FASTQ file name without directory or extension
  BARCODE=$(basename "${FASTQ}" .fastq)
  # Get the read ids from the fastq (IDs only, no description)
  seqkit seq -n -i "${FASTQ}" > "${FASTQ}.reads.txt"
  # Write lines read_id,barcode
  awk -v bc="${BARCODE}" '{print $0 "," bc}' "${FASTQ}.reads.txt" >> mapping.csv
done

# Create separate pod5 files for each barcode
# (untested; option names vary between pod5 versions, so check pod5 subset --help)
pod5 subset data/ --csv mapping.csv --output output_pod5s
# Convert the pod5 files (maybe one at a time so they don't write to one file)
pod5 convert to_fast5 output_pod5s
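
If seqkit is not available, the read_id to barcode mapping can also be sketched with plain awk. This is only an illustration under a few assumptions: the FASTQ files are unwrapped (four lines per record), each file is named after its barcode, and the function names fastq_read_ids and build_mapping are made up for this example.

```shell
#!/usr/bin/env bash

# Print the read id of every record in a 4-line-per-record FASTQ file.
# The id is the first whitespace-delimited token of the header, minus "@".
fastq_read_ids() {
  awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' "$1"
}

# Write "read_id,barcode" rows for every *.fastq file in a directory.
# Assumption: the barcode label is the file name without ".fastq".
build_mapping() {
  local dir="$1" fq
  echo "read_id,barcode"
  for fq in "$dir"/*.fastq; do
    [ -e "$fq" ] || continue   # skip if the glob matched nothing
    fastq_read_ids "$fq" | awk -v bc="$(basename "$fq" .fastq)" '{ print $0 "," bc }'
  done
}
```

Running build_mapping "$output_dir" > mapping.csv would then give a table to try with pod5 subset. It is worth checking that the number of rows matches the total read count of the FASTQ files before subsetting.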

tijyojwad commented 1 month ago

Hi @mdzohorulislam - dorado also supports polyA tail estimation using the --estimate-poly-a option when basecalling. Have you given that a try?

https://github.com/nanoporetech/dorado?tab=readme-ov-file#polya-tail-estimation
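
According to the README section linked above, the estimated tail length is stored in the pt:i tag of each output record. A minimal sketch for pulling those values into a table, assuming SAM text on stdin; the function name polya_lengths and the file names in the usage comment are hypothetical:

```shell
# Print "read_id<TAB>polyA_length" for SAM records read from stdin
# that carry a pt:i tag; records without the tag are skipped.
polya_lengths() {
  awk -F '\t' '!/^@/ {
    # Optional tags start at column 12 of a SAM record
    for (i = 12; i <= NF; i++)
      if ($i ~ /^pt:i:/) { print $1 "\t" substr($i, 6) }
  }'
}

# Usage (hypothetical file names; assumes samtools is on PATH):
#   samtools view calls.bam | polya_lengths > polya_lengths.tsv
```

A table like this could be compared per read against nanopolish polya output to see how closely the two estimates agree.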

mdzohorulislam commented 1 month ago

Thank you so much, Richard @HalfPhoton, for your valuable input. It sounds like a great solution; I will try it and update here if it works. Thanks, Joyjit @tijyojwad. Yes, the dorado polyA estimation works perfectly. However, to keep things consistent with my previous analysis, I wanted to run it through nanopolish.