nanoporetech / ont_fast5_api

Oxford Nanopore Technologies fast5 API software

'demux_fast5' (and 'fast5_subset') for single read fast5s #60

Closed AAnnan closed 2 years ago

AAnnan commented 3 years ago

Hi,

I'm a Megalodon user primarily, and since Megalodon doesn't have barcoding capabilities yet, demultiplexing needs to be performed separately first.

Originally, for single-read fast5 files, I launched a Guppy basecalling run with barcoding and fast5_out enabled, plus a little bash script to build lists of read_ids per barcode from sequencing_summary.txt and retrieve the matching single-read fast5s from the workspace folder. I then converted them back to multi-read files, both for storage and because Megalodon performs better with them. For multi-read fast5 files I did the same, only running 'multi_to_single_fast5' before the first step.
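The read-id bookkeeping that bash script does could be sketched in Python roughly as follows. This is my own illustration, not code from the thread; the column names `read_id` and `barcode_arrangement` are what Guppy's barcoding summary typically uses, but check the header of your own sequencing_summary.txt:

```python
import csv
import os
from collections import defaultdict

def read_ids_by_barcode(summary_path,
                        read_id_col="read_id",
                        barcode_col="barcode_arrangement"):
    """Group read ids by barcode from a tab-separated sequencing summary.

    Column names are assumptions -- adjust to match your summary file.
    """
    groups = defaultdict(list)
    with open(summary_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            groups[row[barcode_col]].append(row[read_id_col])
    return dict(groups)

def write_read_id_lists(groups, out_dir="."):
    """Write one read-id list file per barcode, one id per line."""
    for barcode, ids in groups.items():
        with open(os.path.join(out_dir, barcode + ".txt"), "w") as fh:
            fh.write("\n".join(ids) + "\n")
```

The per-barcode list files can then be used to pick the matching single-read fast5s out of the workspace folder.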

'demux_fast5' looks well suited for me: it combines my bash script (retrieving read_ids and barcode info) with 'single_to_multi_fast5', while saving space by extracting the barcoded fast5s from the raw data instead of the Guppy-basecalled ones (leaving out unnecessary sequence and basecalling information). However, it performs extremely poorly on single-read fast5 files. Is there a way to make it perform better on single fast5s? That would save me a run of 'single_to_multi_fast5'.

PS: Something even simpler would be to have guppy_barcoder accept raw fast5s and barcode them. Or to have Guppy demultiplex not only the fastq output but also the fast5 output when the fast5_out flag is enabled...

fbrennen commented 3 years ago

Hi @AAnnan -- in general we don't target single-read files, as they've been deprecated for a long time now, though it's certainly something we could look into. From what you describe above it's not clear to me why you'd need single-read files at all though -- why not go from multi-read to demultiplexed multi-read?

  1. Basecall and barcode with guppy (no need to turn on fast5 output).
  2. Take the summary file from that and your original multi-read files, and use demux_fast5 to split them.
AAnnan commented 3 years ago

> Hi @AAnnan -- in general we don't target single-read files, as they've been deprecated for a long time now, though it's certainly something we could look into. From what you describe above it's not clear to me why you'd need single-read files at all though -- why not go from multi-read to demultiplexed multi-read?

I don't need single-read files, it's simply how I sometimes get the data to analyse.

> 1. Basecall and barcode with guppy (no need to turn on fast5 output).

> 2. Take the summary file from that and your original multi-read files, and use `demux_fast5` to split them.

Yeah, that's exactly what I do now (disabling the fast5 output speeds it up significantly). However, sometimes my original raw fast5 files are single-read.

fbrennen commented 3 years ago

(sorry, hit the wrong button!)

fbrennen commented 3 years ago

Great, ok -- I'm glad the multi-read case is working well for you. Out of curiosity, where do you get those single-read files?

AAnnan commented 3 years ago

A 2019 dual-enzyme methylation experiment. I'm reprocessing some of this data now. It's not super important; I can always add a check for single- or multi-read files and run single_to_multi_fast5 in the case of singles.
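That check-and-convert step could be planned with a small stdlib-only sketch like the one below. This is an illustration, not code from the thread: the actual single/multi classification needs an HDF5 reader (e.g. h5py, looking for top-level `read_*` groups in multi-read files), so here it is injected as a caller-supplied predicate, and the `single_to_multi_fast5` flags shown are assumptions to be checked against `single_to_multi_fast5 --help`:

```python
import os

def plan_conversions(fast5_paths, is_single_read, save_path="multi_fast5"):
    """Split fast5 paths into ready-to-use multi-read files and single-read
    files that still need converting, and build a conversion command.

    `is_single_read` is a caller-supplied predicate (e.g. one that opens the
    file with h5py and checks for top-level 'read_*' groups).
    The command-line flags below are assumptions -- verify with --help.
    """
    singles = [p for p in fast5_paths if is_single_read(p)]
    multis = [p for p in fast5_paths if not is_single_read(p)]
    commands = []
    if singles:
        input_dir = os.path.dirname(singles[0]) or "."
        commands.append(["single_to_multi_fast5",
                         "--input_path", input_dir,
                         "--save_path", save_path,
                         "--recursive"])
    return multis, singles, commands
```

The returned command list could then be run with `subprocess.run` before handing everything to `demux_fast5`.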