multi_to_multi_fast5

nick-youngblut commented 2 years ago

single_to_multi_fast5 can be used to reduce the number of files per sequencing run (e.g., from hundreds of thousands down to just thousands by selecting an appropriate --batch_size). If one wants to change the number of sequences per fast5 afterwards (e.g., to further reduce the total number of files), one cannot run single_to_multi_fast5 again on the multi-fast5 files with a larger --batch_size.

It would be helpful to add a script (e.g., multi_to_multi_fast5) that could alter the number of sequences per fast5 file: either by combining sequences or splitting them, depending on the total number of fast5 files that the user wants.

fbrennen commented 2 years ago

Hi @nick-youngblut -- you can do this with fast5_subset, though it will require you to give it a list containing all the read_ids you currently have (which I believe you should be able to easily generate from your call to single_to_multi_fast5). We can certainly look into allowing the read_id list from fast5_subset to be optional, at which point it will do exactly what you're after.
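For anyone following along, here is a rough Python sketch of that workaround, not an official tool. It assumes `get_fast5_file` and `get_read_ids` from ont_fast5_api behave as documented, that a plain text file with one read_id per line is accepted as the list, and that the fast5_subset flags shown match your installed version (check `fast5_subset --help`). The helper names (`collect_read_ids`, `batch_size_for`, `write_read_id_list`, `subset_command`) are made up for illustration.

```python
import math
from pathlib import Path


def collect_read_ids(fast5_dir):
    """Yield every read_id found in the multi-fast5 files under fast5_dir."""
    # Imported lazily so the other helpers work without ont_fast5_api installed.
    from ont_fast5_api.fast5_interface import get_fast5_file

    for path in sorted(Path(fast5_dir).rglob("*.fast5")):
        with get_fast5_file(str(path), mode="r") as f5:
            yield from f5.get_read_ids()


def batch_size_for(total_reads, target_files):
    """Reads per output file needed to end up with at most target_files files."""
    return max(1, math.ceil(total_reads / target_files))


def write_read_id_list(read_ids, list_path):
    """Write one read_id per line and return how many were written."""
    count = 0
    with open(list_path, "w") as handle:
        for read_id in read_ids:
            handle.write(f"{read_id}\n")
            count += 1
    return count


def subset_command(input_dir, save_dir, list_path, batch_size):
    """Build the fast5_subset argument list (flags assumed; verify with --help)."""
    return [
        "fast5_subset",
        "--input", str(input_dir),
        "--save_path", str(save_dir),
        "--read_id_list", str(list_path),
        "--batch_size", str(batch_size),
        "--recursive",
    ]
```

A typical use would be to call `write_read_id_list(collect_read_ids("multi/"), "read_ids.txt")`, pass the returned count to `batch_size_for` along with the number of files you want, and hand the result of `subset_command(...)` to `subprocess.run(..., check=True)`. Since every read is listed, this rewrites the whole dataset, so it is unlikely to be as cheap as a dedicated split/aggregate script.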

nick-youngblut commented 2 years ago

Thanks for pointing out that option. I was looking for a computationally efficient and straightforward way of changing the number of sequences per fast5 (more or fewer seqs per file) -- a split/aggregate script. I'm guessing that most people just use the now-default 4k sequences per fast5 and never want to change it, so maybe 4k-per-file is optimal for most/all situations.

nanoporetech / ont_fast5_api

multi_to_multi_fast5 #65