Open nick-youngblut opened 2 years ago
Hi @nick-youngblut -- you can do this with fast5_subset, though it will require you to give it a list containing all the read_ids you currently have (which I believe you should be able to easily generate from your call to single_to_multi_fast5). We can certainly look into allowing the read_id list from fast5_subset to be optional, at which point it will do exactly what you're after.
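For anyone trying this workaround, here is a minimal sketch of building that read_id list. It assumes ont_fast5_api is installed and that get_fast5_file and get_read_ids behave as they do in the versions I have seen; the helper names collect_read_ids and write_read_id_list are made up for illustration and are not part of the package.

```python
from pathlib import Path


def collect_read_ids(fast5_dir):
    """Gather read IDs from every .fast5 file under fast5_dir.

    Assumption: ont_fast5_api is installed and exposes get_fast5_file.
    """
    # Imported lazily so the list-writing helper below works without the package.
    from ont_fast5_api.fast5_interface import get_fast5_file

    read_ids = []
    for path in sorted(Path(fast5_dir).rglob("*.fast5")):
        with get_fast5_file(str(path), mode="r") as f5:
            read_ids.extend(f5.get_read_ids())
    return read_ids


def write_read_id_list(read_ids, out_path):
    """Write one read ID per line, dropping duplicates but keeping first-seen order."""
    unique_ids = list(dict.fromkeys(read_ids))
    Path(out_path).write_text("".join(f"{rid}\n" for rid in unique_ids))
    return len(unique_ids)
```

With the list written to, say, read_ids.txt, a call along the lines of fast5_subset --input <multi_fast5_dir> --save_path <out_dir> --read_id_list read_ids.txt --batch_size 8000 should rewrite the same reads at the new batch size. The paths and the 8000 are placeholders, and this assumes fast5_subset accepts a plain one-ID-per-line file.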
Thanks for pointing out that option. I was looking for a computationally efficient and straightforward way of changing the number of sequences per fast5 (more or fewer seqs per file) -- a split/aggregate script. I'm guessing that most users just keep the now-default 4k sequences per fast5 and never want to change it, so maybe 4k-per-file is optimal for most/all situations.
single_to_multi_fast5 can be used to reduce the number of files per sequencing run (e.g., hundreds of thousands down to just thousands, by selecting an appropriate --batch_size). If one wants to change the number of sequences per fast5 (e.g., to further reduce the total number of files), one cannot use single_to_multi_fast5 again on the multi-fast5 files with a larger --batch_size. It would be helpful to add a script (e.g., multi_to_multi_fast5) that could alter the number of sequences per fast5 file: either by combining sequences or splitting them, depending on the total number of fast5 files that the user wants.
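To make the request concrete, here is a rough sketch of the regrouping logic such a script would need. It is only the planning step over read IDs (no fast5 I/O), and rebatch and batch_size_for_file_count are hypothetical names, not existing ont_fast5_api functions.

```python
from itertools import islice


def rebatch(read_ids, batch_size):
    """Yield lists of at most batch_size read IDs, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(read_ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def batch_size_for_file_count(total_reads, target_files):
    """Smallest batch size that fits total_reads into at most target_files files."""
    if target_files < 1:
        raise ValueError("target_files must be at least 1")
    # Ceiling division; never return less than one read per file.
    return max(1, -(-total_reads // target_files))
```

The same loop covers both directions: a batch size larger than the current one aggregates reads into fewer files, and a smaller one splits them into more. Deriving the batch size from a target file count matches the "total number of fast5 files that the user wants" framing above.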