hasindu2008 / slow5tools

Slow5tools is a toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.
https://hasindu2008.github.io/slow5tools
MIT License

Compress subset of reads #69

Closed mbhall88 closed 11 months ago

mbhall88 commented 2 years ago

It would be nice if there was an option to pass a list of read IDs you want compressed. Similar to the -l,--read_id_list option in fast5_subset.

hasindu2008 commented 2 years ago

This subset option is available in the slow5tools get module. slow5tools get takes a BLOW5 file and a list of read IDs as input and produces an output containing only the records specified in the list. Will this be suitable for your use case?
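For illustration, here is a minimal Python sketch of how such a call could be assembled. The file names (reads.blow5, readids.txt, subset.blow5) and both helper functions are hypothetical. The `--list` and `-o` flags are written as I understand them from the slow5tools documentation, so it is worth confirming them with `slow5tools get --help` for your installed version.

```python
import shlex
from pathlib import Path


def write_read_id_list(read_ids, list_path):
    """Write one read ID per line, dropping blanks and duplicates (order kept)."""
    unique_ids = list(dict.fromkeys(r.strip() for r in read_ids if r.strip()))
    Path(list_path).write_text("\n".join(unique_ids) + "\n")
    return unique_ids


def build_get_command(blow5_path, list_path, out_path):
    """Build the argv for `slow5tools get` with a read ID list file.

    Assumption: `--list FILE` and `-o FILE` are the relevant flags.
    """
    return ["slow5tools", "get", str(blow5_path),
            "--list", str(list_path), "-o", str(out_path)]


if __name__ == "__main__":
    # Only prints the command; nothing is executed here.
    cmd = build_get_command("reads.blow5", "readids.txt", "subset.blow5")
    print(shlex.join(cmd))
```

To actually run the command, the list returned by `build_get_command` can be passed to `subprocess.run(cmd, check=True)`, which requires slow5tools on the PATH and a real BLOW5 file.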

mbhall88 commented 2 years ago

Ah sorry, missed that. Yes and no. In an example where I have 1000 fast5 files and only want to compress, say, 20 of them, I would need to compress them all and then subset after that with get. It is a minor edge case, I realise, sorry.

I realise there are other ways of doing this, like subsetting with fast5_subset upfront and then compressing that subset. But in that example I end up duplicating the fast5 data initially and it's one extra step.

Anyway, it would be a nice feature to have in f2s, but I appreciate it isn't high on the priority list given there are a couple of ways of achieving the same result. Do you think it would be something you would be interested in adding at some point?

hasindu2008 commented 2 years ago

Hi @mbhall88

Because of the need for multiple processes to work around HDF5, the need to support both multi- and single-FAST5 files, and all the weird inconsistencies in FAST5 files, the f2s programme has become a bit complex. Also, because f2s is used for converting to SLOW5, we are being extra careful to avoid any bugs and do a massive amount of testing after each tiny modification. I will discuss with @hiruna72 whether we can add this option without major source code modification in slow5tools 0.5, the release after next.

Slow5tools 0.4 will be released soon, once I have done some integrity tests, with the option to retain directory structure as well as some performance improvements to the split subtool. Also, we are improving the real-time fast5 to slow5 conversion script and documenting it properly (https://github.com/hasindu2008/slow5tools/tree/dev/scripts/realtime-f2s), so that one can run it directly on the sequencing acquisition computer and have SLOW5 files available soon after the sequencing run. You can give it a go.

mbhall88 commented 2 years ago

Thanks @hasindu2008. I appreciate the information. Let me know when you've discussed with the team and whether you think it will be included at some future time.

hasindu2008 commented 2 years ago

As the new POD5 format is coming, we thought of putting effort into writing a converter for it. While doing that, we would be able to build this feature into the POD5->BLOW5 converter, if you are still interested. As FAST5 is being phased out by ONT, I think it is not worth putting effort into implementing this feature in the FAST5->BLOW5 converter, especially given that there are other ways to achieve it (like converting to BLOW5 and then using slow5tools get, even though that is not the most efficient way).
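The convert-then-subset workaround mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the function name and all paths are hypothetical, and the subcommand flags (`f2s -d`, `merge -o`, `get --list -o`) are written as I understand them from the slow5tools documentation, so they should be checked against `--help` before use.

```python
import shlex


def build_workaround_commands(fast5_dir, blow5_dir, merged_blow5,
                              read_id_list, subset_blow5):
    """Return the three argv lists for the convert -> merge -> subset route.

    Assumed flags: `f2s ... -d OUT_DIR`, `merge ... -o FILE`,
    `get ... --list FILE -o FILE`.
    """
    return [
        # 1. Convert every FAST5 file in the directory to BLOW5.
        ["slow5tools", "f2s", fast5_dir, "-d", blow5_dir],
        # 2. Merge the per-file BLOW5 outputs into a single BLOW5 file.
        ["slow5tools", "merge", blow5_dir, "-o", merged_blow5],
        # 3. Keep only the reads named in the read ID list.
        ["slow5tools", "get", merged_blow5,
         "--list", read_id_list, "-o", subset_blow5],
    ]


if __name__ == "__main__":
    # Only prints the commands; nothing is executed here.
    for cmd in build_workaround_commands(
            "fast5_dir", "blow5_dir", "merged.blow5",
            "readids.txt", "subset.blow5"):
        print(shlex.join(cmd))
```

Every FAST5 file is still converted in step 1, which is why this route is less efficient than filtering during conversion.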