hasindu2008 / slow5lib

slow5lib is a software library for reading & writing SLOW5 files.
https://hasindu2008.github.io/slow5lib
MIT License

[Discussion] a more general format to replace HDF5 #41

Open lh3 opened 3 years ago

lh3 commented 3 years ago

SLOW5 is specific to Nanopore data and does not generalize to other data types. If someone wants to keep large data arrays, they either need to come up with a new custom format or have to use HDF5. I wonder how hard it would be to replace HDF5 with a new binary format that keeps the major features of HDF5 but has a simpler file structure and is easier to implement. This could benefit a large community beyond Nanopore. I haven't used the HDF5 C APIs. My thinking could be naive here...

Psy-Fer commented 3 years ago

Just as SAM/BAM is specific to alignment data rather than arbitrary data, SLOW5 is specific to raw signal data. But the underlying principles are similar: global data at the top, followed by unique records. It only gets tricky when storing complex related data that isn't straightforward to put into a single line, which is where HDF5 shines, and why areas like single-cell have adopted it.

I think the HDF5 devs would work more in these specialised areas if they had the funding/support (like most things).

Psy-Fer commented 3 years ago

Thinking on this further: if the file type were generalised and the file scheme were dynamic, users could create stable, version-controlled files in which each scheme conforms to the primary/auxiliary field requirements.

hasindu2008 commented 3 years ago

@lh3

While a simple file structure may not be able to cover all the use cases of HDF5, I believe that a majority of them can be covered by a format with a simpler structure. I think so because in most cases temporal or spatial locality is present, and that is a key feature exploited by the memory hierarchy in modern computer systems.

In the SLOW5 format, the most important fields for nanopore signal analyses are kept as primary fields and the rest as auxiliary fields, just as in SAM/BAM. The auxiliary fields can appear in any order and be of any data type. A starting point for a generic format would be one with only the "auxiliary" fields.
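As a rough illustration of the "auxiliary fields only" idea, here is a minimal Python sketch of a record made entirely of self-describing fields that can appear in any order. The byte layout, the one-letter type codes and the function names `encode_record` / `decode_record` are assumptions made up for this example; they are not the SLOW5 encoding or any slow5lib API.

```python
import struct


def encode_record(fields):
    """Serialise a dict of name -> value into one self-describing byte string.

    Hypothetical layout per field (little-endian):
      2-byte name length | name | 1-byte type code | 4-byte payload length | payload
    """
    out = bytearray()
    for name, value in fields.items():
        key = name.encode("utf-8")
        if isinstance(value, int):
            code, payload = b"i", struct.pack("<q", value)   # 64-bit signed integer
        elif isinstance(value, float):
            code, payload = b"f", struct.pack("<d", value)   # 64-bit float
        elif isinstance(value, str):
            code, payload = b"s", value.encode("utf-8")      # UTF-8 string
        else:
            raise TypeError(f"unsupported type for field {name!r}")
        out += struct.pack("<H", len(key)) + key
        out += code + struct.pack("<I", len(payload)) + payload
    return bytes(out)


def decode_record(buf):
    """Parse a byte string produced by encode_record back into a dict."""
    fields = {}
    pos = 0
    while pos < len(buf):
        (key_len,) = struct.unpack_from("<H", buf, pos)
        pos += 2
        name = buf[pos:pos + key_len].decode("utf-8")
        pos += key_len
        code = buf[pos:pos + 1]
        pos += 1
        (payload_len,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        payload = buf[pos:pos + payload_len]
        pos += payload_len
        if code == b"i":
            fields[name] = struct.unpack("<q", payload)[0]
        elif code == b"f":
            fields[name] = struct.unpack("<d", payload)[0]
        elif code == b"s":
            fields[name] = payload.decode("utf-8")
        else:
            raise ValueError(f"unknown type code {code!r}")
    return fields
```

Because every field carries its own name and type, a reader needs no fixed column order, which is the property that lets one container serve different kinds of data.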

For any use case where data comes in the form of records in which a single field or a few fields act as the key (like the read ID or chr:m-n in genomics), storing all of a record's data contiguously and reading the whole record into memory at once (assuming a single record is not as big as a few hundred gigabytes) would give better performance.
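The access pattern described above can be sketched in Python as follows: each record is written as one contiguous block, and a key-to-(offset, length) index lets a reader fetch a whole record with a single seek and a single read. The names `write_records` and `fetch_record` and the in-memory index are hypothetical choices for this example, not the slow5lib index format.

```python
import io


def write_records(stream, records):
    """Write (key, payload) pairs back to back and return a key -> (offset, length) index."""
    index = {}
    for key, payload in records:
        offset = stream.tell()          # where this record starts in the stream
        stream.write(payload)           # the whole record is one contiguous block
        index[key] = (offset, len(payload))
    return index


def fetch_record(stream, index, key):
    """Fetch one record by key using a single seek and a single read."""
    offset, length = index[key]
    stream.seek(offset)
    return stream.read(length)


# Usage with an in-memory stream; a file opened in binary mode works the same way.
stream = io.BytesIO()
index = write_records(stream, [("read_1", b"signal-A"), ("read_2", b"signal-BB")])
record = fetch_record(stream, index, "read_2")
```

Keeping the index separate from the data means random access by key costs one lookup plus one contiguous read, which is the locality argument made in the paragraph above.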