rapidsai / kvikio

KvikIO - High Performance File IO
https://docs.rapids.ai/api/kvikio/stable/
Apache License 2.0

HDF5 Direct Access #296

Closed. madsbk closed this issue 3 months ago.

madsbk commented 11 months ago

As discussed in #295, we have multiple approaches to support HDF5 files. Let's look at approach (2) in more detail.

The idea is to parse the HDF5 metadata and extract contiguous data blocks, similar to how Kerchunk's HDF5 backend works, but we want to support both read and write.

Features

Limitations

It is going to be hard to support compressions that are compatible with the ones built into HDF5. Instead, KvikIO will implement its own compression layer above the HDF5 file and store compression information as HDF5 attributes.

We are going to have the same limitations as HDF5's direct write function H5Dwrite_chunk():
  • No native compression (other than the KvikIO-specific compression).
  • No filters. No datatype conversion. No endianness conversion. No user-defined functions.
  • No variable length data types. We might be able to support strings.

Implementation

For the initial implementation, we do all metadata manipulation in Python using h5py. Later, to reduce overhead and make it available in C++, we can port it to C++ and use the official HDF5 library, or perhaps a higher-level library such as HighFive or h5cpp. A rough Python sketch follows each of the two step lists below.

Write HDF5 Dataset
  1. Optionally, compress the data using nvCOMP.
  2. Use h5py to write an empty dataset using the ALLOC_TIME_EARLY option, which makes sure the data blocks within the HDF5 file are allocated immediately.
    • Note that the ALLOC_TIME_EARLY option only works when HDF5 compression is disabled.
  3. Write an HDF5 attribute that describes the compression algorithm used (if any).
  4. Use h5py to parse the HDF5 metadata.
  5. Translate the metadata into a set of data blocks (file, offset, size).
  6. Use KvikIO to write from the input buffer to the data blocks.
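
For illustration, here is a rough Python sketch of the write path above (compression in step 1 omitted). It assumes h5py's low-level API for ALLOC_TIME_EARLY and the chunk queries (which need h5py >= 3.0 and HDF5 >= 1.10.5) plus KvikIO's CuFile.pwrite; the function name write_hdf5_direct and the kvikio_compression attribute are made up for the example, and a single chunk is used to keep the buffer arithmetic trivial.

```python
import cupy
import h5py
import kvikio


def write_hdf5_direct(path, name, data):
    """Write a device buffer into an HDF5 dataset via KvikIO (sketch)."""
    with h5py.File(path, "w") as f:
        # Step 2: create an empty chunked dataset with early allocation so the
        # data blocks exist in the file before we start writing.
        space = h5py.h5s.create_simple(data.shape)
        dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
        dcpl.set_chunk(data.shape)  # one chunk == whole dataset, for simplicity
        dcpl.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)
        dset_id = h5py.h5d.create(
            f.id, name.encode(), h5py.h5t.py_create(data.dtype), space, dcpl
        )

        # Step 3: record the (hypothetical) compression scheme as an attribute.
        h5py.Dataset(dset_id).attrs["kvikio_compression"] = "none"

        # Steps 4-5: translate the HDF5 metadata into (offset, size) blocks.
        blocks = [
            (info.byte_offset, info.size)
            for info in (
                dset_id.get_chunk_info(i) for i in range(dset_id.get_num_chunks())
            )
        ]

    # Step 6: write the device buffer directly into the data blocks with KvikIO.
    with kvikio.CuFile(path, "r+") as fh:
        offset, size = blocks[0]  # single chunk in this sketch
        fh.pwrite(data, size, file_offset=offset).get()
```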
Read HDF5 Dataset
  1. Use h5py to parse the HDF5 metadata.
  2. Read HDF5 attributes to determine the decompression algorithm.
  3. Translate the metadata into a set of data blocks (file, offset, size).
  4. Use KvikIO to read the data blocks into the output buffer (device or host memory).
    • Optionally, decompress the data on-the-fly using nvCOMP.
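
A matching sketch of the read path, under the same assumptions as the write sketch above (single-chunk layout, hypothetical kvikio_compression attribute, nvCOMP decompression left out):

```python
import cupy
import h5py
import kvikio


def read_hdf5_direct(path, name):
    """Read an HDF5 dataset straight into device memory via KvikIO (sketch)."""
    # Steps 1-3: parse the HDF5 metadata with h5py, read the compression
    # attribute, and translate the metadata into (offset, size) data blocks.
    with h5py.File(path, "r") as f:
        dset = f[name]
        compression = dset.attrs.get("kvikio_compression", "none")
        shape, dtype = dset.shape, dset.dtype
        blocks = [
            dset.id.get_chunk_info(i) for i in range(dset.id.get_num_chunks())
        ]

    # Step 4: read the data block directly into device (or host) memory.
    out = cupy.empty(shape, dtype=dtype)
    with kvikio.CuFile(path, "r") as fh:
        (info,) = blocks  # single chunk in this sketch
        fh.pread(out, info.size, file_offset=info.byte_offset).get()

    # Optionally, decompress on the fly with nvCOMP, based on `compression`.
    return out
```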
manopapad commented 11 months ago

Instead, KvikIO will implement its own compression layer above the HDF5 file and store compression information as HDF5 attributes.

Would this be a "custom" extension? I.e. a third-party app that receives a kvikio-compressed HDF5 file wouldn't know what the attributes mean, therefore it wouldn't know that the file is compressed in some way?

No native compression (other than the KvikIO-specific compression)

Reading the paper you linked, it appears that when using H5Dwrite_chunk you can still record in the HDF5 metadata that each chunk is compressed, and it's just your responsibility to make sure the data you write is already compressed (and you must use gzip). In that case any consumer of the file would know how the data is compressed.
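
For illustration, a minimal h5py sketch of that pattern: the dataset's metadata declares gzip (deflate) compression, and we hand HDF5 already zlib-compressed bytes through the direct chunk write path (h5py's write_direct_chunk wraps H5Dwrite_chunk), so any standard HDF5 reader can decompress the file.

```python
import zlib

import h5py
import numpy as np

data = np.arange(256, dtype="float32")

with h5py.File("gzip_chunks.h5", "w") as f:
    # The chunks are recorded as gzip-compressed in the HDF5 metadata.
    dset = f.create_dataset(
        "x", shape=(4, 256), chunks=(1, 256), dtype="float32", compression="gzip"
    )
    for i in range(4):
        # We supply pre-compressed bytes; HDF5 stores them verbatim.
        dset.id.write_direct_chunk((i, 0), zlib.compress(data.tobytes(), 6))

# Any consumer decompresses transparently through the regular filter pipeline.
with h5py.File("gzip_chunks.h5", "r") as f:
    assert np.allclose(f["x"][2], data)
```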

No filters. No datatype conversion. No endianness conversion. No user-defined functions.

In Legate land we can live without these for now. We can hope that, in the future, the Legate core will be able to detect that a data-parallel load can be fused with an element-wise conversion that comes after it, and thus get the same performance as if the I/O library provided these transformations internally.

No variable length data types. We might be able to support strings.

As far as the in-memory representation goes, Legate would store, say, a string array using two stores, one for the character data and one for the offsets. The partitioning between the two would be consistent (e.g. all the offsets within chunk 5 of the "offsets" store would point to characters within chunk 5 of the "characters" store). Saving this directly as two datasets in a chunked HDF5 file sounds problematic, because it is quite unlikely that all the chunks of each store would have the same size. Perhaps padding could be added to make all sizes uniform.

I couldn't easily find out how HDF5 handles chunked variable-length arrays internally.

madsbk commented 11 months ago

Would this be a "custom" extension? I.e. a third-party app that receives a kvikio-compressed HDF5 file wouldn't know what the attributes mean, therefore it wouldn't know that the file is compressed in some way?

Yes, a third-party wouldn't be able to read the kvikio-compressed HDF5 file.

Reading the paper you linked, it appears that when using H5Dwrite_chunk you can still record in the HDF5 metadata that each chunk is compressed, and it's just your responsibility to make sure the data you write is already compressed (and you must use gzip). In that case any consumer of the file would know how the data is compressed.

Right, using H5Dwrite_chunk would work but we wouldn't get the performance of KvikIO/GDS since HDF5 itself would do the writing. Also, H5Dwrite_chunk isn't thread-safe.

Saving this directly as two datasets to a chunked HDF5 file sounds problematic, because it is quite unlikely that all the chunks on each store would have the same size. Perhaps there could be added padding to make all sizes uniform.

Agreed, but note that we are not bound by the chunks in HDF5. E.g., Legate tasks can read multiple chunks or even partial chunks. The advantage of extracting all data block offsets beforehand is that we can access the data blocks in any way we like, including changing the decomposition on-the-fly.
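
As a toy illustration of that flexibility, here is a hypothetical helper (not KvikIO API) that maps an arbitrary 1-D element slice onto byte ranges inside the extracted data blocks, assuming uncompressed chunks stored in logical order; a reader can then fetch exactly those ranges, regardless of the chunk decomposition used by the writer.

```python
def ranges_for_slice(blocks, itemsize, start, stop):
    """Map elements [start, stop) onto (file_offset, nbytes) ranges.

    ``blocks`` is the list of (byte_offset, nbytes) data blocks of a 1-D
    dataset, in logical order; partial chunks are handled naturally.
    """
    ranges, chunk_start = [], 0
    for byte_offset, nbytes in blocks:
        chunk_len = nbytes // itemsize
        lo = max(start, chunk_start)
        hi = min(stop, chunk_start + chunk_len)
        if lo < hi:  # the slice overlaps this chunk (possibly only partially)
            ranges.append(
                (byte_offset + (lo - chunk_start) * itemsize, (hi - lo) * itemsize)
            )
        chunk_start += chunk_len
    return ranges
```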

manopapad commented 11 months ago

Right, using H5Dwrite_chunk would work but we wouldn't get the performance of KvikIO/GDS since HDF5 itself would do the writing. Also, H5Dwrite_chunk isn't thread-safe.

Is it possible to preallocate a chunked dataset, then query HDF5 for the file name, file offset and extent corresponding to each chunk? If we can do that, then we could potentially record in the metadata that each chunk will be zlib-compressed, then copy each chunk in its entirety (already compressed) from the framebuffer to disk using GDS (without having to go through H5Dwrite_chunk).

madsbk commented 11 months ago

Is it possible to preallocate a chunked dataset, then query HDF5 for the file name, file offset and extent corresponding to each chunk?

Yes, this is exactly what I mean in step 5: Translate the metadata into a set of data blocks (file, offset, size).

If we can do that, then we could potentially record in the metadata that each chunk will be zlib-compressed, then copy each chunk in its entirety (already compressed) from the framebuffer to disk using GDS (without having to go through H5Dwrite_chunk).

Good point. However, modifying the metadata might be tricky and hard to maintain, but it is definitely a possibility!

manopapad commented 11 months ago

Yes, this is exactly what I mean in step 5: Translate the metadata into a set of data blocks (file, offset, size).

Yup, my bad, I didn't read that carefully.

Good point. However, modifying the metadata might be tricky and hard to maintain, but it is definitely a possibility!

Agreed, that is definitely a risk, but it also gives us the best chance of interoperating with downstream/upstream apps that may be reading/writing their HDF5 files outside of KvikIO.

tell-rebanta commented 11 months ago

Correct me if I am wrong: step 1 (for READ) and step 4 (for WRITE) would essentially be single-threaded, and once the metadata is available we can perform multi-threaded I/O through KvikIO/GDS. What I am missing here is, when multiple processes want to access the same file, how will the HDF5 metadata be synchronized among them to give a consistent view?

manopapad commented 11 months ago

We can ask Legate to run only one copy of the task which processes the HDF5 metadata (i.e. not replicated across the cluster), then broadcast the results to the other processes. Then all the processes can do the actual reads and writes in parallel.
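
Purely as an illustration of that flow, and not how Legate would actually express it, here is a sketch that uses mpi4py as a stand-in for Legate's machinery: rank 0 parses the HDF5 metadata with h5py, broadcasts the (offset, size) blocks, and every rank then reads its share of the blocks in parallel with KvikIO (reassembly of the buffers is left out).

```python
from mpi4py import MPI

import cupy
import h5py
import kvikio

comm = MPI.COMM_WORLD


def parallel_read(path, name):
    meta = None
    if comm.rank == 0:
        # Only one process touches the HDF5 metadata...
        with h5py.File(path, "r") as f:
            dset = f[name]
            meta = {
                "shape": dset.shape,
                "dtype": dset.dtype.str,
                "blocks": [
                    (info.byte_offset, info.size)
                    for info in (
                        dset.id.get_chunk_info(i)
                        for i in range(dset.id.get_num_chunks())
                    )
                ],
            }
    # ...and broadcasts the result to everyone else.
    meta = comm.bcast(meta, root=0)

    # All processes then read their share of the data blocks in parallel.
    bufs = []
    with kvikio.CuFile(path, "r") as fh:
        for offset, size in meta["blocks"][comm.rank::comm.size]:
            buf = cupy.empty(size, dtype=cupy.uint8)
            fh.pread(buf, size, file_offset=offset).get()
            bufs.append(buf)
    return bufs
```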

tell-rebanta commented 11 months ago

Depending upon the outcome of the actual I/O, especially in case of an error, we may once again need to consolidate the metadata, right?

manopapad commented 11 months ago

That's a good point. We could do a similar singleton task launch that updates the metadata based on the status reports of the parallel readers/writers, but I guess that depends on what error cases the workers might encounter.

qkoziol commented 7 months ago

There are a few clarifications that may help:

akshaysubr commented 5 months ago

It is going to be hard to support compressions that are compatible with the ones built into HDF5.

This is not necessarily true. It would be if we used the nvCOMP high-level API, which would be the most natural fit for custom HDF5 filters. But the nvCOMP low-level APIs are fully compatible with the standard stream formats (including gzip). They are harder to integrate into HDF5, though, since they require batched decompression. @qkoziol mentioned that this might be possible :)