hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is

[query] Consider adding import/export support for HDF5 #14311

Open patrick-schultz opened 8 months ago

patrick-schultz commented 8 months ago

HDF5 could be a natural file format for matrix tables, especially block-partitioned and/or higher-dimensional generalizations. More near term, HDF5 is used for large single-cell data, and adding import support would let Hail be used on that data directly. https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/Convert.20matrix.20in.20hdf5.20to.20hail.20MatrixTable/near/421945261

danking commented 8 months ago

HDF5 "files" are usually literally a single file. While fine for traditional file systems, this is not a good fit for object stores like GCS and S3. Object stores tend to scale horizontally providing high aggregate bandwidth across many individual objects.

There appear to be some efforts to let HDF5 read from and write to object stores in an object-store-friendly manner. In particular, there is a GCS connector. It's not an object store, but there is also support for Hadoop HDFS. There is also the Virtual Object Layer (VOL), which appears to be a file-system abstraction that would permit storing HDF5 "files" across multiple objects, which plays well with cloud object-store scaling.

We should prioritize an importer, because no one has asked for HDF5 export, nor is it clear that the HDF5 client libraries make it easy to write a single HDF5 "file" from a cluster of cores separated by a network.

An importer would look something like MatrixVCFReader. It will need to use an HDF5 Java client library. An HDF5 client API is described here, but they don't link to any JARs or Maven repositories. This support thread from 2022 appears to ultimately conclude that netcdf-java supports reading HDF5 files. Including netcdf-java in a Gradle or Maven project is described here; a minimal smoke test using it is sketched below.
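
As a rough sketch (not a definitive recipe), something like the following should be enough to verify that netcdf-java can open an HDF5 file and enumerate its datasets. I believe the relevant artifact is edu.ucar:cdm-core published in Unidata's artifact repository, but the exact coordinates and repository URL should be taken from the netcdf-java docs linked above.

```scala
// Sketch: open an HDF5 file with netcdf-java and list its datasets.
// Assumes edu.ucar:cdm-core (netcdf-java 5.x) is on the classpath.
import ucar.nc2.{NetcdfFile, NetcdfFiles}
import scala.collection.JavaConverters._

object Hdf5SmokeTest {
  def main(args: Array[String]): Unit = {
    // netcdf-java detects the on-disk format from the file signature, so an
    // .h5/.hdf5 file opens through the same entry point as a netCDF file.
    val ncfile: NetcdfFile = NetcdfFiles.open(args(0))
    try {
      for (v <- ncfile.getVariables.asScala)
        println(s"${v.getFullName}: dtype=${v.getDataType}, shape=${v.getShape.mkString("[", ",", "]")}")
    } finally ncfile.close()
  }
}
```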

It is not entirely clear how to use netcdf-java to access objects in Google Cloud Storage or Azure Blob Storage. There's an open issue to support S3.


OK, so, this is roughly what I'd do:

Driver side:

  1. Get the schema, cook up a corresponding Hail type.
  2. Choose a partitioning of the index space (a sketch of both driver-side steps follows this list).
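
A rough driver-side sketch under a few assumptions: netcdf-java (edu.ucar:cdm-core) is the HDF5 reader, Hail's virtual types live in is.hail.types.virtual, the leading dimension of each dataset is the shared row dimension, and the dtype-to-Hail-type mapping below is just one plausible choice, not Hail's canonical one.

```scala
// Sketch: inspect the HDF5 schema, build a Hail row type, and split the
// leading (row) dimension into contiguous partitions.
import ucar.ma2.DataType
import ucar.nc2.{NetcdfFiles, Variable}
import is.hail.types.virtual._
import scala.collection.JavaConverters._

object Hdf5DriverSketch {
  // One plausible HDF5 dtype -> Hail virtual type mapping (an assumption).
  def hailType(dt: DataType): Type = dt match {
    case DataType.BYTE | DataType.SHORT | DataType.INT => TInt32
    case DataType.LONG                                 => TInt64
    case DataType.FLOAT                                => TFloat32
    case DataType.DOUBLE                               => TFloat64
    case DataType.CHAR | DataType.STRING               => TString
    case other => throw new UnsupportedOperationException(s"unsupported HDF5 dtype: $other")
  }

  // One (start, end) row range per partition over the leading dimension.
  def partitionRanges(nRows: Long, nPartitions: Int): IndexedSeq[(Long, Long)] =
    (0 until nPartitions).map { i =>
      (i * nRows / nPartitions, (i + 1) * nRows / nPartitions)
    }

  def main(args: Array[String]): Unit = {
    val ncfile = NetcdfFiles.open(args(0))
    try {
      val vars: Seq[Variable] = ncfile.getVariables.asScala.toSeq
      // Each dataset becomes a field of the row type; rank > 1 datasets would
      // really want TNDArray or an entries field, elided here.
      val rowType = TStruct(vars.map(v => v.getShortName -> hailType(v.getDataType)): _*)
      val nRows = vars.head.getShape()(0).toLong
      println(rowType)
      println(partitionRanges(nRows, nPartitions = 8))
    } finally ncfile.close()
  }
}
```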

Worker side:

  1. Read the same slice of each field/column based on the partition information.
  2. Construct a Hail SType/PType. See GVCFPartitionReader for an example; that class is misnamed: it's just a VCF partition reader, not specific to GVCFs. A rough sketch of the slice read follows.
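
A minimal worker-side sketch, again assuming netcdf-java and that every dataset's leading dimension is the row dimension. It only materializes the raw netcdf-java arrays for one partition; constructing the actual Hail SType/PType values (as GVCFPartitionReader does for VCF) is elided.

```scala
// Sketch: read rows [start, end) of every dataset for one partition.
import ucar.ma2.{Array => NcArray}
import ucar.nc2.NetcdfFiles
import scala.collection.JavaConverters._

object Hdf5PartitionReaderSketch {
  def readPartition(path: String, start: Long, end: Long): Map[String, NcArray] = {
    val ncfile = NetcdfFiles.open(path)
    try {
      ncfile.getVariables.asScala.map { v =>
        val shape = v.getShape
        // Slice the leading (row) dimension, take every other dimension in full.
        val origin = Array(start.toInt) ++ Array.fill(shape.length - 1)(0)
        val count  = Array((end - start).toInt) ++ shape.drop(1)
        v.getShortName -> v.read(origin, count)
      }.toMap
    } finally ncfile.close()
  }
}
```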