hail-is / hail

Cloud-native genomic dataframes and batch computing
https://hail.is
MIT License

[query] Hail should provide `hl.saige` in both QoS and QoB #13442

Open danking opened 1 year ago

danking commented 1 year ago

What happened?

SAIGE and its competitor REGENIE are the standard-bearers for modern GWAS. Hail should expose SAIGE within the Hail Query language. The interface should roughly match `hl.linear_regression_rows`.
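For context, `hl.linear_regression_rows` takes a response `y`, a per-row regressor `x`, and `covariates`, and returns a table with one row of statistics per variant. The sketch below is a toy, pure-Python analog of that shape, not Hail code: `toy_regression_rows` and its output field names are hypothetical, and it fits only an intercept plus one regressor by ordinary least squares. An `hl.saige` entry point would presumably return a similarly keyed result, with SAIGE's mixed-model test in place of the least-squares fit.

```python
from statistics import fmean
from typing import Dict, List, Sequence


def toy_regression_rows(y: Sequence[float],
                        rows: Dict[str, Sequence[float]]) -> List[dict]:
    """Toy stand-in (not Hail code): fit y ~ 1 + x once per row of x."""
    y_mean = fmean(y)
    results = []
    for key, x in rows.items():
        x_mean = fmean(x)
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
        # A row with no variation in x has no defined slope.
        beta = sxy / sxx if sxx > 0 else float("nan")
        results.append({"row": key, "n": len(x), "beta": beta})
    return results


# One phenotype value per sample; one genotype vector per variant (row).
phenotype = [1.0, 3.0, 5.0, 3.0]
genotypes = {"variant_1": [0, 1, 2, 1], "variant_2": [1, 1, 1, 1]}
table = toy_regression_rows(phenotype, genotypes)
```

The point of the shape is that the caller supplies the phenotype once and gets back one result per row, so a SAIGE-backed function could be swapped in without changing how downstream code consumes the output.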

A Batch pipeline would serve the needs of Broadies (and, indeed, such a pipeline already exists) but has two downsides:

  1. There is substantial I/O involved in exporting the data from Hail-native formats to SAIGE-compatible formats.
  2. Non-Broadies cannot use this pipeline.

Query language support for SAIGE would make SAIGE usable at scale by anyone with access to Hail, which is basically anyone with a large dataset (e.g. DNAnexus, AoU RWB, MVP, FinnGen).

There are two options:

  1. Determine and implement the linear algebraic primitives necessary for SAIGE.
  2. Compile and link directly against SAIGE. Expose these functions, via JNI, to the Hail Query language.
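As one illustration of what option (1) might involve: the SAIGE paper describes solving its linear systems with a preconditioned conjugate gradient method rather than factorizing the genetic relationship matrix. The sketch below is a plain, unpreconditioned conjugate gradient solver in pure Python that needs only a matrix-vector product. The names `conjugate_gradient`, `matvec`, and `dot` are illustrative and do not come from Hail or SAIGE; which primitives Hail would actually need is an open question in this issue.

```python
from typing import Callable, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec: Callable[[Vector], Vector],
                       b: Vector,
                       tol: float = 1e-10,
                       max_iter: int = 1000) -> Vector:
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x starting at zero
    p = list(r)          # current search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small symmetric positive-definite system used only as a demonstration.
A = [[4.0, 1.0], [1.0, 3.0]]
solution = conjugate_gradient(lambda v: [dot(row, v) for row in A], [1.0, 2.0])
```

Because the solver touches the matrix only through `matvec`, the matrix never has to be materialized, which is the property that matters when it is as large as a sample-by-sample relationship matrix.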

Version

0.2.120

danking commented 1 year ago

If we take option (2), this project can serve as a prototype C or C++ interface to Hail. That interface could take multiple forms. For example, we could re-implement our memory representations in C++ and compile SAIGE at Hail Query compile time (i.e. when we compile a user's query) against whatever SType/PType Hail has decided is ideal.

A simpler approach is to write one canonical C++ implementation of the Hail types, fork and slightly modify SAIGE to accept these memory representations, compile SAIGE against them at Java compile time (i.e. in CI or when you run make on your laptop), ship the compiled library with the Hail JAR, and expose it to the Hail Query language via JNI. This requires that the Query compiler be able to call a function whose arguments support only one particular SType/PType.

danking commented 1 year ago

Types should be a non-issue because SAIGE is likely using primitive types or arrays thereof. We already pass such types to native code when we call LAPACK.

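A major issue: SAIGE is licensed under the GPL, while Hail is MIT-licensed. Linking against SAIGE and shipping it with the Hail JAR would subject the combined distribution to the GPL's terms, which conflicts with Hail's MIT licensing.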

jigold commented 7 months ago

See #13804 for an incomplete prototype of `hailtop.saige` implemented on top of Batch.