google / fhir-data-pipes

A collection of tools for extracting FHIR resources and analytics services on top of that data.
https://google.github.io/fhir-data-pipes/
Apache License 2.0

Is it possible for the `fhir-data-pipes` to sink directly into a Data Warehouse e.g. Google BigQuery? #1191

Open muhammad-levi opened 1 week ago

muhammad-levi commented 1 week ago

Instead of: fhir-data-pipes -> Google Healthcare API FHIR Store -> Google BigQuery

It would be: fhir-data-pipes -> Google BigQuery

As also suggested in this diagram (image), where "Data Loaders" includes fhir-data-pipes.

bashir2 commented 1 week ago

Actually this feature is the long-standing issue #455, i.e., adding BigQuery as a sink option. It should not be too hard to add, and I think it is a useful feature. The main reason we have not implemented it yet is that we have not heard much demand for it from our partners. If this is a useful feature for you and you can contribute to implementing it, I am willing to help.

Side note 1: We have actually done some work in #454 to make the resulting schema similar to the BigQuery schema produced by the GCP FHIR store -> BigQuery flow.

Side note 2: You can import Parquet files into BigQuery; that's how the comparisons in #454 were done.
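
For anyone trying that route, a minimal sketch of loading the pipeline's Parquet output into BigQuery with the google-cloud-bigquery Java client could look like the following; the bucket, dataset, and table names are placeholders, not paths from this repo:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class LoadParquetToBigQuery {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder destination table and GCS path to the pipeline's Parquet output.
    TableId tableId = TableId.of("my_dataset", "Patient");
    String sourceUri = "gs://my-bucket/dwh/Patient/*.parquet";

    // Load job that lets BigQuery infer the table schema from the Parquet files.
    LoadJobConfiguration loadConfig =
        LoadJobConfiguration.newBuilder(tableId, sourceUri)
            .setFormatOptions(FormatOptions.parquet())
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
            .build();

    Job job = bigquery.create(JobInfo.of(loadConfig)).waitFor();
    if (job.getStatus().getError() != null) {
      throw new RuntimeException("Load failed: " + job.getStatus().getError());
    }
  }
}
```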

muhammad-levi commented 1 week ago

@bashir2 I see. Initially I was thinking of using the JDBC driver for BigQuery, creating a sample JDBC URL config for BigQuery in DatabaseConfiguration https://github.com/google/fhir-data-pipes/blob/dc70755848b2ea83390a2699ed05ed6088875eec/pipelines/common/src/main/java/com/google/fhir/analytics/model/DatabaseConfiguration.java#L58-L62

and then making use of the sinkDbConfigPath config property (see the JDBC sketch below). https://github.com/google/fhir-data-pipes/blob/dc70755848b2ea83390a2699ed05ed6088875eec/pipelines/controller/config/application.yaml#L168-L173
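
For reference, a standalone JDBC connection to BigQuery with the Simba-based driver distributed by Google looks roughly like the sketch below. The driver class name, URL properties (OAuthType, DefaultDataset), and project/dataset names are assumptions to verify against the driver's documentation; this is not something the pipeline supports today:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BigQueryJdbcSketch {
  public static void main(String[] args) throws Exception {
    // Assumed Simba BigQuery JDBC driver class and URL format; OAuthType=3
    // is intended to pick up Application Default Credentials.
    Class.forName("com.simba.googlebigquery.jdbc42.Driver");
    String url =
        "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;"
            + "ProjectId=my-project;OAuthType=3;DefaultDataset=my_dataset;";

    try (Connection conn = DriverManager.getConnection(url);
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT 1")) {
      while (rs.next()) {
        System.out.println(rs.getInt(1));
      }
    }
  }
}
```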


bashir2 commented 1 week ago

@muhammad-levi your JDBC-based idea can work, but since we use Beam for our pipeline, I would first consider BigQueryIO; it is usually better to rely on Beam IOs when possible. That said, there are reasons not to use them; for example, in some places we don't use ParquetIO for creating Parquet files (mostly because of Flink's memory overhead in single-machine mode).
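
For context, a Beam-native write with BigQueryIO looks roughly like the sketch below. The project, dataset, table, and column names are placeholders, and the Create.of input is a stand-in; a real implementation would plug into the pipeline's existing transforms rather than building toy rows:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class BigQueryIoSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Toy PCollection standing in for rows derived from FHIR resources.
    PCollection<TableRow> rows =
        pipeline.apply(
            Create.of(new TableRow().set("id", "patient-1").set("gender", "female"))
                .withCoder(TableRowJsonCoder.of()));

    // Explicit schema for the destination table (placeholder columns).
    TableSchema schema =
        new TableSchema()
            .setFields(
                List.of(
                    new TableFieldSchema().setName("id").setType("STRING"),
                    new TableFieldSchema().setName("gender").setType("STRING")));

    rows.apply(
        BigQueryIO.writeTableRows()
            .to("my-project:fhir_dataset.Patient") // placeholder table spec
            .withSchema(schema)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    pipeline.run().waitUntilFinish();
  }
}
```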