google / fhir-data-pipes

A collection of tools for extracting FHIR resources and analytics services on top of that data.
https://google.github.io/fhir-data-pipes/
Apache License 2.0
154 stars 87 forks source link

Make Bunsen's SQL-on-FHIR schema consistent with GCP HCLS's #454

Closed bashir2 closed 1 year ago

bashir2 commented 1 year ago

There are currently some differences between Bunsen's SQL-on-FHIR schema and what the GCP HCLS tools create in BigQuery. We should make these two consistent.

bashir2 commented 1 year ago

Discussing with @atulai-sg, here is my suggestion for how to make decisions and going forward re. this issue:

Note that even if achieve the same schema goal (which we should, if it is not too much work), still the SQL queries against BigQuery would be different from Spark+Parquet. That is because of differences in SQL dialects (and we are working on fhir-views to address that but that is a separate issue).

jjtswan commented 1 year ago

Any updates here?

atulai-sg commented 1 year ago

I have evaluated work for https://github.com/google/fhir-data-pipes/issues/455 and it seems its doable so we might want to take that route instead of doing this.

atulai-sg commented 1 year ago

[INTERNAL]: here is the document explaining the findings and approaches taken so far in an attempt to solve this:

go/bunsen-sql-on-fhir

jjtswan commented 1 year ago

I started to investigate this a bit, and I had a couple of realizations.

TLDR: @atulai-sg - What happens if we try to take the current FHIR-data-pipes parquet output and then use the enableListInference option (see here and here)?

If this works and there's no other differences, then I think we can resolve this bug as we have a path for fhir-data-pipes to have consistent output with GCP's export to BQ.

That said, it's worth looking at the type BigQuery's Parquet type conversions to anticipate other potential differences.

(Hat tip to @omarismail94 if this actually works.)

Details --

BQ & Parquet representation of lists of composite types (e.g. groups / structs) seem to be slightly different.

This is evident from existing GCP BQ documentation: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#parquet_conversions

Case A - In BQ, it prefers to represents this as:

repeated struct <name> {...}

Case B - In Parquet, it prefers to represent this as:

optional group <name> (LIST) {
  repeated group list { ... }
}

Case C - What's interesting here is that if you were to try to translate Parquet's version of list directly into BQ format, I think you would get:

nullable struct <name> {
  repeated struct list {...}
}

Which is what we are seeing.

That said, it does seem like GCP realizes that loading parquet files into BQ as Case C is quite cumbersome and provides enableListInference as an option to auto-detect this situation and interpret this as the cleaner Case A (see here and here).

The current parquet schema output of FHIR-data-pipes looks very close to what GCP list inference looks for

Case D When I run the pipeline, I can use the parq-cli tool to get the schema. Here's the example of the patient name:

  optional group field_id=-1 name (List) {
    repeated group field_id=-1 array {
      ...
    }
  }

You'll notice that this is almost the same as Case B which GCP should be able to detect and convert to the simpler BigQuery schema if we use enableListInference option. The only difference is that the fhir-data-pipes output uses array as the name of the inner group, whereas the example in Case B uses list as the name.

So if enableListInference doesn't work the first time, try renaming it from array to list?

omarismail94 commented 1 year ago

I think we can close this now! We can get the schemas to match and upload successfully to BigQuery with the following command:

bq --project_id=PROJECT_ID \
     --autodetect    \
     --source_format=PARQUET  \
      --parquet_enable_list_inference \
       DATASET_NAME.TABLE_NAME gs://GCS_BUCKET_NAME/PARQUET_WILDCARD

The magic flag is --parquet_enable_list_inference, which uses schema inference for Parquet LIST logical types: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#list_logical_type

jjtswan commented 1 year ago

[putting on my best Bashir impression]

Definitely a lot of progress, but it looks like Atul has found some other differences in how fhir-data-pipes exports in comparison to BQ, and even what is mentioned in the SQL-on-FHIR spec. I think we should leave this open.

@atulai-sg Can you triage the differences, recommend actions (ideally with a work estimate and priority), and consider what kind of testing we might want to put into place? One easier possibility might be to provide create a gold-file test that checks if the parquet schema from the output matches a "golden version" of the schema.

bashir2 commented 1 year ago

Thanks folks for the progress on this, it seems the main issue around Parquet vs BigQuery list handling is resolved, at least in one direction (i.e., Parquet -> BQ) which is great.

Yeah, I agree that to close this issue we need to:

Once we answer the above two, we should close this. I am pretty sure the list of schema differences is larger than just that one list issue (although that was a major one).

jjtswan commented 1 year ago

FYI Bashir - There's already been some differences identified in the document. Some of them are things where our output doesn't follow the SQL-on-FHIR spec (e.g. we're not dropping "Id"s while GCP export is).

bashir2 commented 1 year ago

I think @atulai-sg has discovered some new differences, e.g., around having identifier in References. So I am reopening this to track those issues here as well.