Make Bunsen's SQL-on-FHIR schema consistent with GCP HCLS's

bashir2 commented 1 year ago

There are currently some differences between Bunsen's SQL-on-FHIR schema and what the GCP HCLS tools create in BigQuery. We should make these two consistent.

bashir2 commented 1 year ago

Discussing with @atulai-sg, here is my suggestion for how to make decisions and going forward re. this issue:

First evaluate whether it is even possible to create a Parquet schema where uploading that Parquet file to BigQuery, creates the same schema that GCP FHIR-store->BigQuery process creates. The question is specifically around repeated/nested fields.
If the answer to the above question is "yes", then evaluate how much time is needed to address all schema differences, as there seems to be many differences. Based on this evaluation decide whether to proceed or not.
At the same time, also evaluate how much time is needed to fix #455. If we can't make the Parquet schema consistent with BigQuery OR it will take similar amount of time to fixing #455, stop working on this issue and instead focus on direct export to BigQuery.

Note that even if achieve the same schema goal (which we should, if it is not too much work), still the SQL queries against BigQuery would be different from Spark+Parquet. That is because of differences in SQL dialects (and we are working on fhir-views to address that but that is a separate issue).

jjtswan commented 1 year ago

Any updates here?

atulai-sg commented 1 year ago

I have evaluated work for https://github.com/google/fhir-data-pipes/issues/455 and it seems its doable so we might want to take that route instead of doing this.

atulai-sg commented 1 year ago

[INTERNAL]: here is the document explaining the findings and approaches taken so far in an attempt to solve this:

go/bunsen-sql-on-fhir

jjtswan commented 1 year ago

I started to investigate this a bit, and I had a couple of realizations.

TLDR: @atulai-sg - What happens if we try to take the current FHIR-data-pipes parquet output and then use the enableListInference option (see here and here)?

If this works and there's no other differences, then I think we can resolve this bug as we have a path for fhir-data-pipes to have consistent output with GCP's export to BQ.

That said, it's worth looking at the type BigQuery's Parquet type conversions to anticipate other potential differences.

(Hat tip to @omarismail94 if this actually works.)

Details --

BQ & Parquet representation of lists of composite types (e.g. groups / structs) seem to be slightly different.

This is evident from existing GCP BQ documentation: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#parquet_conversions

Case A - In BQ, it prefers to represents this as:

repeated struct <name> {...}

Case B - In Parquet, it prefers to represent this as:

optional group <name> (LIST) {
  repeated group list { ... }
}

Case C - What's interesting here is that if you were to try to translate Parquet's version of list directly into BQ format, I think you would get:

nullable struct <name> {
  repeated struct list {...}
}

Which is what we are seeing.

That said, it does seem like GCP realizes that loading parquet files into BQ as Case C is quite cumbersome and provides enableListInference as an option to auto-detect this situation and interpret this as the cleaner Case A (see here and here).

The current parquet schema output of FHIR-data-pipes looks very close to what GCP list inference looks for

Case D When I run the pipeline, I can use the parq-cli tool to get the schema. Here's the example of the patient name:

  optional group field_id=-1 name (List) {
    repeated group field_id=-1 array {
      ...
    }
  }

You'll notice that this is almost the same as Case B which GCP should be able to detect and convert to the simpler BigQuery schema if we use enableListInference option. The only difference is that the fhir-data-pipes output uses array as the name of the inner group, whereas the example in Case B uses list as the name.

So if enableListInference doesn't work the first time, try renaming it from array to list?

omarismail94 commented 1 year ago

I think we can close this now! We can get the schemas to match and upload successfully to BigQuery with the following command:

bq --project_id=PROJECT_ID \
     --autodetect    \
     --source_format=PARQUET  \
      --parquet_enable_list_inference \
       DATASET_NAME.TABLE_NAME gs://GCS_BUCKET_NAME/PARQUET_WILDCARD

The magic flag is --parquet_enable_list_inference, which uses schema inference for Parquet LIST logical types: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#list_logical_type

jjtswan commented 1 year ago

[putting on my best Bashir impression]

Definitely a lot of progress, but it looks like Atul has found some other differences in how fhir-data-pipes exports in comparison to BQ, and even what is mentioned in the SQL-on-FHIR spec. I think we should leave this open.

@atulai-sg Can you triage the differences, recommend actions (ideally with a work estimate and priority), and consider what kind of testing we might want to put into place? One easier possibility might be to provide create a gold-file test that checks if the parquet schema from the output matches a "golden version" of the schema.

bashir2 commented 1 year ago

Thanks folks for the progress on this, it seems the main issue around Parquet vs BigQuery list handling is resolved, at least in one direction (i.e., Parquet -> BQ) which is great.

Yeah, I agree that to close this issue we need to:

Create a Parquet schema that when imported to BQ produces the same schema as GCP FHIR-store to BQ process; or establish this is not possible without significant efforts.
Do the same in the opposite direction, i.e., for a table generated by FHIR-store to BQ streaming process, export it to Parquet and produce the same schema as the Parquet files generated by our pipelines/Bunsen (or establish ...).

Once we answer the above two, we should close this. I am pretty sure the list of schema differences is larger than just that one list issue (although that was a major one).

jjtswan commented 1 year ago

FYI Bashir - There's already been some differences identified in the document. Some of them are things where our output doesn't follow the SQL-on-FHIR spec (e.g. we're not dropping "Id"s while GCP export is).

bashir2 commented 1 year ago

I think @atulai-sg has discovered some new differences, e.g., around having identifier in References. So I am reopening this to track those issues here as well.

google / fhir-data-pipes

Make Bunsen's SQL-on-FHIR schema consistent with GCP HCLS's #454

BQ & Parquet representation of lists of composite types (e.g. groups / structs) seem to be slightly different.

The current parquet schema output of FHIR-data-pipes looks very close to what GCP list inference looks for