Closed bashir2 closed 1 year ago
After discussing with @atulai-sg, here is my suggestion for how to make decisions and move forward on this issue:
First, evaluate whether it is even possible to create a Parquet schema such that uploading that Parquet file to BigQuery creates the same schema that the GCP FHIR-store->BigQuery process creates. The question is specifically about repeated/nested fields.
If the answer to the above question is "yes", then evaluate how much time is needed to address all schema differences, as there seem to be many. Based on this evaluation, decide whether to proceed or not.
At the same time, also evaluate how much time is needed to fix #455. If we can't make the Parquet schema consistent with BigQuery, OR it will take a similar amount of time to fixing #455, stop working on this issue and instead focus on direct export to BigQuery.
Note that even if we achieve the same-schema goal (which we should, if it is not too much work), the SQL queries against BigQuery would still be different from Spark+Parquet. That is because of differences in SQL dialects (we are working on fhir-views to address that, but that is a separate issue).
Any updates here?
I have evaluated the work for https://github.com/google/fhir-data-pipes/issues/455 and it seems doable, so we might want to take that route instead of doing this.
[INTERNAL]: here is the document explaining the findings and approaches taken so far in an attempt to solve this:
I started to investigate this a bit, and I had a couple of realizations.
TLDR: @atulai-sg - What happens if we try to take the current FHIR-data-pipes parquet output and then use the enableListInference option (see here and here)?
If this works and there are no other differences, then I think we can resolve this bug, as we have a path for fhir-data-pipes to have consistent output with GCP's export to BQ.
That said, it's worth looking at BigQuery's Parquet type conversions to anticipate other potential differences.
(Hat tip to @omarismail94 if this actually works.)
Details --
This is evident from existing GCP BQ documentation: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#parquet_conversions
Case A - BQ prefers to represent this as:
repeated struct <name> {...}
Case B - Parquet prefers to represent this as:
optional group <name> (LIST) {
repeated group list { ... }
}
Case C - What's interesting here is that if you were to try to translate Parquet's version of list directly into BQ format, I think you would get:
nullable struct <name> {
repeated struct list {...}
}
Which is what we are seeing.
That said, it does seem like GCP realizes that loading parquet files into BQ as Case C is quite cumbersome and provides enableListInference as an option to auto-detect this situation and interpret this as the cleaner Case A (see here and here).
Case D - When I run the pipeline, I can use the parq-cli tool to get the schema. Here's the example of the patient name:
optional group field_id=-1 name (List) {
repeated group field_id=-1 array {
...
}
}
You'll notice that this is almost the same as Case B, which GCP should be able to detect and convert to the simpler BigQuery schema if we use the enableListInference option. The only difference is that the fhir-data-pipes output uses array as the name of the inner group, whereas the example in Case B uses list as the name.
So if enableListInference doesn't work the first time, try renaming it from array to list?
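To make Cases A through D concrete, here is a toy model of what list inference does to a schema. It is only a sketch under assumptions: the dict-based schema format, the `infer_lists` function, and the `inner_names` parameter are all made up for illustration, and this is not BigQuery's actual algorithm.

```python
from typing import Any, Dict, Tuple

Field = Dict[str, Any]

# Case D: the fhir-data-pipes shape, where the inner repeated group is named "array".
# The "family" child is a hypothetical stand-in for the elided fields.
CASE_D: Field = {
    "name": "name",
    "mode": "NULLABLE",
    "logical": "LIST",
    "fields": [
        {
            "name": "array",
            "mode": "REPEATED",
            "fields": [{"name": "family", "mode": "NULLABLE", "fields": []}],
        }
    ],
}


def infer_lists(field: Field, inner_names: Tuple[str, ...] = ("list", "array")) -> Field:
    """Collapse a LIST wrapper group into a single repeated field (Case A)."""
    # Recurse first so that nested lists are collapsed as well.
    children = [infer_lists(child, inner_names) for child in field.get("fields", [])]
    is_list_wrapper = (
        field.get("logical") == "LIST"
        and len(children) == 1
        and children[0]["mode"] == "REPEATED"
        and children[0]["name"] in inner_names
    )
    if is_list_wrapper:
        # Drop the wrapper level and hoist the inner group's fields up.
        return {"name": field["name"], "mode": "REPEATED", "fields": children[0]["fields"]}
    # No inference applies: keep the structure as-is (this is Case C).
    return {"name": field["name"], "mode": field["mode"], "fields": children}


CASE_A = infer_lists(CASE_D)                  # inner name "array" is recognised
CASE_C = infer_lists(CASE_D, inner_names=())  # nothing recognised, wrapper kept
```

Passing `inner_names=("list",)` models the worry above: if only `list` were recognised, the `array` group would fall through to the Case C shape, and renaming it would be the fix.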
I think we can close this now! We can get the schemas to match and upload successfully to BigQuery with the following command:
bq --project_id=PROJECT_ID load \
--autodetect \
--source_format=PARQUET \
--parquet_enable_list_inference \
DATASET_NAME.TABLE_NAME gs://GCS_BUCKET_NAME/PARQUET_WILDCARD
The magic flag is --parquet_enable_list_inference, which uses schema inference for Parquet LIST logical types: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#list_logical_type
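For anyone loading through the REST API instead of the bq CLI, the same settings can be sketched as a load job request body, where the flag corresponds to `parquetOptions.enableListInference`. This is a minimal sketch: the helper name `parquet_load_job_body` is made up, the arguments are the same placeholders as in the command above, and the code only builds the JSON without sending it anywhere.

```python
import json
from typing import Any, Dict


def parquet_load_job_body(
    project_id: str, dataset_id: str, table_id: str, source_uri: str
) -> Dict[str, Any]:
    """Build a BigQuery load-job body with Parquet list inference enabled."""
    return {
        "configuration": {
            "load": {
                "sourceUris": [source_uri],
                "sourceFormat": "PARQUET",
                "autodetect": True,
                # Counterpart of --parquet_enable_list_inference.
                "parquetOptions": {"enableListInference": True},
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
            }
        }
    }


# Placeholder values, as in the bq command above.
BODY = parquet_load_job_body(
    "PROJECT_ID", "DATASET_NAME", "TABLE_NAME", "gs://GCS_BUCKET_NAME/PARQUET_WILDCARD"
)
print(json.dumps(BODY, indent=2))
```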
[putting on my best Bashir impression]
Definitely a lot of progress, but it looks like Atul has found some other differences in how fhir-data-pipes exports in comparison to BQ, and even what is mentioned in the SQL-on-FHIR spec. I think we should leave this open.
@atulai-sg Can you triage the differences, recommend actions (ideally with a work estimate and priority), and consider what kind of testing we might want to put into place? One easier possibility might be to create a gold-file test that checks whether the parquet schema from the output matches a "golden version" of the schema.
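A gold-file check along those lines could look roughly like the sketch below. It assumes the schema is available as text (for example, dumped by a tool such as parq-cli); the `schema_diff` helper and both schema strings are hypothetical and are not taken from the actual pipeline output.

```python
import difflib
from typing import List


def schema_diff(actual: str, golden: str) -> List[str]:
    """Return unified-diff lines between two schema dumps (empty if they match)."""
    # Strip indentation so that pure whitespace changes do not fail the check.
    actual_lines = [line.strip() for line in actual.strip().splitlines()]
    golden_lines = [line.strip() for line in golden.strip().splitlines()]
    return list(
        difflib.unified_diff(
            golden_lines, actual_lines, fromfile="golden", tofile="actual", lineterm=""
        )
    )


# Hypothetical golden and actual schema dumps, for illustration only.
GOLDEN = """
optional group name (LIST) {
  repeated group list {
    optional binary family (STRING);
  }
}
"""
ACTUAL = GOLDEN.replace("repeated group list", "repeated group array")

DIFF = schema_diff(ACTUAL, GOLDEN)
for line in DIFF:
    print(line)
```

In a real test, the golden text would be read from a checked-in file and the test would fail whenever `schema_diff` returns a non-empty list, so any schema change has to be accompanied by an explicit update of the golden file.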
Thanks folks for the progress on this. It seems the main issue around Parquet vs BigQuery list handling is resolved, at least in one direction (i.e., Parquet -> BQ), which is great.
Yeah, I agree that to close this issue we need to:
Once we answer the above two, we should close this. I am pretty sure the list of schema differences is larger than just that one list issue (although that was a major one).
FYI Bashir - Some differences have already been identified in the document. Some of them are cases where our output doesn't follow the SQL-on-FHIR spec (e.g. we're not dropping "Id"s while the GCP export is).
I think @atulai-sg has discovered some new differences, e.g., around having identifier in References. So I am reopening this to track those issues here as well.
There are currently some differences between Bunsen's SQL-on-FHIR schema and what the GCP HCLS tools create in BigQuery. We should make these two consistent.