google / fhir-data-pipes

A collection of tools for extracting FHIR resources and analytics services on top of that data.
https://google.github.io/fhir-data-pipes/
Apache License 2.0

Fix the issue of DirectRunner generating too many Parquet files. #1070

Closed bashir2 closed 4 months ago

bashir2 commented 4 months ago

Description of what I changed

Fixes #1063 by reverting #1047 for non-Dataflow runners. The fix in #1047 made the size of Parquet files a function of Beam's bundle size, which is not a good idea: besides causing #1063, it also hurts the performance of queries on the generated Parquet files. So we only do the FinishBundle flush for DataflowRunner, which tends to have very large bundle sizes.
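To make the effect of the conditional flush concrete, here is a minimal, self-contained sketch in plain Java. It does not use the Beam API or the real `FetchSearchPageFn`/`ParquetUtil` code; `SketchParquetWriter`, `SketchFetchFn`, and `BundleFlushSketch` are hypothetical names, and the runner is identified by a plain string purely for illustration. In a real Beam `DoFn`, `finishBundle()` would carry the `@FinishBundle` annotation and `teardown()` the `@Teardown` annotation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Parquet writer; it only counts records per closed file. */
class SketchParquetWriter {
  private final List<Integer> closedFileSizes = new ArrayList<>();
  private int openRecords = 0;

  void write(String record) {
    openRecords++;
  }

  /** Closes the current file, but only if it actually holds records. */
  void flush() {
    if (openRecords > 0) {
      closedFileSizes.add(openRecords);
      openRecords = 0;
    }
  }

  List<Integer> closedFileSizes() {
    return closedFileSizes;
  }
}

/** Hypothetical DoFn-like class showing the conditional bundle flush. */
class SketchFetchFn {
  private final boolean flushOnFinishBundle;
  private final SketchParquetWriter writer = new SketchParquetWriter();

  SketchFetchFn(String runnerName) {
    // Assumption: the runner is identified by name; the real detection logic is not shown here.
    this.flushOnFinishBundle = "DataflowRunner".equals(runnerName);
  }

  void processElement(String record) {
    writer.write(record);
  }

  /** Would be annotated with @FinishBundle in a Beam DoFn. */
  void finishBundle() {
    if (flushOnFinishBundle) {
      writer.flush();
    }
  }

  /** Would be annotated with @Teardown in a Beam DoFn; always closes what is left. */
  void teardown() {
    writer.flush();
  }

  List<Integer> fileSizes() {
    return writer.closedFileSizes();
  }
}

class BundleFlushSketch {
  /** Simulates a run with the given number of bundles and records per bundle. */
  static List<Integer> run(String runnerName, int bundles, int recordsPerBundle) {
    SketchFetchFn fn = new SketchFetchFn(runnerName);
    for (int b = 0; b < bundles; b++) {
      for (int r = 0; r < recordsPerBundle; r++) {
        fn.processElement("record-" + b + "-" + r);
      }
      fn.finishBundle();
    }
    fn.teardown();
    return fn.fileSizes();
  }

  public static void main(String[] args) {
    // Small bundles: one file overall for DirectRunner, one file per bundle for DataflowRunner.
    System.out.println("DirectRunner:   " + run("DirectRunner", 3, 1));
    System.out.println("DataflowRunner: " + run("DataflowRunner", 3, 1));
  }
}
```

With three single-record bundles, the sketch produces one three-record file for the DirectRunner case and three one-record files for the DataflowRunner case, which is the many-small-files behavior this PR avoids on runners with small bundles.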

I could force Parquet files to be closed on Dataflow and also avoid #1063 with DirectRunner by using Cleaner or finalize(). That solution is shown in the first commit of this PR, but it is a brutal/hacky solution that depends on the garbage collector. So I reverted that fix and went with the conditional @FinishBundle idea. We may need to reconsider #288 more carefully, i.e., using ParquetIO instead of our own ParquetUtil, but as mentioned in this comment, it has its own challenges.
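For readers unfamiliar with the rejected approach, the following is a rough sketch of how `java.lang.ref.Cleaner` can be used to close a resource. It is not the code from the first commit of this PR; `CleanerSketch`, `CloseFile`, and `SketchWriter` are hypothetical names, and the "file" is just a boolean flag.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

class CleanerSketch {
  private static final Cleaner CLEANER = Cleaner.create();

  /**
   * Cleanup action. It must not hold a reference to the owning writer,
   * otherwise the writer would never become unreachable.
   */
  static final class CloseFile implements Runnable {
    final AtomicBoolean closed = new AtomicBoolean(false);

    @Override
    public void run() {
      // A real implementation would close the underlying Parquet file here.
      closed.set(true);
    }
  }

  /** Hypothetical writer whose file is closed explicitly or, failing that, by the Cleaner. */
  static final class SketchWriter implements AutoCloseable {
    final CloseFile state = new CloseFile();
    private final Cleaner.Cleanable cleanable = CLEANER.register(this, state);

    @Override
    public void close() {
      // Runs the cleanup action at most once, whether triggered here or by the Cleaner.
      cleanable.clean();
    }
  }
}
```

If `close()` is never called, the cleanup action runs only after the garbage collector finds the writer unreachable, and the JVM gives no guarantee about when that happens. That timing dependence is the reason the description above calls this approach hacky.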

E2E test

TESTED:

Ran the pipeline with DirectRunner and confirmed that multiple records end up in a single Parquet file (as expected).

Checklist: I completed these to help reviewers :)

bashir2 commented 4 months ago

So as discussed @chandrashekar-s, I'll add the stopgap solution of doing @FinishBundle only for DataflowRunner but nothing else. I'll update this PR tomorrow.

codecov-commenter commented 4 months ago

Codecov Report

Attention: Patch coverage is 0%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 50.56%. Comparing base (4af9b38) to head (07d594a).

| Files | Patch % | Lines |
|---|---|---|
| ...a/com/google/fhir/analytics/FetchSearchPageFn.java | 0.00% | 5 Missing and 1 partial :warning: |
Additional details and impacted files

```diff
@@             Coverage Diff              @@
##             master    #1070      +/-   ##
============================================
- Coverage     50.60%   50.56%   -0.05%
  Complexity      674      674
============================================
  Files            91       91
  Lines          5511     5512       +1
  Branches        707      708       +1
============================================
- Hits           2789     2787       -2
- Misses         2461     2464       +3
  Partials        261      261
```
