Closed bashir2 closed 4 months ago
So as discussed @chandrashekar-s, I'll add the stopgap solution of doing @FinishBundle
only for DataflowRunner
but nothing else. I'll update this PR tomorrow.
Attention: Patch coverage is 0%
with 6 lines
in your changes are missing coverage. Please review.
Project coverage is 50.56%. Comparing base (
4af9b38
) to head (07d594a
).
Files | Patch % | Lines |
---|---|---|
...a/com/google/fhir/analytics/FetchSearchPageFn.java | 0.00% | 5 Missing and 1 partial :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Description of what I changed
Fixes #1063 by reverting #1047 for non-Dataflow runners. The fix in #1047 made the size of Parqut files a function of Beam's Bundle size which is not a good idea (beside causing #1063 it also impacts the performance of queries on generated Parquet files). So we only do the FinishBundle flush for
DataflowRunner
which tend to have very large Bundle size.I could force Parquet files to be closed on Dataflow and also avoid #1063 with
DirectRunner
by usingCleaner
orfinalize()
. That solution is shown in the first commit of this PR but it is a brutal/hacky solution that depends on garbage-collector. So I reverted that fix and went with the conditional@FinishBundle
idea. We may need to more carefully consider #288 again, i.e., to useParquetIO
instead of our ownParquetUtil
but as mentioned in this comment it has its own challenges.E2E test
TESTED:
Ran the pipeline with DirectRunner and confirmed that multiple records end up in a single Parquet files (as expected).
Checklist: I completed these to help reviewers :)
[x] I have read and will follow the review process.
[x] I am familiar with Google Style Guides for the language I have coded in.
No? Please take some time and review Java and Python style guides.
[x] My IDE is configured to follow the Google code styles.
No? Unsure? -> configure your IDE.
[ ] I have added tests to cover my changes. (If you refactored existing code that was well tested you do not have to add tests)
[x] I ran
mvn clean package
right before creating this pull request and added all formatting changes to my commit.[x] All new and existing tests passed.
[x] My pull request is based on the latest changes of the master branch.
No? Unsure? -> execute command
git pull --rebase upstream master