spotify / scio

A Scala API for Apache Beam and Google Cloud Dataflow.
https://spotify.github.io/scio
Apache License 2.0

Support writing extra metadata in scio-parquet #5411

Open clairemcginty opened 2 months ago

clairemcginty commented 2 months ago

Parquet 1.14 supports setting extra file metadata via `ParquetWriter`: https://github.com/apache/parquet-java/pull/1241

I added a new `metadata` output param to the scio-parquet Avro/Magnolify/TensorFlow bindings, matching the naming convention we use in the scio-avro write APIs.
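For reviewers unfamiliar with the underlying feature, here is a minimal sketch of writing and reading back extra footer metadata directly with parquet-java from Scala. It assumes a parquet-avro release that includes the linked change (plus its Hadoop dependencies) on the classpath. The schema, file name, and metadata keys are made up for illustration, and this is not the scio-parquet API added in this PR.

```scala
import java.nio.file.Files

import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.io.{LocalInputFile, LocalOutputFile}

import scala.jdk.CollectionConverters._

object ExtraMetadataSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical one-field schema, used only for this example.
    val schema = SchemaBuilder
      .record("User")
      .fields()
      .requiredString("name")
      .endRecord()

    // Write into a fresh temp directory so the output file does not exist yet.
    val path = Files.createTempDirectory("parquet-meta").resolve("users.parquet")

    // withExtraMetaData takes a java.util.Map[String, String]; the entries
    // are stored in the key/value section of the Parquet file footer.
    val writer = AvroParquetWriter
      .builder[GenericRecord](new LocalOutputFile(path))
      .withSchema(schema)
      .withExtraMetaData(Map("pipeline" -> "example", "owner" -> "data-eng").asJava)
      .build()
    try {
      val record = new GenericData.Record(schema)
      record.put("name", "alice")
      writer.write(record)
    } finally writer.close()

    // Read the footer back and print its key/value metadata. The custom keys
    // should appear alongside entries the Avro write support adds itself.
    val reader = ParquetFileReader.open(new LocalInputFile(path))
    try {
      val keyValues = reader.getFooter.getFileMetaData.getKeyValueMetaData.asScala
      keyValues.foreach { case (k, v) => println(s"$k -> $v") }
    } finally reader.close()
  }
}
```

The new `metadata` param on the scio-parquet write methods is intended to forward a map like this one to the writer builder.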

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 61.27%. Comparing base (1e88ee3) to head (5ef843f).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #5411      +/-   ##
==========================================
+ Coverage   61.24%   61.27%   +0.03%
==========================================
  Files         310      310
  Lines       11058    11067       +9
  Branches      751      774      +23
==========================================
+ Hits         6772     6781       +9
  Misses       4286     4286
```


clairemcginty commented 2 months ago

This has some breaking API changes and is not urgent, so it can probably wait for the 0.15 release.