spotify / ratatool

A tool for data sampling, data generation, and data diffing
Apache License 2.0

make bigsampler bq output partition configurable #705

Closed benkonz closed 6 months ago

benkonz commented 6 months ago

Adds a new argument to BigSampler called `bigqueryPartitioning`. It defaults to "DAY", which should maintain the same behavior as before. Users can pass in "DAY|HOUR|MONTH|YEAR", or NULL if no table partitioning is desired.

I'm making this change so that Ratatool works better with Spotify's internal Luigi BigQuery tasks, which use table sharding as their form of partitioning. When Ratatool sets the partitioning to ingestion day, it causes problems with retention.
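To make the accepted values concrete, here is a minimal Python sketch of the semantics described above. It is not the Scala code from this PR: the function name `parse_partitioning` and the choice to represent "no partitioning" as `None` are assumptions made for illustration.

```python
from typing import Optional

# Values listed in the PR description. "NULL" means no table partitioning.
ALLOWED_PARTITIONINGS = ("DAY", "HOUR", "MONTH", "YEAR")


def parse_partitioning(raw: str = "DAY") -> Optional[str]:
    """Hypothetical helper: map the argument value to a partitioning type.

    Returns None when partitioning is disabled via "NULL", and raises
    ValueError for anything outside the documented set.
    """
    if raw == "NULL":
        return None
    if raw not in ALLOWED_PARTITIONINGS:
        allowed = "|".join(ALLOWED_PARTITIONINGS)
        raise ValueError(f"expected {allowed} or NULL, got {raw!r}")
    return raw
```

Calling `parse_partitioning()` with no argument yields "DAY", mirroring the default that keeps the previous behavior.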

Tested by outputting this table via this workflow:

```yaml
apiVersion: workflow.data.spotify.com/v1alpha1
kind: Workflow
metadata:
  name: ratatool-internal-examples-stream-days-bigsampler
  namespace: data-quality-spotify
spec:
  resourceID: ratatool-internal-examples.stream.days.BigSampler
  componentID: ratatool-internal
  scheduling:
    schedule: daily
  serviceAccountRef:
    external: contours-test-pipeline@data-quality-spotify.iam.gserviceaccount.com
  docker:
    args:
      - 'wrap-luigi'
      - '--module'
      - 'luigi_tasks'
      - 'BigSampler'
      - '--uri-prefix'
      - 'gs://benk-playground'
      - '--project'
      - 'data-quality-spotify'
      - '--service-account'
      - 'contours-test-pipeline@data-quality-spotify.iam.gserviceaccount.com'
      - '--input-endpoint'
      - 'spotify-people:groups.groups_%Y%m%d'
      - '--output-endpoint'
      - 'data-quality-spotify:benk_test_eu.benk_test_eu_%Y%m%d'
      - '--sample'
      - '0.01'
      - '--partition'
      - '{}'
    terminationLogging: true
    image: 43ea5c916cd5a85623bf0de598da15982c29d8952dbf63a068d10e5b56466e61
  workflowAlertingDisabled: true
```

The `43ea5c916cd5a85623bf0de598da15982c29d8952dbf63a068d10e5b56466e61` Docker image uses the code from my local Ratatool PR, published via `sbt publishM2`.

The linked table has no partitioning and uses the sharding generated by the BigQueryTarget in Luigi.
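For readers unfamiliar with sharding, the `%Y%m%d` suffix in the workflow's `--output-endpoint` value suggests one table per day, named by date. The sketch below assumes the pattern is expanded strftime-style; how Luigi's BigQueryTarget actually resolves it is not shown in this PR, and both `resolve_shard` and the example date are hypothetical.

```python
from datetime import date


def resolve_shard(pattern: str, day: date) -> str:
    """Hypothetical helper: expand a strftime-style table pattern for one day."""
    return day.strftime(pattern)


# Output endpoint pattern taken from the workflow; the date is an arbitrary example.
table = resolve_shard(
    "data-quality-spotify:benk_test_eu.benk_test_eu_%Y%m%d",
    date(2024, 1, 15),
)
print(table)  # data-quality-spotify:benk_test_eu.benk_test_eu_20240115
```

Each day's run therefore writes to its own table, which is why an additional ingestion-day partition on top of the shard is unwanted here.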

Here is another table that uses the `--bigquery-partitioning` arg to set the partitioning to "MONTH".

codecov[bot] commented 6 months ago

Codecov Report

Attention: Patch coverage is 25.00000%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 70.91%. Comparing base (f637e88) to head (6ed1bbf). Report is 8 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...ala/com/spotify/ratatool/samplers/BigSampler.scala | 33.33% | 4 Missing :warning: |
| ...spotify/ratatool/samplers/BigSamplerBigQuery.scala | 0.00% | 2 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #705      +/-   ##
==========================================
- Coverage   71.09%   70.91%   -0.18%
==========================================
  Files          44       44
  Lines        1816     1822       +6
  Branches      292      301       +9
==========================================
+ Hits         1291     1292       +1
- Misses        525      530       +5
```

| [Flag](https://app.codecov.io/gh/spotify/ratatool/pull/705/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spotify) | Coverage Δ | |
|---|---|---|
| [ratatoolCli](https://app.codecov.io/gh/spotify/ratatool/pull/705/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spotify) | `2.90% <0.00%> (-0.02%)` | :arrow_down: |
| [ratatoolCommon](https://app.codecov.io/gh/spotify/ratatool/pull/705/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spotify) | `0.00% <ø> (ø)` | |
| [ratatoolDiffy](https://app.codecov.io/gh/spotify/ratatool/pull/705/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spotify) | `32.73% <0.00%> (-0.13%)` | :arrow_down: |
| [ratatoolExamples](https://app.codecov.io/gh/spotify/ratatool/pull/705/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spotify) | `17.34% <0.00%> (-0.07%)` | :arrow_down: |
| [ratatoolSampling](https://app.codecov.io/gh/spotify/ratatool/pull/705/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spotify) | `62.11% <25.00%> (-0.26%)` | :arrow_down: |
| [ratatoolScalacheck](https://app.codecov.io/gh/spotify/ratatool/pull/705/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spotify) | `78.14% <ø> (ø)` | |
| [ratatoolShapeless](https://app.codecov.io/gh/spotify/ratatool/pull/705/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spotify) | `4.18% <0.00%> (-0.02%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spotify#carryforward-flags-in-the-pull-request-comment) to find out more.
