spotify / scio

A Scala API for Apache Beam and Google Cloud Dataflow.
https://spotify.github.io/scio
Apache License 2.0
2.56k stars 513 forks source link

Support parameterized rate limit via ValueProvider for RateLimiterDoFn #4513

Open richpodraza opened 2 years ago

richpodraza commented 2 years ago

I would like to be able to use RateLimiterDoFn but have the rate itself (the recordsPerSecond value) provided as a runtime parameter via PipelineOptions. This allows for tweaking the rate without recompiling the template.

To keep the classes simple and not break existing RateLimiterDoFn use, I suggest adding a new class ValueProvidedRateLimiterDoFn. It would essentially be the same as RateLimiterDoFn, but change the type of recordsPerSecond to ValueProvider. createResource would call recordsPerSecond.get().

I'm happy to submit a PR if someone can let me know if this is an acceptable idea.

RustedBones commented 2 years ago

Hi @richpodraza and sorry for the delayed answer.

RateLimiterDoFn can potentially be used in several places in the same pipeline: we can imagine a pipeline that branches into 2 SCollections, each having their own pace. Setting the value in the pipeline options implies we can only get a single, global limit.

@clairemcginty @kellen @farzad-sedghi any views on that ?

richpodraza commented 2 years ago

Thanks @RustedBones for your reply. I agree in a case like you described, it could be beneficial to use RateLimiterDoFn as-is. I'm suggesting another DoFn be added to the library, not to modify or replace RateLimiterDoFn. My main motivation (for providing as an option) is especially for when knowing what the exact rate limit should be is unknown, and allow for easier trial and error, instead of requiring a rebuild/redeploy.