snowplow / dataflow-runner

Run templatable playbooks of Hadoop/Spark/et al jobs on Amazon EMR
http://snowplowanalytics.com

Rely on exponential backoff when requesting the cluster's status #68

Closed · oguzhanunlu closed this 3 years ago

oguzhanunlu commented 3 years ago

Thanks for the quick review @chuwy, I also wasn't sure about the strategy.

but on ones where we see this throughput error

Yeah, it makes sense: the first hit of the rate limit exception can be followed by the backoff strategy.

On the strategy, how about min (30 sec) - max (1 min) - factor (1.1) with jitter enabled, so that we get 30 sec → 33 sec → 36.3 sec and so on? Compared to a constant 30 sec, a factor of 1.1 is a bit more polite. What do you think?
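A minimal Go sketch of that schedule (the helper and parameter names are illustrative only, not dataflow-runner's actual code):

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// nextDelay sketches the proposed schedule: start at 30 sec, multiply by 1.1
// on every attempt, cap at 1 min, and add a little random jitter.
func nextDelay(attempt int) time.Duration {
	const (
		minDelay = 30 * time.Second
		maxDelay = 1 * time.Minute
		factor   = 1.1
	)
	d := time.Duration(float64(minDelay) * math.Pow(factor, float64(attempt)))
	if d > maxDelay {
		d = maxDelay
	}
	// Jitter (up to 10% of the delay) keeps concurrent pollers from hitting
	// the EMR API in lockstep.
	return d + time.Duration(rand.Int63n(int64(d)/10))
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, nextDelay(attempt))
	}
}
```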

but in 4-6 mins the chance is very high, so I think we should be increasing the check frequency with time

I'll check whether I can do this without reinventing the wheel.

chuwy commented 3 years ago

What I meant is that the backoff strategy should be applied only to faulty pipelines, i.e. only after we see a throughput exception; in normal circumstances it would do more harm than good (it would delay the job).

Another point is that if we're going to use a backoff period, it should be a reverse backoff, i.e. 3 min, 1 min, 30 sec, 20 sec. It reduces the overall number of calls, but also makes sure the job isn't stuck.

I think this is what we need to do:

  1. Follow the above strategy: 3 min, 1 min, 30 sec, 20 sec, 15 sec, 15 sec, 15 sec... (although it's worth checking with someone whether there are statistics on cluster startup time)
  2. Whenever we hit a throughput exception (exactly this one), add time instead of failing (see the sketch below)

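A rough Go sketch of both points together (the function names and the throttling check here are hypothetical, not the actual dataflow-runner implementation):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Decreasing ("reverse") backoff: wait long at first, when the cluster is
// unlikely to be ready, then poll more and more frequently.
var pollSchedule = []time.Duration{
	3 * time.Minute,
	1 * time.Minute,
	30 * time.Second,
	20 * time.Second,
	15 * time.Second, // every later check repeats the 15-second interval
}

// errThrottled stands in for EMR's throughput ("Rate exceeded") exception.
var errThrottled = errors.New("ThrottlingException")

// waitForCluster polls checkStatus until the cluster is ready. On a
// throttling error it waits an extra interval instead of failing the job.
func waitForCluster(checkStatus func() (bool, error)) error {
	for attempt := 0; ; attempt++ {
		idx := attempt
		if idx >= len(pollSchedule) {
			idx = len(pollSchedule) - 1
		}
		time.Sleep(pollSchedule[idx])

		ready, err := checkStatus()
		switch {
		case errors.Is(err, errThrottled):
			// Throttled: add time rather than aborting the run.
			time.Sleep(pollSchedule[idx])
		case err != nil:
			return err
		case ready:
			return nil
		}
	}
}

func main() {
	calls := 0
	// With the real schedule this demo waits a few minutes before finishing.
	err := waitForCluster(func() (bool, error) {
		calls++
		return calls >= 3, nil // pretend the cluster is ready on the third check
	})
	fmt.Println("done:", err)
}
```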
oguzhanunlu commented 3 years ago

Continuing at #69