snowplow / dataflow-runner

Run templatable playbooks of Hadoop/Spark/et al jobs on Amazon EMR
http://snowplowanalytics.com

Rely on exponential backoff when requesting the cluster's status #68

Closed · oguzhanunlu closed this 3 years ago

oguzhanunlu commented 3 years ago

Thanks for the quick review @chuwy, I also wasn't sure about the strategy.

but on ones where we see this throughput error

Yeah, it makes sense: the first hit of the rate limit exception can be followed by the backoff strategy.

On the strategy, how about min (30 sec) - max (1 min) - factor (1.1) with jitter enabled, so that we get 30 sec → 33 sec → 36.3 sec and so on? Compared to a constant 30 sec, a factor of 1.1 is a bit more polite. What do you think?
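A minimal Go sketch of that schedule (the helper and parameter names are illustrative only, not dataflow-runner's actual code):

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// nextDelay sketches the proposed schedule: start at 30 sec, multiply by 1.1
// on every attempt, cap at 1 min, and add a little random jitter.
func nextDelay(attempt int) time.Duration {
	const (
		minDelay = 30 * time.Second
		maxDelay = 1 * time.Minute
		factor   = 1.1
	)
	d := time.Duration(float64(minDelay) * math.Pow(factor, float64(attempt)))
	if d > maxDelay {
		d = maxDelay
	}
	// Jitter (up to 10% of the delay) keeps concurrent pollers from hitting
	// the EMR API in lockstep.
	return d + time.Duration(rand.Int63n(int64(d)/10))
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, nextDelay(attempt))
	}
}
```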

but in 4-6 mins the chance is very high, so I think we should be increasing the check frequency with time

I'll check whether I can do this without reinventing the wheel.

chuwy commented 3 years ago

What I meant is that the backoff strategy should be applied only to faulty pipelines, i.e. only after we see a throughput exception; in normal circumstances it would do more harm than good (it would delay the job).

Another point is that if we're going to use a backoff period, it should be a reverse backoff, i.e. 3 min, 1 min, 30 sec, 20 sec. It reduces the overall number of calls, but also makes sure the job isn't stuck.

I think this is what we need to do:

  1. Follow the above strategy: 3 min, 1 min, 30 sec, 20 sec, 15 sec, 15 sec, 15 sec... (although it's worth checking with someone whether there are statistics on cluster startup time)
  2. Whenever we hit a throughput exception (exactly this one), add time instead of failing (see the sketch below)

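A rough Go sketch of both points together (the function names and the throttling check here are hypothetical, not the actual dataflow-runner implementation):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Decreasing ("reverse") backoff: wait long at first, when the cluster is
// unlikely to be ready, then poll more and more frequently.
var pollSchedule = []time.Duration{
	3 * time.Minute,
	1 * time.Minute,
	30 * time.Second,
	20 * time.Second,
	15 * time.Second, // every later check repeats the 15-second interval
}

// errThrottled stands in for EMR's throughput ("Rate exceeded") exception.
var errThrottled = errors.New("ThrottlingException")

// waitForCluster polls checkStatus until the cluster is ready. On a
// throttling error it waits an extra interval instead of failing the job.
func waitForCluster(checkStatus func() (bool, error)) error {
	for attempt := 0; ; attempt++ {
		idx := attempt
		if idx >= len(pollSchedule) {
			idx = len(pollSchedule) - 1
		}
		time.Sleep(pollSchedule[idx])

		ready, err := checkStatus()
		switch {
		case errors.Is(err, errThrottled):
			// Throttled: add time rather than aborting the run.
			time.Sleep(pollSchedule[idx])
		case err != nil:
			return err
		case ready:
			return nil
		}
	}
}

func main() {
	calls := 0
	// With the real schedule this demo waits a few minutes before finishing.
	err := waitForCluster(func() (bool, error) {
		calls++
		return calls >= 3, nil // pretend the cluster is ready on the third check
	})
	fmt.Println("done:", err)
}
```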
oguzhanunlu commented 3 years ago

Continuing at #69