snowplow / dataflow-runner

Run templatable playbooks of Hadoop/Spark/et al jobs on Amazon EMR
http://snowplowanalytics.com

Add Consul-based locking #17

Closed alexanderdean closed 7 years ago

alexanderdean commented 7 years ago

This lets us run specific jobs as singletons, to prevent overlapping runs.

SQL Runner has an implementation of this that we can borrow.

BenFradet commented 7 years ago

Am I right in assuming this and #20 wouldn't work with --async?

alexanderdean commented 7 years ago

I think we support Consul-locking with --async, but just accept that the user will have to deal with deleting the lock manually (just as they'll have to deal with monitoring the job manually).

I agree it's probably unusual to use both together, but I think let's not introduce the code complexity of banning this.

BenFradet commented 7 years ago

I have given this more thought and started implementing it, and it feels weird to acquire a lock and never release it.

I'd rather just not lock in case of async and emit a warning or something like that.
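To make the proposal concrete, here is a minimal Go sketch of "don't lock on async, warn instead". The helper `lockDecision`, its parameters, and the warning text are hypothetical and are not taken from the dataflow-runner code base.

```go
package main

import (
	"fmt"
	"os"
)

// lockDecision reports whether a lock should be taken for this run.
// Hypothetical helper: with --async nothing is left running to release
// the lock, so locking is skipped and a warning is returned instead.
func lockDecision(async bool, lockName string) (useLock bool, warning string) {
	if lockName == "" {
		return false, "" // no lock requested, nothing to warn about
	}
	if async {
		return false, fmt.Sprintf("ignoring lock %q: no lock is held for --async runs", lockName)
	}
	return true, ""
}

func main() {
	// Example values only: an async run that also asked for a lock.
	useLock, warning := lockDecision(true, "dataflow/nightly")
	if warning != "" {
		fmt.Fprintln(os.Stderr, "WARNING:", warning)
	}
	fmt.Println("take lock for this run:", useLock)
}
```

Erroring out instead of warning would only change what the caller does with a non-empty `warning`.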

BenFradet commented 7 years ago

@alexanderdean

Also, since there are no names associated with the playbook, we can lock based on either:

  1. cluster id + combination of the step names
  2. add a name field to the playbook config (which will have nothing to do with EMR) and do cluster id + playbook name

What do you think? I personally like 2 best.
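As a rough illustration, the two options above might produce lock keys like this in Go. The function names, the `/` and `+` separators, and the example cluster id are assumptions made for the sketch, not an agreed format.

```go
package main

import (
	"fmt"
	"strings"
)

// keyFromSteps sketches option 1: cluster id plus the combination of
// step names, joined in playbook order (separator chosen arbitrarily).
func keyFromSteps(clusterID string, stepNames []string) string {
	return clusterID + "/" + strings.Join(stepNames, "+")
}

// keyFromName sketches option 2: cluster id plus a playbook name taken
// from a hypothetical new "name" field in the playbook config.
func keyFromName(clusterID, playbookName string) string {
	return clusterID + "/" + playbookName
}

func main() {
	fmt.Println(keyFromSteps("j-EXAMPLE", []string{"copy", "enrich"}))
	fmt.Println(keyFromName("j-EXAMPLE", "nightly"))
}
```

With option 1, renaming or reordering a step changes the key, whereas option 2 keeps the key stable as long as the playbook name stays the same.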

alexanderdean commented 7 years ago

Fine by me on the --async - I suggest we warn or error out, up to you.

On the lock name - what does SQL Runner do?

BenFradet commented 7 years ago

AFAICT (@jbeemster might be able to weigh in here), SQL Runner uses the path provided by the command-line arg -lock.

I was thinking of abstracting this away from the user without any lock-related command line args.

alexanderdean commented 7 years ago

> I was thinking of abstracting this away from the user without any lock-related command line args.

I don't think that gives enough control. For example, one user might want to enforce a lock based on just the job name, while another company might want to enforce a single lock across two different jobs which both target the same database (risking a race condition if they run at the same time).

I'd go with @jbeemster's design - it's simple and flexible...