snowplow / dataflow-runner

Run templatable playbooks of Hadoop/Spark/et al jobs on Amazon EMR
http://snowplowanalytics.com

Make steps immediately submitted to transient cluster #38

Closed chuwy closed 2 years ago

chuwy commented 6 years ago

When we run a playbook with run-transient, dataflow-runner first starts a cluster and submits the steps from the playbook only once the cluster is running. This can lead to race conditions when config files are synced/deployed after dataflow-runner has started but before the cluster is up.

Submitting steps immediately would also prevent failures where the playbook refers to another file (base64File, for example) that is not available. Right now it starts a cluster, sees that the file is unavailable and terminates the cluster, whereas it could report an error without launching a cluster at all.
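
To illustrate the fail-fast idea, here is a minimal Python sketch, not dataflow-runner's actual code. The playbook layout, the `files` key, and the `launch_cluster` callback are all hypothetical names introduced for this example; the point is only that referenced files are checked before any cluster is requested.

```python
import os


def missing_files(playbook, exists=os.path.isfile):
    """Return the referenced file paths that are not available.

    `playbook` is a plain dict; the "files" key on each step is a
    hypothetical field used only for this sketch.
    """
    referenced = []
    for step in playbook.get("steps", []):
        referenced.extend(step.get("files", []))
    return [path for path in referenced if not exists(path)]


def run_transient(playbook, launch_cluster, exists=os.path.isfile):
    """Validate file references first, and launch a cluster only if all exist."""
    missing = missing_files(playbook, exists)
    if missing:
        # Fail before any cluster is started, so nothing has to be terminated.
        raise FileNotFoundError("playbook refers to missing files: " + ", ".join(missing))
    return launch_cluster(playbook)
```

With this ordering a missing file costs one local check rather than a cluster start and teardown.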

jbeemster commented 6 years ago

I think we had the same issue with SQL Runner. The solution there was to template the steps and store them in memory at launch, rather than fetching them at the time they are needed. This might be simpler than altering the logic for launching a new cluster?
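
A rough Python sketch of that "resolve at launch, submit later" approach, under stated assumptions: `string.Template` stands in for whatever templating the runner really uses, and the step layout, the `base64_file` key, and the `read_file`, `launch_cluster` and `submit_steps` callbacks are hypothetical names for this example only.

```python
import base64
from string import Template


def resolve_steps(raw_steps, variables, read_file):
    """Template every step and inline referenced files, before any cluster exists."""
    resolved = []
    for step in raw_steps:
        args = [Template(arg).substitute(variables) for arg in step.get("arguments", [])]
        if "base64_file" in step:
            # Read the file now, so a later redeploy cannot change what is submitted.
            content = read_file(step["base64_file"])
            args.append(base64.b64encode(content).decode("ascii"))
        resolved.append({"name": step["name"], "arguments": args})
    return resolved


def run_transient(raw_steps, variables, read_file, launch_cluster, submit_steps):
    """Snapshot the steps in memory first, then launch the cluster and submit them."""
    steps = resolve_steps(raw_steps, variables, read_file)
    cluster_id = launch_cluster()  # may take minutes; files can change meanwhile
    submit_steps(cluster_id, steps)  # submits the snapshot, not a fresh read
    return cluster_id
```

Because the snapshot is taken before the cluster is requested, a config file that is redeployed while the cluster boots does not affect the submitted steps, and an unreadable file raises before anything is launched.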

chuwy commented 6 years ago

Both options look good to me; I don't have a strong preference.

alexanderdean commented 6 years ago

Can you submit steps to a cluster before it's started?

chuwy commented 6 years ago

I thought EmrEtlRunner does it?

chuwy commented 6 years ago

E.g. I see steps even on clusters that failed during validation.

alexanderdean commented 6 years ago

Got it, then it feels like best is: