snowplow / dataflow-runner

Run templatable playbooks of Hadoop/Spark/et al jobs on Amazon EMR
http://snowplowanalytics.com
19 stars 8 forks source link

Release 0.4.2 #60

Closed istreeter closed 4 years ago

istreeter commented 4 years ago

This change was required to let us run flink applications on EMR using dataflow-runner.

coveralls commented 4 years ago

Coverage Status

Coverage decreased (-0.05%) to 93.973% when pulling 75bbcef8bbdafcc20df67f832711cb6c70a526a7 on release-0.4.2 into b83d70364ddef7030cc4dd4d9390c89066c16004 on master.

istreeter commented 4 years ago

I had mixed feelings about expanding the list of applications. Here are some of the thoughts going round in my head:

  1. This is an open-source tool written to be deliberately independent of all other Snowplow products. So we (Snowplow) might not need the other applications, but I feel the tool should support them.
  2. Maybe we should abandon the list of allowed applications altogether. If the command line contains an invalid application then aws will respond with an error message. Maybe there is no need for this tool to do an additional check.
  3. ...and I would also like to know why the list was limited in the first place.
colmsnowplow commented 4 years ago

It might be relevant that there are schemas for the configs, btw (more of an FYI): https://github.com/snowplow/iglu-central/tree/master/schemas/com.snowplowanalytics.dataflowrunner

I share both of your doubt - not knowing why it was there in the first place leaves me in doubt as to whether we're missing something.

Re: supporting other applications, there's also the factor that including them implies that they're tested and supported, and there's a maintenance overhead to that. If the demand to support another application is there, I think I'd rather encourage that people raise issues and pull requests than accidentally create an expectation. However I don't feel strongly about this, it's a vague gut feeling more than anything.

paulboocock commented 4 years ago

When I consider these two points you've just made:

there's also the factor that including them implies that they're tested and supported

Maybe we should abandon the list of allowed applications altogether. If the command line contains an invalid application then aws will respond with an error message.

I lean towards removing the check altogether. By listing them, I think it suggests they are supported but by removing the check then we are making no promises in the code that we support all these options (and README/guides would only include instructions for those we know work and are tested). (still wonder why it was added in the first place though...)

colmsnowplow commented 4 years ago

@chuwy have you got an opinion on the above conversation?

chuwy commented 4 years ago

I agree with Paul and Ian on "removing the check altogether" and wouldn't be too concerned about why we added this list initially. Or I rather have too many ideas why we added it and only one of them (documentation purposes) is worth to consider.

Also, I really doubt we ever used/tested Pig, Mahout or even Hive with Dataflow Runner.

istreeter commented 4 years ago

I removed the check of valid application names. This is now the error would receive if you run dataflow-runner up with an invalid application name:

ERRO[0000] ValidationException: Specified application: Snowplow is invalid.
        status code: 400, request id: bdb66ae0-30a2-4b2a-b464-77b641dcfaa4
ValidationException: Specified application: Snowplow is invalid.
        status code: 400, request id: bdb66ae0-30a2-4b2a-b464-77b641dcfaa4

The exit code is 1.