linkedin / linkedin-gradle-plugin-for-apache-hadoop


Convert flows to .yml along with .job for Azkaban Flow 2.0 #193

Closed reallocf closed 6 years ago

reallocf commented 6 years ago

We plan on adapting the Hadoop DSL to output two YAML files - a .flow file and a .project file - instead of .job/.properties files. Once Flow 2.0 is released, Azkaban will read a YAML (.flow) file for each flow, as the Flow 2.0 design specifies. The .project file will be used to define project-level properties, some of which are currently only configurable through the UI (like project permissions).

This makes it easier to define flow-level properties (such as schedules and data-availability-based triggers) and makes the generated .zip easier to understand (the .job files themselves are hard to read as a cohesive flow unit, especially in large projects).

Note that the Hadoop DSL can always be configured to output .job files, preserving backward compatibility with older Azkaban versions.
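
For illustration, here is roughly what a generated .flow file could look like. The structure follows Azkaban's Flow 2.0 YAML layout; the job names, types, and config values below are made up:

```yaml
# Hypothetical .flow output (structure per the Azkaban Flow 2.0 design;
# job names and commands are placeholders for illustration only)
config:
  user.to.proxy: azktest
nodes:
  - name: countWords
    type: command
    config:
      command: echo "counting words"
  - name: aggregate
    type: command
    dependsOn:
      - countWords
    config:
      command: echo "aggregating"
```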

@jamiesjc is in charge of Flow 2.0 and should feel free to add more info :smiley:

reallocf commented 6 years ago

I've built a local working version. It still requires some tweaking, but I should be making related commits soon. I plan on integrating with this repo in a few PRs:

  1. The YamlJob, YamlWorkflow, and YamlProject objects, which reduce the Job and Workflow objects to a minimal form for easy snakeyaml conversion. Properties/Property Sets are collapsed into the YamlWorkflow object under its configs section. The PR will also include the YamlCompiler object that actually builds and writes the YAML files. None of these classes will be "hooked in" to the codebase yet. Unit tests for these objects will be included. (See the rough SnakeYAML sketch after this list.)

  2. Wiring that allows users to decide whether they want to output .job/.properties files or .flow/.project (YAML) files. It will expose an API in Gradle, and the default will be .job/.properties. This PR will also include a few end-to-end tests that output .flow/.project files and cover the expected use cases.
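
For context, a rough Groovy sketch of the SnakeYAML step mentioned in item 1 - this is not the actual YamlCompiler code, and it assumes the workflow has already been reduced to a nested map:

```groovy
import org.yaml.snakeyaml.DumperOptions
import org.yaml.snakeyaml.Yaml

// Dump nested maps/lists in block style so the output reads like a hand-written .flow file.
DumperOptions options = new DumperOptions()
options.defaultFlowStyle = DumperOptions.FlowStyle.BLOCK
Yaml yaml = new Yaml(options)

// Minimal stand-in for a "minimized" workflow: just the data a .flow file needs.
def flow = [
  config: ['user.to.proxy': 'azktest'],
  nodes : [
    [name: 'jobA', type: 'command', config: [command: 'echo "job A"']],
    [name: 'jobB', type: 'command', dependsOn: ['jobA'], config: [command: 'echo "job B"']]
  ]
]

// Write the map out as YAML; the real compiler would do this once per workflow.
new File('build/myFlow.flow').withWriter('UTF-8') { writer ->
  yaml.dump(flow, writer)
}
```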

Future TODOs:

pranayhasan commented 6 years ago

@reallocf Is having the controlled roll-out plan still a TODO? If not, can you give more details on how we enable it for a few users? Currently, the Hadoop Plugin has two sub-projects: hadoop-plugin and an internal li-hadoop-plugin specific to LinkedIn's use case. So you would need to keep the custom rollout plan in li-hadoop-plugin. If you're planning to use any LinkedIn-internal application for experimentation/A-B testing, we can have an offline sync on how we can achieve this.

reallocf commented 6 years ago

@pranayhasan We weren't planning on doing anything as complicated as using Lix or A/B testing. In #200 I introduce the ability for users to opt in to outputting YAML. @jamiesjc and I were planning to reach out to a couple of teams and see what happens when they switch their flows to YAML. Once we've done that with a couple of larger teams with more complex flows, we'll have enough confidence to roll it out to everybody.

On top of that, Azkaban's ability to read .job/.properties files won't ever go away. Users will always be able to specify generateYamlOutput false in their hadoop closure if, for whatever reason, YAML isn't working for them and they need to release ASAP.
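
For example, a sketch of what that opt-out could look like in a project's build.gradle (generateYamlOutput comes from the description above; the other closure contents are assumptions):

```groovy
// Sketch only: falling back to .job/.properties output if YAML generation
// causes problems for a project.
hadoop {
  buildPath "azkaban"        // assumed existing Hadoop Plugin property
  generateYamlOutput false   // opt out of .flow/.project generation
}
```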

reallocf commented 6 years ago

After the second PR merge, this is complete! I'll reopen if any tasks related to this issue come up.

reallocf commented 6 years ago

Found out that I need to add functionality similar to what the second PR introduced to the li-hadoop-plugin to make this work for LinkedIn users.

reallocf commented 6 years ago

#208 significantly refactors the prior design for Flow 2.0 integration.

Instead of creating YamlWorkflow, YamlJob, and YamlProject objects, the YamlCompiler does the transformations directly from Workflow and Job objects.

I think YamlProject objects (or something equivalent) could be useful in the future, but they aren't necessary while a project is basically just a name (for the file) and a number (2.0, indicating that Flow 2.0 is being used). Once more features are added, an independent object that abstracts that info would be worthwhile.
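
For reference, a minimal .project file under Flow 2.0 is little more than that flow-version declaration (key name per the Azkaban Flow 2.0 docs, shown here as an illustration):

```yaml
# Declares that this project uses Azkaban Flow 2.0 and should be parsed from .flow files
azkaban-flow-version: 2.0
```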

reallocf commented 6 years ago

#210 integrates the li-hadoop-plugin subproject with Flow 2.0.

This ticket is now ready to be closed (again) :)

dmvieira commented 5 years ago

We built another project that generates Azkaban 3 flows for Flow V1 or V2, with plugin support: https://github.com/globocom/auror-core

jamiesjc commented 5 years ago

@dmvieira Your project looks awesome! I'd like to try it out soon. One question, does it also support converting an existing V1 Azkaban project to a V2 one?

dmvieira commented 5 years ago

Thank you @jamiesjc! Not yet, but I think it's pretty simple to do. Right now you can write a flow once in Python and generate the configuration for either version 1 or version 2.

dmvieira commented 5 years ago

We also built a jobtype cookiecutter for creating plugins faster: https://github.com/globocom/azkaban-jobtype-cookiecutter, and there's already an example here: https://github.com/globocom/azkaban-jobtype-email

HappyRay commented 5 years ago

@dmvieira Nice work. I plan to take a closer look when I get a chance.

dmvieira commented 5 years ago

@jamiesjc I opened an issue for that: https://github.com/globocom/auror-core/issues/3 😉