My first idea was #182: generate a new Hadoop DSL task for every definition set file under `src/main/definitions`. At build time, each of these tasks would clear the state of the Hadoop DSL, execute the Hadoop DSL compilation for that definition set plus the user profile and the user's workflow scripts, and place the output in its own subdirectory of the build directory. The implementation worked perfectly!
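To make the #182 approach concrete, here is a rough sketch of the idea, not the actual implementation: the `clearHadoopDslState`, `evaluateHadoopDslFile`, and `compileHadoopDsl` helpers below are hypothetical placeholders, and the profile lookup is just an assumption for illustration.

```groovy
// Hypothetical sketch of the per-definition-set task generation from #182.
fileTree('src/main/definitions').matching { include '*.gradle' }.each { defFile ->
  def setName = defFile.name - '.gradle'

  tasks.create("buildHadoopDsl_${setName}") {
    doLast {
      clearHadoopDslState()                  // hypothetical: wipe the Hadoop DSL state
      evaluateHadoopDslFile(defFile)         // this definition set
      evaluateHadoopDslFile(file("src/main/profiles/${System.properties['user.name']}.gradle"))  // hypothetical profile lookup
      fileTree('src/main/gradle').each { evaluateHadoopDslFile(it) }       // the user's workflow scripts
      compileHadoopDsl(file("${buildDir}/hadoop/${setName}"))              // hypothetical: compile into its own subdirectory
    }
  }
}
```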
Unfortunately, this idea broke a number of the other Hadoop Plugin tasks and some of our other internal plugins, such as data-mock. The invariant it breaks is that these tools assume the Hadoop DSL workflows are fully set up at Gradle configuration time. In my scheme, the Hadoop DSL is actually empty at configuration time: it is not filled in until the auto-generated subtasks run (and even then, it is cleared between each subtask).
This breaks tasks such as "showPigJobs", which shows the Apache Pig jobs currently configured in the Hadoop DSL, and the associated "runPigJobs" task, since there are no configured jobs at Gradle configuration time! It would have been possible to hack the tasks to make this work, but it was not going to work well.
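For example (a hedged sketch; the job-listing logic shown here is only illustrative, not the plugin's actual code), a task like "showPigJobs" needs the DSL to be populated while the build is being configured:

```groovy
// Illustrative only: a configuration-time task wiring that assumes the
// Hadoop DSL is already populated. Under the #182 scheme, the list of
// Pig jobs gathered here would always be empty, because the DSL is not
// filled in until the generated subtasks actually execute.
task showPigJobs {
  def pigJobs = collectConfiguredPigJobs()   // hypothetical helper that walks the Hadoop DSL state
  doLast {
    pigJobs.each { println "Pig job: ${it}" }
  }
}
```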
In the current #183, the user adds their definition set files to `src/main/definitions`, their profile scripts to `src/main/profiles`, and their workflow scripts to `src/main/gradle` (as before). Then in their build.gradle, all they have to do is:
```groovy
// Apply the Hadoop Plugin
plugins {
  id 'com.linkedin.gradle.hadoop.HadoopPlugin' version '0.13.1'
}

// Configure Hadoop DSL auto builds and call the autoSetup method
hadoopDslBuild {
  // Can customize any of the paths shown below. Currently shown with the default paths.
  definitions = 'src/main/definitions'
  profiles = 'src/main/profiles'
  workflows = 'src/main/gradle'

  // Displays information about the automatic setup
  showSetup = true
}.autoSetup()  // Call autoSetup() when you are done to set up the Hadoop DSL for each definition set file in the definitions path

// If accepting the default properties, this can be written as a one-liner:
hadoopDslBuild { }.autoSetup()
```
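For reference, a project that accepts the default paths might be laid out roughly like this (the individual file names here are made up for illustration):

```
build.gradle
src/main/definitions/devGrid.gradle     // one definition set file per target grid
src/main/definitions/prodGrid.gradle
src/main/profiles/<username>.gradle     // the user's profile script
src/main/gradle/workflows.gradle        // the user's Hadoop DSL workflow scripts
```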
When `autoSetup()` is called, it completely configures the Hadoop DSL for each definition set file in the `definitions` path. Each file results in the creation of a Hadoop DSL `namespace` with the same name as the definition set file, and the user's workflows are recreated in each of these namespaces. This should prevent the new feature from breaking any of our existing tools or associated plugins.
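As a concrete (hedged) illustration, suppose there are two definition set files; the file names and property values below are made up, but `definitionSet` is the Hadoop DSL construct for declaring one:

```groovy
// src/main/definitions/devGrid.gradle (hypothetical file)
definitionSet defs: [
  inputPath  : '/data/dev/events',
  clusterUri : 'hdfs://dev-cluster:9000'
]

// src/main/definitions/prodGrid.gradle (hypothetical file)
definitionSet defs: [
  inputPath  : '/data/prod/events',
  clusterUri : 'hdfs://prod-cluster:9000'
]
```

After `autoSetup()` runs, the workflows from `src/main/gradle` exist twice in the DSL, once under a namespace named `devGrid` and once under `prodGrid`, so tools that inspect the DSL at configuration time still see fully configured workflows.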
I have observed that users struggle to use the Hadoop DSL `hadoopClosure` and `namespace` language features to create multi-grid Hadoop DSL workflows. I have an idea to improve this situation. Essentially, I will provide a new Hadoop DSL mechanism in which you specify multiple definition sets against which to evaluate the Hadoop DSL. The `buildAzkabanFlows` task will then evaluate the Hadoop DSL for the first definition set, clear its state, re-evaluate the Hadoop DSL for the second definition set, clear its state again, and so on. After each re-evaluation of the Hadoop DSL, the compiled output will be written to a unique output location. Basically, we'll evaluate the Hadoop DSL for each `definitionSet`, clearing it in between, and write the compiled output to a different location each time.

The advantage is that this greatly simplifies the mental model users need. They will just use `lookupDef` to look up any values that differ between grids. Users will not have to use the `hadoopClosure` or `namespace` language constructs for multi-grid builds (although these features can still be used).

The disadvantage is that it will be slower, since it re-evaluates all of your Hadoop DSL for each `definitionSet`. In addition, you will get WARNING messages from the Hadoop DSL static checker for each re-evaluation. You will also have a lot more compiled output files (a very minor concern).
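To illustrate the `lookupDef`-only mental model, here is a hedged sketch of a workflow script (the workflow, job, script, and property names are made up); it never mentions `hadoopClosure` or `namespace`, and each per-grid value is simply looked up:

```groovy
// src/main/gradle/workflows.gradle (hypothetical file)
hadoop {
  buildPath "azkaban"

  workflow('countByCountry') {
    pigJob('countJob') {
      uses 'src/main/pig/count_by_country.pig'
      set parameters: [
        'inputPath': lookupDef('inputPath')  // resolves against whichever definition set is currently being evaluated
      ]
    }
    targets 'countJob'
  }
}
```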