My first idea was #182: generate a new Hadoop DSL task for every definition set file under `src/main/definitions`. At build time, each of these tasks would clear the state of the Hadoop DSL, execute the Hadoop DSL compilation for that definition set plus the user profile and the user's workflow scripts, and place the output in its own subdirectory of the build directory. The implementation worked perfectly!
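To make the #182 approach concrete, here is a rough sketch of the idea, not the actual implementation: the `clearHadoopDslState`, `evaluateHadoopDslFile`, and `compileHadoopDsl` helpers below are hypothetical placeholders, and the profile lookup is just an assumption for illustration.

```groovy
// Hypothetical sketch of the per-definition-set task generation from #182.
fileTree('src/main/definitions').matching { include '*.gradle' }.each { defFile ->
  def setName = defFile.name - '.gradle'

  tasks.create("buildHadoopDsl_${setName}") {
    doLast {
      clearHadoopDslState()                  // hypothetical: wipe the Hadoop DSL state
      evaluateHadoopDslFile(defFile)         // this definition set
      evaluateHadoopDslFile(file("src/main/profiles/${System.properties['user.name']}.gradle"))  // hypothetical profile lookup
      fileTree('src/main/gradle').each { evaluateHadoopDslFile(it) }       // the user's workflow scripts
      compileHadoopDsl(file("${buildDir}/hadoop/${setName}"))              // hypothetical: compile into its own subdirectory
    }
  }
}
```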
Unfortunately, this idea broke a number of the other Hadoop Plugin tasks and some of our other internal plugins, such as data-mock. The invariant it breaks is that these tools assume the Hadoop DSL workflows are fully set up at Gradle configuration time. In my scheme, the Hadoop DSL is actually empty at configuration time: it is not filled in until the auto-generated subtasks run (and even then, it is cleared between each subtask).
This breaks tasks such as "showPigJobs", which shows the Apache Pig jobs currently configured in the Hadoop DSL, and the associated "runPigJobs" task, since there are no configured jobs at Gradle configuration time! It would have been possible to hack the tasks to make this work, but it was not going to work well.
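For example (a hedged sketch; the job-listing logic shown here is only illustrative, not the plugin's actual code), a task like "showPigJobs" needs the DSL to be populated while the build is being configured:

```groovy
// Illustrative only: a configuration-time task wiring that assumes the
// Hadoop DSL is already populated. Under the #182 scheme, the list of
// Pig jobs gathered here would always be empty, because the DSL is not
// filled in until the generated subtasks actually execute.
task showPigJobs {
  def pigJobs = collectConfiguredPigJobs()   // hypothetical helper that walks the Hadoop DSL state
  doLast {
    pigJobs.each { println "Pig job: ${it}" }
  }
}
```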
In the current #183, the user adds their definition set files to `src/main/definitions`, their profile scripts to `src/main/profiles`, and their workflow scripts to `src/main/gradle` (as before). Then in their build.gradle, all they have to do is:
```groovy
// Apply the Hadoop Plugin
plugins {
  id 'com.linkedin.gradle.hadoop.HadoopPlugin' version '0.13.1'
}

// Configure Hadoop DSL auto builds and call the autoSetup method
hadoopDslBuild {
  // Can customize any of the paths shown below. Currently shown with the default paths.
  definitions = 'src/main/definitions'
  profiles = 'src/main/profiles'
  workflows = 'src/main/gradle'

  // Displays information about the automatic setup
  showSetup = true
}.autoSetup()  // Call autoSetup() when you are done to set up the Hadoop DSL for each definition set file in the definitions path

// If accepting the default properties, this can be written as a one-liner:
hadoopDslBuild { }.autoSetup()
```
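For reference, a project that accepts the default paths might be laid out roughly like this (the individual file names here are made up for illustration):

```
build.gradle
src/main/definitions/devGrid.gradle     // one definition set file per target grid
src/main/definitions/prodGrid.gradle
src/main/profiles/<username>.gradle     // the user's profile script
src/main/gradle/workflows.gradle        // the user's Hadoop DSL workflow scripts
```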
When `autoSetup()` is called, it completely configures the Hadoop DSL for each definition set file in the `definitions` path. Each file results in the creation of a Hadoop DSL `namespace` with the same name as the definition set file, and the user's workflows are recreated in each of these namespaces. This should prevent the new feature from breaking any of our existing tools or associated plugins.
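As a concrete (hedged) illustration, suppose there are two definition set files; the file names and property values below are made up, but `definitionSet` is the Hadoop DSL construct for declaring one:

```groovy
// src/main/definitions/devGrid.gradle (hypothetical file)
definitionSet defs: [
  inputPath  : '/data/dev/events',
  clusterUri : 'hdfs://dev-cluster:9000'
]

// src/main/definitions/prodGrid.gradle (hypothetical file)
definitionSet defs: [
  inputPath  : '/data/prod/events',
  clusterUri : 'hdfs://prod-cluster:9000'
]
```

After `autoSetup()` runs, the workflows from `src/main/gradle` exist twice in the DSL, once under a namespace named `devGrid` and once under `prodGrid`, so tools that inspect the DSL at configuration time still see fully configured workflows.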
I have observed that users struggle to use the Hadoop DSL `hadoopClosure` and `namespace` language features to create multi-grid Hadoop DSL workflows. I have an idea to improve this situation. Essentially, I will provide a new Hadoop DSL mechanism in which you specify multiple definition sets against which to evaluate the Hadoop DSL. The `buildAzkabanFlows` task will then evaluate the Hadoop DSL for the first definition set, clear its state, re-evaluate the Hadoop DSL for the second definition set, clear its state again, and so on. After each re-evaluation of the Hadoop DSL, the compiled output will be written to a unique output location. Basically, we'll evaluate the Hadoop DSL for each `definitionSet`, clearing it in between, and write the compiled output to a different location each time.

The advantage is that this greatly simplifies the mental model users need. They will just use `lookupDef` to look up any values that differ between grids. Users will not have to use the `hadoopClosure` or `namespace` language constructs for multi-grid builds (although these features can still be used).

The disadvantage is that it will be slower, since it re-evaluates all of your Hadoop DSL for each `definitionSet`. In addition, you will get WARNING messages from the Hadoop DSL static checker for each re-evaluation. You will also have a lot more compiled output files (a very minor concern).
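To illustrate the `lookupDef`-only mental model, here is a hedged sketch of a workflow script (the workflow, job, script, and property names are made up); it never mentions `hadoopClosure` or `namespace`, and each per-grid value is simply looked up:

```groovy
// src/main/gradle/workflows.gradle (hypothetical file)
hadoop {
  buildPath "azkaban"

  workflow('countByCountry') {
    pigJob('countJob') {
      uses 'src/main/pig/count_by_country.pig'
      set parameters: [
        'inputPath': lookupDef('inputPath')  // resolves against whichever definition set is currently being evaluated
      ]
    }
    targets 'countJob'
  }
}
```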