snowplow / kinesis-tee

Unix tee, but for Kinesis streams
http://snowplowanalytics.com/

Add daisychaining of transformation and filtering steps #18

Open alexanderdean opened 8 years ago

alexanderdean commented 8 years ago

Allow for multiple transformations, and for filters applied to the output of each transformation.

nakulgan commented 7 years ago

@alexanderdean @ninjabear This feature will add support for chaining Transforms and Filters in a sequence, on the assumption that the user is aware of the data inputs at each step.

I propose the following sample config:

{
  "name": "My Kinesis Tee example",
  "targetStream": {
    "name": "my-target-stream",
    "targetAccount": {
      "com.snowplowanalytics.kinesistee.config.TargetAccount": {
        "awsAccessKey": "*",
        "awsSecretAccessKey": "*",
        "region": "eu-west-1"
      }
    }
  },
  "operators": {
    "com.snowplowanalytics.kinesistee.config.Operators": [{
      "type": "TRANSFORM_BUILT_IN",
      "value": "..."
    },{
      "type": "TRANSFORM_BUILT_IN",
      "value": "..."
    },{
      "type": "FILTER_JAVASCRIPT",
      "value": "..."
    }]
  }
}

The probable downside of this approach is that Transformers and Filters would be treated alike, as generic operators.
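To make the "everything is an operator" idea concrete, here is a minimal Python sketch of how a chain in this style could be evaluated. It is only an illustration, not Kinesis Tee code: `run_chain`, the `(type, function)` tuple shape, and the stand-in functions are all hypothetical, and the only assumption taken from the config above is that the `type` prefix (`TRANSFORM_` or `FILTER_`) tells the two kinds of step apart.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical in-memory form of one operator: (type string, function).
# A TRANSFORM_* function maps a record to a new record;
# a FILTER_* function returns True to keep the record.
Operator = Tuple[str, Callable]


def run_chain(record: str, operators: List[Operator]) -> Optional[str]:
    """Apply operators in list order; return None if any filter rejects."""
    current = record
    for kind, fn in operators:
        if kind.startswith("TRANSFORM"):
            current = fn(current)
        elif kind.startswith("FILTER"):
            if not fn(current):
                return None  # record is dropped, later steps never run
        else:
            raise ValueError(f"unknown operator type: {kind}")
    return current


# Stand-in functions for illustration only; the real built-in and
# JavaScript operators are not modelled here.
operators = [
    ("TRANSFORM_BUILT_IN", str.strip),
    ("TRANSFORM_BUILT_IN", str.upper),
    ("FILTER_JAVASCRIPT", lambda r: r.startswith("A")),
]

kept = run_chain("  abc ", operators)
dropped = run_chain(" xyz", operators)
```

The position in the list is the only ordering information, so each step sees the output of the step before it, and a filter that rejects a record ends the chain for that record.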

The alternative approach is to add metadata. One simple way to do this would be to add a "stepOrder" field to each Transform or Filter:

{
  "schema": "iglu:com.snowplowanalytics.kinesistee.tbd",
  "data": {
    "name": "My Kinesis Tee example",
    "targetStream": {
      "name": "my-target-stream",
      "targetAccount": null
    },
    "transformer": {
      "com.snowplowanalytics.kinesistee.config.Transformer": [{
        "type": "TRANSFORM_BUILT_IN",
        "value": "SNOWPLOW_ENRICHED_EVENT_TO_NESTED_JSON",
        "stepOrder": 3
      },{
        "type": "TRANSFORM_JAVASCRIPT",
        "value": "...",
        "stepOrder": 1
      }]
    },
    "filter": {
      "com.snowplowanalytics.kinesistee.config.Filter": [{
        "type": "FILTER_JAVASCRIPT",
        "value": "...",
        "stepOrder": 2
      }]
    }
  }
}

Although I lean towards the former approach, the latter might be more suitable later in the roadmap, to support advanced daisy-chaining patterns such as "andThen" or "dependsOn".
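For comparison, a rough Python sketch of how the "stepOrder" variant could be resolved into a single sequence. Again, this is an assumption-laden illustration rather than the actual implementation: `ordered_steps` is a hypothetical helper, and rejecting duplicate "stepOrder" values is a choice made for this sketch, not something specified in the proposal.

```python
import json

# The "data" part of the stepOrder config above, with elided values left as "...".
data = json.loads("""
{
  "transformer": {
    "com.snowplowanalytics.kinesistee.config.Transformer": [
      {"type": "TRANSFORM_BUILT_IN",
       "value": "SNOWPLOW_ENRICHED_EVENT_TO_NESTED_JSON",
       "stepOrder": 3},
      {"type": "TRANSFORM_JAVASCRIPT", "value": "...", "stepOrder": 1}
    ]
  },
  "filter": {
    "com.snowplowanalytics.kinesistee.config.Filter": [
      {"type": "FILTER_JAVASCRIPT", "value": "...", "stepOrder": 2}
    ]
  }
}
""")


def ordered_steps(data: dict) -> list:
    """Merge the transformer and filter lists and sort them by stepOrder."""
    steps = []
    for key in ("transformer", "filter"):
        section = data.get(key) or {}  # tolerate a missing or null section
        for entries in section.values():
            steps.extend(entries)
    orders = [step["stepOrder"] for step in steps]
    if len(set(orders)) != len(orders):
        # Assumption for this sketch: ambiguous ordering is an error.
        raise ValueError("duplicate stepOrder values")
    return sorted(steps, key=lambda step: step["stepOrder"])


sequence = [step["type"] for step in ordered_steps(data)]
```

Here the execution order lives in the index rather than in the layout of the file, so the resolved sequence differs from the order in which the steps are written.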

alexanderdean commented 7 years ago

Thanks @nakulgan! I see the dilemma: option 1 overloads the type field a lot, while option 2 requires an out-of-band index, which could be quite brittle for humans...

@ninjabear what do you reckon?

ninjabear commented 7 years ago

Option 1 leaves the possibility of Option 2 open in the future, and it's obvious to the reader what's going on. Option 1 has my vote!

alexanderdean commented 7 years ago

Okay - let's go with option 1!

nakulgan commented 7 years ago

Cool, thanks guys.