Common: simplify configuration

snowplow / snowplow-rdb-loader

Stores Snowplow enriched events in Redshift, Snowflake and Databricks

Other

31 stars 16 forks source link

Common: simplify configuration #21

Open chuwy opened 7 years ago

chuwy commented 7 years ago

Migrated from https://github.com/snowplow/snowplow/issues/3279

alexanderdean commented 7 years ago

Needs further thought where we go here.

chuwy commented 6 years ago

Example JSON with all required properties:

{
  "schema": "iglu:com.snowplowanalytics.rdbloader/config/jsonschema/1-0-0",
  "data": {
    "buckets": {
      "shreddedGood": "s3://snowplow-acme-processing/shredded/good/",
      "log": "s3://snowplow-acme-processing/logs/",
      "jsonpathAssets": "s3://snowplow-acme-assets/optional/"
    },
    "outputCompression": "GZIP",
    "rdbShredder": "0.13.0",
    "tracking": {
      "method": "POST",
      "appId": "acme-loading",
      "collector": "collector.acme.com:8080"
    }
  }
}

buckets.jsonpathAssets and tracking are optional.

Uses own vendor following dataflow runner and factotum examples.

alexanderdean commented 6 years ago

I think this is a good start! Comments:

Doesn't outputCompression relate to the contents of shreddedGood?
Shouldn't some of this be nested into a Redshift-specific sub-object?

chuwy commented 6 years ago

Doesn't outputCompression relate to the contents of shreddedGood?

Yep, probably. But as much as rdbShredder.

Shouldn't some of this be nested into a Redshift-specific sub-object?

Just jsonpathAssets, I believe? But for me JSONPaths a) although used only in Redshift - can be considered general as general tool; b) can be removed at all in future - I wouldn't consider them important enough to re-structure property that can be simple string - into complex object.

Another example could look like:

{
  "schema": "iglu:com.snowplowanalytics.rdbloader/config/jsonschema/1-0-0",
  "data": {
    "shredded": {
      "bucket": "s3://snowplow-acme-processing/shredded/good/",
      "compression": "GZIP",
      "shredder": "0.13.0"
    },
    "jsonpaths": "s3://snowplow-acme-assets/optional/",
    "logs": "s3://snowplow-acme-processing/logs/",
    "tracking": {
      "method": "POST",
      "appId": "acme-loading",
      "collector": "collector.acme.com:8080"
    }
  }
}

alexanderdean commented 6 years ago

Thanks @chuwy - I like the suggested alternative, it feels like a more coherent grouping...

chuwy commented 6 years ago

This should be a Shredder/Loader common configuration. As opposed to our target configuration, which is a low-level connection description - this is an application configuration.

Same should be done for strawberry: right now single config (that supposed to describe connection) includes unnecessary details about manifest and enriched archive.

alexanderdean commented 6 years ago

Totally agree. It's worth looking at the "other" RT apps, like the Scala Stream Collector and Stream Enrich. That has to be the direction of travel...

chuwy commented 6 years ago

Pushing this back as it'll require associated Snowplow/EmrEtlRunner release.

chuwy commented 6 years ago

Though, we can implement it in backward-compatible way, e.g. new Loader can understand following options:

--target-config $TARGET_JSON and --loader-config $LOADER_JSON for upcoming EER
--config $CONFIG_YML with deprecation warning for older/current EER

EmrEtlRunner also has to decide what format to use.

alexanderdean commented 6 years ago

Yes I like that approach. Good to push back though.