uswitch / blueshift

Automate copying data from S3 into Amazon Redshift
Eclipse Public License 1.0
118 stars 19 forks source link

Thoughts around new usage "modes" (?) #7

Open aaelony opened 9 years ago

aaelony commented 9 years ago

Wanted to get your thoughts.

In my use case, I actually need two usage modes; one is like yours, which I affectionately term Upsert/Clobber mode and the other I term Upsert/Storage mode. I'm open to other names as well.

The current behavior is the Clobber mode. It watches s3 buckets that match the patterns in the config and bucket manifest, performs the upsert when file changes are found, and then deletes the s3 files leaving the manifest behind untouched.

Storage mode would be designed for s3 buckets that keep semi-permanent data and do something entirely different. Storage mode would also watch s3 buckets that match the patterns in the config and bucket manifest, but then perform the upsert, not delete the s3 data files but delete the s3 bucket's manifest (upon successful data load into redshift) such that the bucket is no longer watched to avoid loading this data again. This is useful for loading data buckets that contain dated subfolders (e.g. 2015-01-13) and files for that date therein. That way the data is not deleted but loaded into redshift, and need not be loaded again.

Other modes could potentially be created if necessary or desired.

I haven't implemented this yet, but am considering it.

My questions are:

  1. What are your thoughts on this in general?
  2. Do you desire this kind of behavior (or something similar)?
  3. Would you eventually want something like this merged back into Blueshift?
  4. What are your thoughts around semantics of how to specify these Modes? Config file semantics? Manifest file semantics?

Thanks for your thoughts.

-Avram

aaelony commented 9 years ago

Current thoughts on a new specification for manifest and config files to support the two modes of operation above. This is a spec before I code this up (the code in my repo doesn't necessarily support this yet).

Modes of Operation


Normal Mode

Example config file (Normal mode)

{:s3 {:credentials   {:access-key "***"
                      :secret-key "***"}
      :bucket        "your-bucket"
      :key-pattern   ".*"
      :poll-interval {:seconds 30}
      :server-side-encryption "AES256"    ;; optional
      }
 :telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}
 :redshift-connections [{:dw1 {:jdbc-url "jdbc:postgresql://foobar...."}}
                        {:dw2 {:jdbc-url "jdbc:postgresql://blahblah.fake.com..."}}
                        ]
 }

Example manifest.edn file (Normal mode)

{:table "mydata_fact"
 :pk-columns ["id" "blah"]
 :columns ["id" "blah" "timestamp" ]
 :database :dw1    ;; must match the config to resolve the jdbc-url
 :options      ["DELIMITER '\\t'" "IGNOREHEADER 1" "GZIP" "TRIMBLANKS" "TRUNCATECOLUMNS"]
 :data-pattern ".*blah.*.gz$"
 :keep-data-pattern-files-on-import false
 :keep-manifest-upon-import true
 }

Alternate Mode

Example config file (Alternate mode)

{:s3 {:credentials   {:access-key "***"
                      :secret-key "***"}
      :bucket        "your-bucket"
      :key-pattern   ".*"
      :poll-interval {:seconds 30}
      :server-side-encryption "AES256"    ;; optional
      }
 :telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}
 :redshift-connections [{:dw1 {:jdbc-url "jdbc:postgresql://foobar...."}}
                        {:dw2 {:jdbc-url "jdbc:postgresql://blahblah.fake.com..."}}
                        ]
 }

Example manifest.edn file (Alternate mode)

{:table "mydata_fact"
 :pk-columns ["id" "blah"]
 :columns ["id" "blah" "timestamp" ]
 :database :dw1    ;; must match the config to resolve the jdbc-url
 :options      ["DELIMITER '\\t'" "IGNOREHEADER 1" "GZIP" "TRIMBLANKS" "TRUNCATECOLUMNS"]
 :data-pattern ".*blah.*.gz$"
 :keep-data-pattern-files-on-import true
 :keep-manifest-upon-import false
 }

Would be great to get uswitch thoughts on this. -A

nullpointerexcept commented 8 years ago

Did you implement this? I don't see it in your fork.

aaelony commented 8 years ago

Hello, yes I have this implemented and I've used it for at least a year. I will definitely make it available once I get the chance to look into it (it's been a while) :)

aaelony commented 8 years ago

update: found the code, will carve out time to create a branch (hopefully today)...

aaelony commented 8 years ago

Okay, it's merged to master branch now. Lmk how it goes... ;)

nullpointerexcept commented 8 years ago

Thank you! Have you looked into a mode that doesn't upsert? If the data doesn't have a primary key, I'd like Blueshift to just do a simple insert.