aaelony commented 9 years ago

Wanted to get your thoughts.

In my use case, I actually need two usage modes; one is like yours, which I affectionately term Upsert/Clobber mode and the other I term Upsert/Storage mode. I'm open to other names as well.

The current behavior is the Clobber mode. It watches s3 buckets that match the patterns in the config and bucket manifest, performs the upsert when file changes are found, and then deletes the s3 files leaving the manifest behind untouched.

Storage mode would be designed for s3 buckets that keep semi-permanent data and do something entirely different. Storage mode would also watch s3 buckets that match the patterns in the config and bucket manifest, but then perform the upsert, not delete the s3 data files but delete the s3 bucket's manifest (upon successful data load into redshift) such that the bucket is no longer watched to avoid loading this data again. This is useful for loading data buckets that contain dated subfolders (e.g. 2015-01-13) and files for that date therein. That way the data is not deleted but loaded into redshift, and need not be loaded again.

Other modes could potentially be created if necessary or desired.

I haven't implemented this yet, but am considering it.

My questions are:

What are your thoughts on this in general?
Do you desire this kind of behavior (or something similar)?
Would you eventually want something like this merged back into Blueshift?
What are your thoughts around semantics of how to specify these Modes? Config file semantics? Manifest file semantics?

Thanks for your thoughts.

-Avram

aaelony commented 9 years ago

Current thoughts on a new specification for manifest and config files to support the two modes of operation above. This is a spec before I code this up (the code in my repo doesn't necessarily support this yet).

Modes of Operation

Normal Mode

Watches S3 buckets with a Manifest file that match the key pattern and data pattern.
Performs upsert on new data when found.
Deletes S3 files from bucket that were upserted into Redshift.
Keeps watching the S3 bucket as per the Manifest file in the bucket.

Example config file (Normal mode)

{:s3 {:credentials   {:access-key "***"
                      :secret-key "***"}
      :bucket        "your-bucket"
      :key-pattern   ".*"
      :poll-interval {:seconds 30}
      :server-side-encryption "AES256"    ;; optional
      }
 :telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}
 :redshift-connections [{:dw1 {:jdbc-url "jdbc:postgresql://foobar...."}}
                        {:dw2 {:jdbc-url "jdbc:postgresql://blahblah.fake.com..."}}
                        ]
 }

Example manifest.edn file (Normal mode)

{:table "mydata_fact"
 :pk-columns ["id" "blah"]
 :columns ["id" "blah" "timestamp" ]
 :database :dw1    ;; must match the config to resolve the jdbc-url
 :options      ["DELIMITER '\\t'" "IGNOREHEADER 1" "GZIP" "TRIMBLANKS" "TRUNCATECOLUMNS"]
 :data-pattern ".*blah.*.gz$"
 :keep-data-pattern-files-on-import false
 :keep-manifest-upon-import true
 }

Alternate Mode

Watches S3 buckets with a Manifest file that match the key pattern and data pattern.
Performs upsert on new data when found.
Does not delete S3 file data.
If the data loads successfully, deletes the Blueshift manifest file in the S3 bucket to stop watching the bucket.

Example config file (Alternate mode)

{:s3 {:credentials   {:access-key "***"
                      :secret-key "***"}
      :bucket        "your-bucket"
      :key-pattern   ".*"
      :poll-interval {:seconds 30}
      :server-side-encryption "AES256"    ;; optional
      }
 :telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}
 :redshift-connections [{:dw1 {:jdbc-url "jdbc:postgresql://foobar...."}}
                        {:dw2 {:jdbc-url "jdbc:postgresql://blahblah.fake.com..."}}
                        ]
 }

Example manifest.edn file (Alternate mode)

{:table "mydata_fact"
 :pk-columns ["id" "blah"]
 :columns ["id" "blah" "timestamp" ]
 :database :dw1    ;; must match the config to resolve the jdbc-url
 :options      ["DELIMITER '\\t'" "IGNOREHEADER 1" "GZIP" "TRIMBLANKS" "TRUNCATECOLUMNS"]
 :data-pattern ".*blah.*.gz$"
 :keep-data-pattern-files-on-import true
 :keep-manifest-upon-import false
 }

Would be great to get uswitch thoughts on this. -A

nullpointerexcept commented 8 years ago

Did you implement this? I don't see it in your fork.

aaelony commented 8 years ago

Hello, yes I have this implemented and I've used it for at least a year. I will definitely make it available once I get the chance to look into it (it's been a while) :)

aaelony commented 8 years ago

update: found the code, will carve out time to create a branch (hopefully today)...

aaelony commented 8 years ago

Okay, it's merged to master branch now. Lmk how it goes... ;)

nullpointerexcept commented 8 years ago

Thank you! Have you looked into a mode that doesn't upsert? If the data doesn't have a primary key, I'd like Blueshift to just do a simple insert.

uswitch / blueshift

Thoughts around new usage "modes" (?) #7

Modes of Operation

Normal Mode

Alternate Mode