Open aaelony opened 9 years ago
Current thoughts on a new specification for manifest and config files to support the two modes of operation above. This is a spec before I code this up (the code in my repo doesn't necessarily support this yet).
Example config file (Normal mode)
{:s3 {:credentials {:access-key "***"
:secret-key "***"}
:bucket "your-bucket"
:key-pattern ".*"
:poll-interval {:seconds 30}
:server-side-encryption "AES256" ;; optional
}
:telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}
:redshift-connections [{:dw1 {:jdbc-url "jdbc:postgresql://foobar...."}}
{:dw2 {:jdbc-url "jdbc:postgresql://blahblah.fake.com..."}}
]
}
Example manifest.edn file (Normal mode)
{:table "mydata_fact"
:pk-columns ["id" "blah"]
:columns ["id" "blah" "timestamp" ]
:database :dw1 ;; must match the config to resolve the jdbc-url
:options ["DELIMITER '\\t'" "IGNOREHEADER 1" "GZIP" "TRIMBLANKS" "TRUNCATECOLUMNS"]
:data-pattern ".*blah.*.gz$"
:keep-data-pattern-files-on-import false
:keep-manifest-upon-import true
}
Example config file (Alternate mode)
{:s3 {:credentials {:access-key "***"
:secret-key "***"}
:bucket "your-bucket"
:key-pattern ".*"
:poll-interval {:seconds 30}
:server-side-encryption "AES256" ;; optional
}
:telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}
:redshift-connections [{:dw1 {:jdbc-url "jdbc:postgresql://foobar...."}}
{:dw2 {:jdbc-url "jdbc:postgresql://blahblah.fake.com..."}}
]
}
Example manifest.edn file (Alternate mode)
{:table "mydata_fact"
:pk-columns ["id" "blah"]
:columns ["id" "blah" "timestamp" ]
:database :dw1 ;; must match the config to resolve the jdbc-url
:options ["DELIMITER '\\t'" "IGNOREHEADER 1" "GZIP" "TRIMBLANKS" "TRUNCATECOLUMNS"]
:data-pattern ".*blah.*.gz$"
:keep-data-pattern-files-on-import true
:keep-manifest-upon-import false
}
Would be great to get uswitch thoughts on this. -A
Did you implement this? I don't see it in your fork.
Hello, yes I have this implemented and I've used it for at least a year. I will definitely make it available once I get the chance to look into it (it's been a while) :)
update: found the code, will carve out time to create a branch (hopefully today)...
Okay, it's merged to master branch now. Lmk how it goes... ;)
Thank you! Have you looked into a mode that doesn't upsert? If the data doesn't have a primary key, I'd like Blueshift to just do a simple insert.
Wanted to get your thoughts.
In my use case, I actually need two usage modes; one is like yours, which I affectionately term Upsert/Clobber mode and the other I term Upsert/Storage mode. I'm open to other names as well.
The current behavior is the Clobber mode. It watches s3 buckets that match the patterns in the config and bucket manifest, performs the upsert when file changes are found, and then deletes the s3 files leaving the manifest behind untouched.
Storage mode would be designed for s3 buckets that keep semi-permanent data and do something entirely different. Storage mode would also watch s3 buckets that match the patterns in the config and bucket manifest, but then perform the upsert, not delete the s3 data files but delete the s3 bucket's manifest (upon successful data load into redshift) such that the bucket is no longer watched to avoid loading this data again. This is useful for loading data buckets that contain dated subfolders (e.g. 2015-01-13) and files for that date therein. That way the data is not deleted but loaded into redshift, and need not be loaded again.
Other modes could potentially be created if necessary or desired.
I haven't implemented this yet, but am considering it.
My questions are:
Thanks for your thoughts.
-Avram