bhtucker opened 4 years ago
Makes sense. I'm not 100% sure that "data lake" vs. "object store" is applied consistently. In general, transient files from running the ETL, like the `schemas` and `data` directories, should be in the object store. Unloads should go into a "data lake" where they can be consumed by other systems. One could argue that the extracts should also go into the "data lake".
Summary
Extract for database targets doesn't support the more powerful config rendering that's available for static sources.
In `extract`, the output target directory comes from `relation.data_directory`, whereas for static sources and unloads, the schema-level path template is used.

Details
Both systems get at the same 'universe' of remote data file/directory addresses:
Unload:
Sqoop:
where `data_directory` is:
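The original snippets did not survive here, but the difference can be sketched in spirit: a fixed `data_directory`-style path versus a rendered schema-level template that can substitute config values such as a "today" date. The function names and template syntax below are hypothetical, not the ETL's actual API:

```python
from datetime import date
from string import Template

# Fixed path, in the spirit of relation.data_directory: no config rendering.
def static_data_directory(prefix: str, schema: str, table: str) -> str:
    return f"{prefix}/data/{schema}/{table}/"

# Render-based path, in the spirit of the unload template: config values
# (e.g. a "today" date) are substituted at render time, which is what
# enables the daily-snapshot archiving use case mentioned below.
def rendered_path(template: str, config: dict) -> str:
    return Template(template).substitute(config)

config = {
    "prefix": "s3://example-bucket",
    "today": date(2020, 5, 1).isoformat(),
    "schema": "www",
    "table": "orders",
}
print(static_data_directory(config["prefix"], "www", "orders"))
print(rendered_path("${prefix}/archive/${today}/${schema}/${table}/", config))
```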
The Unload formulation is a bit more powerful. By moving `extract` targets onto the render-based system, the same 'archiving' use case that templating supports in unload (e.g. retaining daily snapshots of relations using today/yesterday config values) could be done directly from upstream DBs at extract time.

I also see that `data_lake` is in the config and seems related, but I didn't quite see how it fits in. Hopefully it could be involved in 'harmonizing' these two systems in a way that allows configuring the storage backend for `extract`/`unload` between e.g. GCS and S3.

Labels
feature, component: extract