mara / mara-pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
MIT License
2.07k stars 100 forks

S3 file reader support #32

Open jankatins opened 4 years ago

jankatins commented 4 years ago

Refactors the file reader and adds an S3 file reader as an alternative to local file reads.

New commands:

From initial testing, this is a lot slower than syncing and then reading from a local file system (both iterating over the bucket to get the file list and the individual reads...), but then syncing that bucket to a local file system also takes time... From my perspective, this is only worth it if you have to do a "sync to local" on every run (which we have to do, as there are no volumes in our ETL container :-(), so that the second run saves time compared to doing a sync plus an incremental read via the file system. That's the theory at least; so far I have only tested locally.
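To make the "incremental read" idea concrete, here is a minimal sketch of the selection step, assuming the bucket listing is available as `(key, last_modified)` pairs and that the pipeline keeps a record of when each file was last processed. The function name `select_new_files`, the example keys, and the `processed` mapping are hypothetical and not taken from this PR; with boto3, the listing would typically come from the `list_objects_v2` paginator, which is left out here so the sketch runs without network access.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple


def select_new_files(
    listing: Iterable[Tuple[str, datetime]],
    processed: Dict[str, datetime],
) -> List[str]:
    """Return keys that are new, or modified since they were last processed."""
    new_keys = []
    for key, last_modified in listing:
        seen_at = processed.get(key)
        # Read the file if it was never seen or changed after the last read.
        if seen_at is None or last_modified > seen_at:
            new_keys.append(key)
    return sorted(new_keys)


# Hypothetical bucket listing; in practice this would come from the S3 API.
listing = [
    ("events/2020-01-01.csv", datetime(2020, 1, 1, tzinfo=timezone.utc)),
    ("events/2020-01-02.csv", datetime(2020, 1, 3, tzinfo=timezone.utc)),
    ("events/2020-01-03.csv", datetime(2020, 1, 3, tzinfo=timezone.utc)),
]

# Hypothetical record of what earlier runs already read, and when.
processed = {
    "events/2020-01-01.csv": datetime(2020, 1, 1, tzinfo=timezone.utc),
    "events/2020-01-02.csv": datetime(2020, 1, 2, tzinfo=timezone.utc),
}

to_read = select_new_files(listing, processed)
```

In this sketch only the listing has to be fetched on every run; unchanged files are skipped, which is where the second run would save time compared to a full sync.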

The single file read will also come in handy as a replacement for Google Sheets imports.

WIP...

martin-loetzsch commented 3 years ago

@jankatins is this running in production?

jankatins commented 3 years ago

@martin-loetzsch Nope. It should also be integrated into https://github.com/mara/mara-storage, where this looks much easier to do.