S3 file reader support

jankatins commented 4 years ago

Refactors the file reader and adds an S3 file reader as an alternative to local file reads.

New commands:

data_integration.parallel_tasks.files.ParallelReadS3File: reads in a whole bucket
data_integration.commands.files.ReadS3File: reads a single file from S3
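
For illustration, the two operations these commands are built around (listing a whole bucket and reading a single object) can be sketched as follows. This is only a sketch under assumptions, not the code in this PR: the helper names `iter_bucket_keys` and `read_s3_file` are hypothetical, and `s3_client` is assumed to behave like a boto3 S3 client (for example one created with `boto3.client("s3")`).

```python
from typing import Iterator


def iter_bucket_keys(s3_client, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under `prefix` (hypothetical helper).

    Assumes `s3_client` exposes the boto3 S3 client interface.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is missing from a page when no object matches
        for obj in page.get("Contents", []):
            yield obj["Key"]


def read_s3_file(s3_client, bucket: str, key: str) -> bytes:
    """Return the full content of one object (hypothetical helper)."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()
```

Passing the client in as an argument keeps the sketch independent of how credentials are configured and makes it easy to try out against a stub.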

From initial testing, this is a lot slower than syncing and then reading from a local file system (both iterating over the bucket to get the file list and the individual reads are slower), but syncing the bucket to a local file system also takes time. From my perspective this is only worth it if you have to do a "sync to local" on every run (which we have to do, since we have no volumes in our ETL container :-(); in that case the second run saves time compared to doing a sync plus an incremental read via the file system. That's at least the theory; so far I have only tested locally.
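
The time saved on a second run depends on the read being incremental, i.e. on skipping files that were already processed. A minimal sketch of that selection step, assuming files are identified by key and last-modified timestamp; `select_unprocessed` and its arguments are hypothetical names, not part of this PR:

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def select_unprocessed(
    listing: Iterable[Tuple[str, datetime]],
    processed: Dict[str, datetime],
) -> List[str]:
    """Return keys that are new or changed since they were last processed.

    `listing` holds (key, last_modified) pairs from the bucket listing;
    `processed` maps each key to the timestamp it had when it was last read.
    """
    todo = []
    for key, last_modified in listing:
        seen = processed.get(key)
        if seen is None or last_modified > seen:
            todo.append(key)
    return todo
```

With such a filter the bucket still has to be listed on every run, but only new or changed objects are downloaded and read.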

The single file read will also come in handy as a replacement for Google Sheets imports.

WIP...

martin-loetzsch commented 3 years ago

@jankatins is this running in production?

jankatins commented 3 years ago

@martin-loetzsch Nope. This should also be integrated into https://github.com/mara/mara-storage, where it looks much easier to do.

mara / mara-pipelines

S3 file reader support #32