Refactors the file reader and adds an S3 file reader as an alternative to local file reads.
New commands:
- `data_integration.parallel_tasks.files.ParallelReadS3File`: reads a whole bucket
- `data_integration.commands.files.ReadS3File`: reads a single file from S3
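
A minimal usage sketch of how the new commands could be wired into a pipeline. The constructor parameters shown here (`bucket_name`, `key`, `target_table`) are assumptions for illustration and may not match the final API of this PR:

```python
# Hypothetical usage sketch — parameter names are assumptions, not the final API.
from data_integration.pipelines import Pipeline, Task
from data_integration.commands import files

pipeline = Pipeline(
    id='s3_import',
    description='Imports files from an S3 bucket')

pipeline.add(
    Task(id='read_single_file',
         description='Reads one file from S3 into a table',
         commands=[files.ReadS3File(  # assumed parameters
             bucket_name='my-bucket',
             key='exports/orders.csv',
             target_table='raw.orders')]))
```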
From initial testing, this is a lot slower than syncing the bucket and reading from the local file system (both iterating over the bucket to get the file list and the individual reads are slow). On the other hand, syncing the bucket to a local file system also takes time. From my perspective this is only worth it if you have to do a "sync to local" on every run (which we have to do, since there are no volumes in our ETL container :-(); the second run then saves time compared to doing a sync plus an incremental read via the file system. That's the theory at least; so far I have only tested locally.
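
For context on why reading a whole bucket is slow: with boto3 (assuming that is what the implementation uses underneath), it boils down to one paginated LIST request per 1000 keys plus one GET request per object, each with network round-trip latency that a local file system read doesn't have. A sketch:

```python
# Sketch of what a whole-bucket read amounts to at the S3 API level:
# paginated LIST calls to enumerate keys, then one GET per object.
import boto3

s3 = boto3.client('s3')

def iter_bucket_contents(bucket: str, prefix: str = ''):
    """Yields (key, decoded content) for each object under `prefix`."""
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            body = s3.get_object(Bucket=bucket, Key=obj['Key'])['Body']
            yield obj['Key'], body.read().decode('utf-8')
```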
The single-file read will also come in handy as a replacement for Google Sheets imports.
WIP...