terascope / file-assets

Teraslice processors for working with data stored in files on disk, S3 or HDFS.
MIT License

s3file reader cannot read a large number of s3 objects #740

Closed ciorg closed 1 year ago

ciorg commented 2 years ago

A job reading 2743 objects in an s3 bucket caused a timeout error:

timeout when connecting to ExecutionController at http://10.36.24.137:45680

This caused the workers to stop. The job never failed; the status remained running, but the workers never went active or processed any data.

I think the issue is here: https://github.com/terascope/file-assets/blob/0fa094d0b6cd0e70e12c48860b6a1ef69792dfb8/packages/file-asset-apis/src/s3/s3-slicer.ts#L66

The slicer could be getting hung up slicing all the objects in the bucket and then flattening the array.

The data was ldjson, and I was able to process it in chunks; the largest chunk was about 700 files. The files were all different sizes, the largest being about 8.5 GB and the smallest under 1 KB.
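
To illustrate the suspected behavior, here is a minimal TypeScript sketch of an eager slice-then-flatten pattern; the types and function names below are made up for illustration and are not the actual s3-slicer implementation:

```typescript
// Illustrative stand-ins; not the real s3-slicer.ts types.
interface S3ObjectMeta {
    key: string;
    size: number; // object size in bytes
}

interface FileSlice {
    key: string;
    offset: number;
    length: number;
}

// Split a single object into byte-range slices of at most `sliceSize` bytes.
function sliceObject(obj: S3ObjectMeta, sliceSize: number): FileSlice[] {
    const slices: FileSlice[] = [];
    for (let offset = 0; offset < obj.size; offset += sliceSize) {
        slices.push({
            key: obj.key,
            offset,
            length: Math.min(sliceSize, obj.size - offset),
        });
    }
    return slices;
}

// Every object is sliced up front and the results are flattened into one array
// before any slice is handed to a worker. With thousands of objects, or a small
// slice size, this loop can run longer than the worker connection timeout.
function sliceAllObjects(objects: S3ObjectMeta[], sliceSize: number): FileSlice[] {
    return objects.map((obj) => sliceObject(obj, sliceSize)).flat();
}
```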

ciorg commented 1 year ago

Revisiting this issue and adding some details: it's not so much the number of files or the size of the data to read that matters, but how long it takes to slice the data. If slicing takes too long, the workers will time out.

The file_reader waits until all the data is sliced before processing it. The time it takes to split the files is based on the amount of data to split as well as the size setting in the opConfig.
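
For context, here is a hypothetical reader opConfig showing where the size setting lives. The path and the exact field set are assumptions; only size (in bytes) and the ldjson format come from this discussion:

```typescript
// Hypothetical opConfig for the reader in a Teraslice job; the path is made up
// and field names may differ by version. `size` is in bytes.
const readerOpConfig = {
    _op: 's3_reader',
    path: 'my-bucket/some-prefix/', // assumption: example bucket/prefix
    format: 'ldjson',
    // Larger size => fewer slices => faster slicing, but bigger chunks downstream.
    // Smaller size => more slices => slower slicing and a higher chance of the
    // worker timeout described above.
    size: 5_000_000,
};
```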

Setting config.size to a larger number like 5,000,000 allows the file_reader to chunk the data faster, but may not be ideal for other processes downstream. A smaller number like 100,000 takes the file_reader longer to split the data and increases the likelihood that a timeout will occur, but may be better for downstream processes.

The size setting is in bytes, so it's hard to know how that translates into records per data set.
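
To make the tradeoff concrete, a quick back-of-the-envelope calculation, assuming each slice is roughly size bytes, for the 8.5 GB file mentioned earlier:

```typescript
// Approximate slice counts for a single ~8.5 GB file at two size settings.
const fileBytes = 8.5 * 1024 ** 3; // ~9,126,805,504 bytes

for (const size of [100_000, 5_000_000]) {
    const slices = Math.ceil(fileBytes / size);
    console.log(`size=${size} bytes -> ~${slices} slices for this one file`);
}
// size=100000  -> ~91269 slices
// size=5000000 -> ~1826 slices
// How many records land in each slice depends on the average record size,
// which is why a byte-based size setting is hard to translate into records.
```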

ciorg commented 1 year ago

This isn't a bug, but more of a caution against using small size values in the job.