Revisiting this issue and adding some details: it's not so much the number of files or the size of the data to read that matters, but how long it takes to slice the data. If slicing takes too long, the workers will time out.
The file_reader waits until all the data is sliced before processing it. The time it takes to split the files depends on the amount of data to split as well as the `size` setting in the opConfig.

Setting `size` to a larger number like 5,000,000 allows the file_reader to chunk the data faster, but may not be ideal for processes downstream. A smaller number like 100,000 takes longer for the file_reader to split the data and increases the likelihood that a timeout will occur, but may be better for downstream processes.

The `size` setting is in bytes, so it's hard to know how that translates into records per data set.
This isn't a bug, but more of a caution not to use small `size` values in the job.
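For a rough sense of scale, here is a small sketch (not code from file-assets; the helper and the byte math are just illustrative, only the 8.5 GB figure comes from this issue) showing how many slices have to be produced up front for a given `size`:

```typescript
// Rough estimate of how many slices the file_reader must produce before any
// worker can start processing, for a given opConfig `size` (in bytes).
// `estimateSliceCount` is a hypothetical helper, not part of file-asset-apis.

function estimateSliceCount(totalBytes: number, sliceSizeBytes: number): number {
    return Math.ceil(totalBytes / sliceSizeBytes);
}

const largestFileBytes = 8.5 * 1024 ** 3; // ~8.5 GB, the largest object in the bucket

// size: 5_000_000 -> ~1,826 slices for that one large file
console.log(estimateSliceCount(largestFileBytes, 5_000_000));

// size: 100_000 -> ~91,269 slices for the same file, roughly 50x more slicing work
console.log(estimateSliceCount(largestFileBytes, 100_000));
```

Since all of that slicing happens before the workers get any data, a smaller `size` multiplies the time spent in the slicer before the first slice is handed out.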
A job reading 2743 objects in an s3 bucket caused a timeout error, causing the workers to stop. The job never failed; the status remained running, but the workers never went active or processed any data.
I think the issue is here: https://github.com/terascope/file-assets/blob/0fa094d0b6cd0e70e12c48860b6a1ef69792dfb8/packages/file-asset-apis/src/s3/s3-slicer.ts#L66
It could be getting hung up slicing all the objects in the bucket and then flattening the array.
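To make the suspicion concrete, the pattern being described looks roughly like the sketch below. This is not the actual s3-slicer code, just an illustration of an eager slice-everything-then-flatten approach; `listObjects` and `sliceObject` are hypothetical stand-ins, not file-asset-apis functions.

```typescript
// Hypothetical illustration of the suspected bottleneck: every object in the
// bucket is sliced up front and flattened into one array before any slice is
// returned, so no worker sees data until the whole bucket has been sliced.

interface Slice { path: string; offset: number; length: number }

declare function listObjects(bucket: string): Promise<{ key: string; size: number }[]>;

function sliceObject(key: string, size: number, sliceSize: number): Slice[] {
    const slices: Slice[] = [];
    for (let offset = 0; offset < size; offset += sliceSize) {
        slices.push({ path: key, offset, length: Math.min(sliceSize, size - offset) });
    }
    return slices;
}

async function sliceBucket(bucket: string, sliceSize: number): Promise<Slice[]> {
    const objects = await listObjects(bucket);
    // All objects (2743 in this case) are sliced here, then the per-object
    // arrays are flattened. With a small `size`, this step alone can outlast
    // the worker timeout.
    return objects.map((obj) => sliceObject(obj.key, obj.size, sliceSize)).flat();
}
```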
The data was ldjson, and I was able to process it in chunks; the largest chunk was about 700 files. The files were all different sizes, the largest being about 8.5 GB and the smallest under 1 KB.