Open ejguan opened 1 year ago
Also, there is a use case that might affect the performance of S3FileLoader. If I do `tarfile.open(fileobj=s3_stream_returned_from_s3fileloader, mode=m, bufsize=20000000240)`, the speed with mode `r:` is way faster than with mode `r|`.
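
A minimal sketch for reproducing the comparison locally, assuming an in-memory `io.BytesIO` as a stand-in for the stream returned by S3FileLoader. The helper names `build_tar_bytes` and `list_members` are hypothetical, and the default `bufsize` is used. No timings are claimed here; they depend on archive size and machine.

```python
import io
import tarfile
import time


def build_tar_bytes(num_files=50, file_size=1 << 16):
    """Build an uncompressed tar archive in memory (stand-in for the S3 payload)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(num_files):
            payload = bytes([i % 256]) * file_size
            info = tarfile.TarInfo(name=f"file_{i}.bin")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_members(data, mode):
    """Open the archive with the given mode and return (member names, elapsed seconds)."""
    start = time.perf_counter()
    with tarfile.open(fileobj=io.BytesIO(data), mode=mode) as tar:
        names = [member.name for member in tar]
    return names, time.perf_counter() - start


data = build_tar_bytes()
names_random, t_random = list_members(data, "r:")  # random access, may seek past file data
names_stream, t_stream = list_members(data, "r|")  # sequential stream, reads every block
print(f"r: {t_random:.4f}s   r| {t_stream:.4f}s")
```

With a seekable file object, `r:` can skip over member data by seeking, while `r|` has to read through every block in order, which may explain part of the gap.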
cc: @ydaiming to confirm whether the files are dumped into memory rather than streamed from `S3Handler.s3_read`. Also, do you want to see whether there would be a benefit in revamping it into an iostream?
Originally, I was expecting the stream returned from `S3Handler` to be a non-seekable stream. But it turns out that, based on the implementation, the whole archive/file is dumped into memory as a `BytesIO` (I might be wrong about this, so I need someone to validate it in this issue).
In order to make it streaming, we need a way to pybind C++ stream IO to Python, which is non-trivial. See a code example: https://github.com/CadQuery/OCP/blob/master/pystreambuf.h
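
To illustrate what the Python side of a streaming object would need to look like, here is a pure-Python sketch (not the pybind approach). It assumes the bytes arrive as an iterator of chunks. `ChunkStream` and `iter_chunks` are hypothetical names, and the chunks are sliced from an in-memory archive instead of coming from S3.

```python
import io
import tarfile


class ChunkStream(io.RawIOBase):
    """Hypothetical non-seekable stream over an iterator of byte chunks."""

    def __init__(self, chunks):
        super().__init__()
        self._chunks = iter(chunks)
        self._pending = b""

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, buffer):
        # Pull the next non-empty chunk only when the previous one is used up.
        while not self._pending:
            try:
                self._pending = next(self._chunks)
            except StopIteration:
                return 0  # end of stream
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n


def iter_chunks(data, chunk_size=4096):
    """Stand-in for a remote reader: yield the payload piece by piece."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


# Build a small archive in memory to play the role of the remote object.
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:
    for name, payload in [("a.txt", b"hello"), ("b.txt", b"world" * 1000)]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

stream = io.BufferedReader(ChunkStream(iter_chunks(archive.getvalue())))
extracted = {}
# "r|" is required here because the stream cannot seek.
with tarfile.open(fileobj=stream, mode="r|") as tar:
    for member in tar:
        extracted[member.name] = tar.extractfile(member).read()
```

Only one chunk is held in memory at a time, at the cost of being limited to sequential access through `r|`.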
Potentially this change would accelerate data preprocessing. But, it needs to be extensively benchmarked.