pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License

[Experiment] Make S3Handler.s3_read return a stream rather than bytes #800

Open ejguan opened 1 year ago

ejguan commented 1 year ago

Originally I expected the stream returned from S3Handler to be a non-seekable stream. But it turns out that, based on the implementation, the whole archive/file is dumped into memory. (I might be wrong about this, so I need someone to validate it.)

To make it truly streaming, we need a way to expose a C++ IO stream to Python through pybind, which is non-trivial. See a code example: https://github.com/CadQuery/OCP/blob/master/pystreambuf.h

This change could potentially accelerate data preprocessing, but it needs to be benchmarked extensively.
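
To make the bytes-versus-stream distinction concrete, here is a minimal pure-Python sketch of a non-seekable stream that pulls data chunk by chunk instead of holding the whole object in memory. It is not the S3Handler API: `ChunkedReader`, `fake_fetch`, and the tiny chunk size are hypothetical names and values chosen for illustration, and `fake_fetch` only stands in for whatever ranged read the C++ side would perform.

```python
import io


class ChunkedReader(io.RawIOBase):
    """Hypothetical non-seekable stream that fetches data one chunk at a time."""

    def __init__(self, fetch_chunk, total_size, chunk_size=4):
        self._fetch_chunk = fetch_chunk  # callable(offset, length) -> bytes (assumed)
        self._total_size = total_size
        self._chunk_size = chunk_size
        self._offset = 0
        self._pending = b""

    def readable(self):
        return True

    def seekable(self):
        # Consumers such as tarfile must treat this as a forward-only stream.
        return False

    def readinto(self, buffer):
        # Fetch the next chunk only when the previous one is fully consumed.
        if not self._pending and self._offset < self._total_size:
            length = min(self._chunk_size, self._total_size - self._offset)
            self._pending = self._fetch_chunk(self._offset, length)
            self._offset += len(self._pending)
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n  # 0 signals end of stream


DATA = b"hello streaming world"
calls = []


def fake_fetch(offset, length):
    # Stand-in for a ranged read against remote storage.
    calls.append((offset, length))
    return DATA[offset:offset + length]


stream = io.BufferedReader(ChunkedReader(fake_fetch, len(DATA)), buffer_size=8)
first = stream.read(5)
rest = stream.read()
```

At any moment the reader holds at most one chunk, so memory use is bounded by the chunk size rather than by the size of the archive. Whether this is faster in practice is exactly what would need benchmarking.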

ejguan commented 1 year ago

There is also a use case that might affect the performance of S3FileLoader. If I call tarfile.open(fileobj=s3_stream_returned_from_s3fileloader, mode=m, bufsize=20000000240), mode r: is much faster than mode r|
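
For anyone who wants to reproduce the comparison locally without S3, here is a rough sketch that opens the same in-memory archive in both modes. The archive contents, file names, and sizes are made up for illustration. In mode r: tarfile requires a seekable file object and seeks past member data, while in mode r| it treats the file object as a forward-only stream and reads it block by block. The timings from this toy example are not a benchmark and are not claimed to match the S3 results.

```python
import io
import tarfile
import time


def build_archive(num_files=50, size=1024):
    """Create an uncompressed tar archive in memory (illustrative data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(num_files):
            payload = bytes([i % 256]) * size
            info = tarfile.TarInfo(name=f"file_{i}.bin")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_members(archive_bytes, mode):
    """Open the archive with the given mode and return member names and elapsed time."""
    fileobj = io.BytesIO(archive_bytes)
    start = time.perf_counter()
    with tarfile.open(fileobj=fileobj, mode=mode) as tar:
        names = [member.name for member in tar]
    return names, time.perf_counter() - start


archive = build_archive()
names_random, t_random = list_members(archive, "r:")  # seekable, random access
names_stream, t_stream = list_members(archive, "r|")  # sequential stream of blocks
print(f"r: took {t_random:.6f}s, r| took {t_stream:.6f}s")
```

Both modes should yield the same members; only the access pattern on the underlying file object differs.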

ejguan commented 1 year ago

cc: @ydaiming to confirm whether the files are dumped into memory rather than streamed from S3Handler.s3_read. Also, do you want to check whether there would be any benefit to revamping it into an iostream?