[Experiment] Make S3Handler.s3_read return a stream rather than bytes

ejguan commented 1 year ago

Originally I was expecting the returned stream from S3handler is non-seekable stream. But, it turns out that the whole archive/files will be dumped into memory based on the implementation (I might be wrong about it then I need someone to validate it)

C++ side: https://github.com/pytorch/data/blob/a435c7f5f7543e3614130eb2ecee520396b8efd2/torchdata/csrc/pybind/pybind.cpp#L28
Python side: https://github.com/pytorch/data/blob/a435c7f5f7543e3614130eb2ecee520396b8efd2/torchdata/datapipes/iter/load/s3io.py#L135 And, that is the reason that the performance seems on parity with or without BytesIO in this issue.

In order to make it streaming, we need to have a way to pybind C++ stream IO to python, which is non-trivial. See a code example: https://github.com/CadQuery/OCP/blob/master/pystreambuf.h

Potentially this change would accelerate data preprocessing. But, it needs to be extensively benchmarked.

ejguan commented 1 year ago

And, there is a use case that might affect the performance on S3FileLoader. If I do tarfile.open(fileobj=s3_stream_returned_from_s3fileloader, mode=m, bufsize=20000000240), the speed with mode r: is way faster than the mode r|

ejguan commented 1 year ago

cc: @ydaiming for confirmation about the files are dumped into memory rather than streaming from S3Handler.s3_read. And, do you want to see if there will be benefit to revamp it to an iostream?

pytorch / data

[Experiment] Make S3Handler.s3_read return a stream rather than bytes #800