belltailjp opened 2 years ago
It seems that this Python bug is deeply related; it is apparently dealt with in Python 3.10+: https://bugs.python.org/issue42853 https://stackoverflow.com/questions/70905872

For pfio, since we cannot drop support for Python 3.8 right now, I guess we need some workaround to prevent attempting to read the whole content at once even when `_ObjectReader.read(-1)` or `_ObjectReader.readall()` is called.
The naive approach would be to modify `_ObjectReader.read` to split the `get_object` API call when necessary, though that sounds like re-implementing a kind of buffering, which duplicates what `BufferedReader` already does (#247).
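To make the idea concrete, here is a minimal sketch of what "splitting the `get_object` call" could look like. This is not pfio code: `fetch_range` is a hypothetical callback standing in for a ranged `get_object` request, and the 1 GiB chunk size is an assumption chosen to stay safely below the 2 GiB limit.

```python
# Hypothetical sketch: read `size` bytes via ranged requests so that no
# single underlying read exceeds the 2 GiB signed-integer limit.
_CHUNK = 1 << 30  # 1 GiB per request (assumed bound, not pfio's value)

def read_all_chunked(fetch_range, size, chunk=_CHUNK):
    """Read `size` bytes by issuing requests of at most `chunk` bytes.

    `fetch_range(offset, length)` is a placeholder for a ranged
    `get_object` call returning `bytes`.
    """
    parts = []
    offset = 0
    while offset < size:
        n = min(chunk, size - offset)
        parts.append(fetch_range(offset, n))
        offset += n
    return b"".join(parts)
```

The drawback noted above applies: this loop is essentially hand-rolled buffering, duplicating what `BufferedReader` is for.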
I wonder if there is a way to somehow force `BufferedReader` to do buffering when `read(-1)` is called, although currently it directly calls `_ObjectReader.readall`; c.f. https://github.com/python/cpython/blob/v3.11.0a5/Lib/_pyio.py#L1096
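One conceivable shape for that is a `BufferedReader` subclass whose `read(-1)` loops over bounded reads instead of delegating to the raw object's `readall()`. A minimal sketch, assuming a hypothetical `ChunkedReader` class and an assumed 1 GiB per-read bound (neither exists in pfio):

```python
import io

class ChunkedReader(io.BufferedReader):
    """Sketch: make read(-1) loop over bounded reads so the raw reader's
    readall() (and thus a single >2 GiB SSL read) is never triggered."""

    _MAX = 1 << 30  # assumed 1 GiB upper bound per underlying read

    def read(self, size=-1):
        if size is not None and size >= 0:
            return super().read(size)
        # read(-1): drain the stream in bounded chunks instead of
        # delegating to raw.readall()
        parts = []
        while True:
            chunk = super().read(self._MAX)
            if not chunk:
                break
            parts.append(chunk)
        return b"".join(parts)
```

Whether this can be wired into pfio cleanly depends on where the `BufferedReader` wrapper is constructed.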
In that case, we also need to consider `"rt"` mode, which uses `TextIOWrapper` instead of `BufferedReader`.
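To illustrate why `"rt"` needs separate handling, the two wrapper stacks can be mimicked with the standard `io` classes. This is only illustrative (`io.BytesIO` stands in for the raw `_ObjectReader`; pfio's actual wiring may differ):

```python
import io

# "rb": raw reader wrapped by BufferedReader
rb = io.BufferedReader(io.BytesIO(b"hello\nworld\n"))

# "rt": raw reader wrapped by TextIOWrapper instead, so a fix placed
# only in the BufferedReader path would not cover text-mode reads
rt = io.TextIOWrapper(io.BytesIO(b"hello\nworld\n"), encoding="utf-8")

assert rb.read() == b"hello\nworld\n"
assert rt.read() == "hello\nworld\n"
```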
In addition, I guess it would also be preferable to prevent this issue without the buffering wrapper (`buffering=0`).

Note: the reported issue reproduces regardless of the `buffering` option and of `"rb"`/`"rt"` mode.
Strictly speaking, bpo-42853 was fixed in Python 3.9.7 (release note). I knew about this issue in January, but I didn't report it here, sorry! At that time I thought reading a fairly large file (>2 GB) at once was a rare enough use case that it wasn't worth implementing a workaround. Regarding what you reported here, did you hit this issue in an actual application?
Python 3.8 EoL is scheduled for 2024-10. That is more than two years from today, and 3.8 is in security-fix-only maintenance. bpo-42853 isn't a vulnerability, so it won't be fixed in the 3.8 branch. Hmmm....
We just observed an internal use case where loading a large pickle file fails like this:

```python
import pickle
from pfio.v2 import open_url

with open_url("s3://very/large/file.pickle", "rb") as fp:
    pickle.load(fp)  # Gets the exception
```
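As a stopgap until Python 3.10 can be required, one conceivable workaround for this use case is a small file-object wrapper that splits any oversized `read(n)` into several smaller reads before handing the data to `pickle.load`. This is a hypothetical sketch, not pfio code; the `SplitReads` name and the 1 GiB bound are assumptions:

```python
class SplitReads:
    """Sketch: wrap a binary file object so that no single underlying
    read exceeds _MAX bytes. pickle.load only needs read() and
    readline(), so this minimal interface suffices."""

    _MAX = 1 << 30  # assumed 1 GiB bound, safely below the 2 GiB limit

    def __init__(self, fp):
        self._fp = fp

    def read(self, n=-1):
        parts = []
        if n is None or n < 0:
            # read-to-EOF: drain in bounded chunks
            while True:
                chunk = self._fp.read(self._MAX)
                if not chunk:
                    break
                parts.append(chunk)
            return b"".join(parts)
        remaining = n
        while remaining > 0:
            chunk = self._fp.read(min(self._MAX, remaining))
            if not chunk:
                break
            parts.append(chunk)
            remaining -= len(chunk)
        return b"".join(parts)

    def readline(self, *args):
        return self._fp.readline(*args)
```

Usage would then be `pickle.load(SplitReads(fp))` in the snippet above.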
Update: even after 3.9.7, this issue reproduced when loading large pickled ndarray files, possibly because the pickle binary protocol forces a single read of more than 2 GB from SSL for a large array. This is fixed in Python 3.10, which uses `SSL_read_ex()`:

> Python 3.10 will use SSL_write_ex() and SSL_read_ex(), which support > 2 GB data.

So the complete resolution for this issue is to use Python 3.10. ¯\_(ツ)_/¯
I found out that reading the entire content of a file of 2+α GiB from S3 fails with an
`OverflowError: signed integer is greater than maximum`
exception raised from the Python SSL library. Here is the minimum reproduction.
Reading the error message, I assumed that reading a file of 2^31 bytes would be fine and 2^31+1 bytes would fail, but it seems to be slightly different; the threshold is somewhere between 2147490816 (2^31+7K) and 2147491840 (2^31+8K).
I think the S3 API itself supports reading such a large file; the issue is in the Python SSL library layer (if so, it may be better to try Python 3.9 and 3.10 as well).
Here is my environment: