msgpack / msgpack-python

MessagePack serializer implementation for Python
https://msgpack.org/

The Unpacker fails to retrieve and unpack all the data while streaming with big data. #578

Closed: MasahiroYasumoto closed this 8 months ago

MasahiroYasumoto commented 10 months ago

The Unpacker fails to retrieve and unpack all the data while streaming large data (e.g. 10 GiB).

td-client-python uses msgpack-python internally to unpack the receiving data while streaming. https://github.com/treasure-data/td-client-python/blob/1.2.1/tdclient/job_api.py#L220-L244
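
For context, the pattern in question is roughly the following (a simplified sketch, not the actual td-client-python code; res stands for the HTTP response object being streamed):

import msgpack

# The Unpacker is handed the response as a file-like object and pulls
# bytes from it itself while iterating.
unpacker = msgpack.Unpacker(res, raw=False)
for row in unpacker:
    yield row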

When the size of the data is 10 GiB or more, I occasionally run into a problem where the Unpacker fails to retrieve and unpack all of the data while streaming, which results in premature termination without raising an error.

As a workaround, I rewrote the code as follows to first receive all the data, save it to a file, and unpack it from there, which seems to have solved the problem. Thus, I suspect this is a bug in Unpacker's handling of streaming input.

with open("temp.mpack", "wb") as output_file:
    for chunk in res.stream(1024*1024*1024):
        if chunk:
            output_file.write(chunk)

with open("temp.mpack", "rb") as input_file:
    unpacker = msgpack.Unpacker(input_file, raw=False)
    for row in unpacker:
        yield row
methane commented 10 months ago

The fact that the Unpacker can handle the file means the Unpacker can handle more than 10 GiB of data. Without a reproducer, I cannot fix your issue.

Perhaps the res object in your code has some behavior that is not quite file-like (I don't know what self.get() and res are in your code). I recommend using the Unpacker.feed() method; it frees you from "file-like" edge cases.

https://github.com/msgpack/msgpack-python/blob/140864249fd0f67dffaeceeb168ffe9cdf6f1964/msgpack/_unpacker.pyx#L291-L300
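
For example, a minimal sketch of the feed()-based approach (assuming res.stream() yields bytes chunks, as in the workaround snippet above; the chunk size is only illustrative):

import msgpack

unpacker = msgpack.Unpacker(raw=False)
for chunk in res.stream(1024*1024*1024):
    if chunk:
        unpacker.feed(chunk)      # push the raw bytes into the Unpacker explicitly
        for row in unpacker:      # yields each object that is complete so far
            yield row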

MasahiroYasumoto commented 10 months ago

Thank you for your quick response! I'll try Unpacker.feed() and see if it can fix the problem.