irmen / pickle

Java and .NET implementation of Python's pickle serialization protocol
MIT License

Should `put_long` use LONG1 encoding for values larger than Integer.MAX_VALUE? #9

Closed JoshRosen closed 2 years ago

JoshRosen commented 2 years ago

Pickler's put_long method currently falls back on the text-based INT encoding if the long value is too large to be represented as a 4-byte signed integer.

Instead, I'm wondering whether it should use the LONG1 encoding and write the value as an 8-byte signed integer. Since this method's parameter is a long, I think all of the values should fit in a LONG1. My understanding is that LONG1 should be more time- and space-efficient for these values. Pyrolite already uses LONG1 encoding when writing BigIntegers.
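
For illustration, here is a minimal Python sketch of what a LONG1 record for a 64-bit value looks like on the wire. The helper names `encode_long1` and `dumps_long1` are hypothetical and are not part of Pyrolite; the sketch assumes a fixed 8-byte payload, as suggested above, whereas Python itself writes the shortest payload that fits.

```python
import pickle

LONG1 = b"\x8a"  # protocol 2 opcode: 1-byte length, then little-endian two's complement bytes

def encode_long1(value: int) -> bytes:
    """Hypothetical helper: encode a 64-bit signed value as a LONG1 record."""
    payload = value.to_bytes(8, byteorder="little", signed=True)
    return LONG1 + bytes([len(payload)]) + payload

def dumps_long1(value: int) -> bytes:
    """Hypothetical helper: wrap the record in a protocol 2 header (PROTO 2) and STOP."""
    return b"\x80\x02" + encode_long1(value) + b"."

data = dumps_long1(9223372036854775807)  # Long.MAX_VALUE
assert pickle.loads(data) == 9223372036854775807
assert pickle.loads(dumps_long1(-9223372036854775808)) == -9223372036854775808
```

Python's unpickler accepts the fixed-width payload even for values that would fit in fewer bytes, so a writer along these lines would not need to trim leading sign bytes.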


If I use Pyrolite to do pickler.dumps(9223372036854775807L) (which is Long.MAX_VALUE), pickletools disassembles the result as:

    0: \x80 PROTO      2
    2: I    INT        9223372036854775807
   23: .    STOP
highest protocol among opcodes = 2

This matches Python 2.7's behavior.

In contrast, Python 3.7 pickles this value using LONG1, which takes roughly half the space (13 bytes instead of 24):

>>> pickletools.dis(pickle.dumps(9223372036854775807, protocol=2))
    0: \x80 PROTO      2
    2: \x8a LONG1      9223372036854775807
   12: .    STOP
highest protocol among opcodes = 2
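
The size difference between the two disassemblies can be checked by hand. The following is a rough sketch that builds both byte strings directly rather than calling Pyrolite, so the variable names are illustrative only:

```python
import pickle

value = 9223372036854775807  # Long.MAX_VALUE

# Text-based INT: opcode 'I', the decimal digits, a newline terminator.
int_pickle = b"\x80\x02" + b"I" + str(value).encode("ascii") + b"\n" + b"."

# Binary LONG1: opcode 0x8a, one length byte, 8 little-endian payload bytes.
long1_pickle = b"\x80\x02" + b"\x8a\x08" + value.to_bytes(8, "little", signed=True) + b"."

# Both forms load back to the same integer.
assert pickle.loads(int_pickle) == value
assert pickle.loads(long1_pickle) == value

print(len(int_pickle), len(long1_pickle))  # prints "24 13"
```

The saving shrinks for shorter numbers, since INT costs one byte per decimal digit while a fixed 8-byte LONG1 always costs 10 bytes for the record.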

irmen commented 2 years ago

Agreed. If Python's protocol 2 pickle uses it, we should too.

irmen commented 2 years ago

This is now available in Pickle release 1.3.