dw opened 5 years ago
Main thread (Py3): [profile output omitted]

Main thread (Py2): [profile output omitted]

Broker thread (Py3): [profile output omitted]

Broker thread (Py2): [profile output omitted]
Broker thread WTFs:
```
      ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
Py2:    6914    0.249    0.000    0.249    0.000  {posix.read}
Py3:   12186    1.482    0.000    1.482    0.000  {built-in method posix.read}
```
Main thread WTFs:
```
      ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
Py2:    4101    0.015    0.000    0.109    0.000  <stdin>:764(unpickle)
Py3:    4103    0.041    0.000    3.354    0.001  <stdin>:764(unpickle)
```
For the read() mess, the test below shows no obvious difference in perf output or runtime.
[19:58:22 eldil!17 mitogen] cat wtf.py

```python
import mitogen.core
import os

try:
    xrange
except NameError:
    xrange = range

fd = os.open('/dev/zero', os.O_RDONLY)
for x in xrange(200000):
    os.read(fd, mitogen.core.CHUNK_SIZE)
```
Could /dev/zero really be slow enough to hide the problem? I think not. I suspect this may be due to the GIL.

Python 3 slows down a little when an extra thread is present, but only by ~10%. Python 3's base overhead is significantly higher, which becomes obvious when changing the read size to 64/128 bytes, but that is still nowhere near enough to explain the profile output.
I have a feeling that this could be related: https://github.com/paramiko/paramiko/issues/1141
I updated the timings in that issue for the latest paramiko but there is little/no difference. I'd love to know if you guys get to the bottom of what this massive difference in performance is.
Great, thank you! I would like to find out if we have a non-linear performance curve here with mitogen just like Paramiko. I think that would be very telling. If this is only a linear decrease, it seems different (though maybe still partly related).
The Paramiko problem was separate, Mitogen doesn't actually use Paramiko at all.
The problem here is fundamental to a permanent compatibility hack in Python 3.x, that basically renders cPickle completely unsuitable for transporting binary data any more.
The problem is that when communicating using Pickle protocol v2 or older (i.e. with Python 2), Python 3 represents all 8-bit bytestrings in the pickle by transliterating them to Unicode using the LATIN-1 codec, and undoing the result on the remote side. Unicode in turn is represented in pickle as UTF-8, so every chunk of raw data is doubly transcoded (and copied) during serialization and deserialization.
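A minimal demonstration of the double transcoding on Python 3 (an illustration added here, using only the stdlib):

```python
import pickle
import pickletools

# bytes pickled at protocol 2 are LATIN-1 decoded to a str, and that
# str is then stored in the pickle stream as UTF-8.
data = bytes(range(256))
blob = pickle.dumps(data, protocol=2)

pickletools.dis(blob)  # shows a call to _codecs.encode(<str>, 'latin1')
# The 128 bytes 0x80-0xff each cost two UTF-8 bytes in the stream:
print(len(data), len(blob))  # 256 vs roughly 430
```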
Basically this can't be fixed -- the compatibility hack is needed to support yet more compatibility hacks elsewhere in pickle to make it appear like 2.x and 3.x pickles are compatible.
That means every 8-bit bytestring experiences on average 1.5x size expansion on the wire, and that size is variable depending on the actual data bytes being transmitted (thus making it unpredictable, and impossible to correctly align to frame boundaries), and every bytestring must roundtrip first through the UTF-8 decoder then the LATIN-1 encoder every time it is deserialized.
It looks like this ticket killed Mitogen's use of pickle. Now we need to consider replacing it with a custom serializer with much higher priority than before.
i recall that this was the reason why execnet eventually made its own serializer, pickle just turned out to be a liability

personally i would like to see some kind of layering, so each channel/transport endpoint can have its own serializer setup, allowing configuration for raw data, json, pickle and others instead of enforcing a specific serializer
I'm generally anti-modularity unless a use case makes absolute sense. One of the problems discovered here, broken frame alignment, is alone likely responsible for 25%+ of the slowdown. Optimizations like that require precise knowledge of lower layers, which can be hard to maintain through an abstraction. In other words, particularly in this case, I believe it is much better to have a single option that works very well for many cases, than many options catering poorly to every case.
Wow, thank you for digging into this so much. I appreciate it!
> The problem here is fundamental to a permanent compatibility hack in Python 3.x, that basically renders cPickle completely unsuitable for transporting binary data any more.
I stumbled across this too. As a shot in the dark, since Mitogen controls both the pickling and unpickling, the extension registry could be used to implement something without throwing out all of pickle. I need to think about that.
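For reference, the extension registry referred to here is `copyreg`'s (module, name) -> small-integer-code table; a hedged sketch of the mechanism, using a hypothetical `Message` class:

```python
import copyreg
import pickle
import pickletools

class Message(object):  # hypothetical example class
    pass

# Both ends must pre-register the same (module, name) -> code mapping;
# 0x1000 is an arbitrary code chosen for this sketch.
copyreg.add_extension('__main__', 'Message', 0x1000)

blob = pickle.dumps(Message, protocol=2)
pickletools.dis(blob)                 # an EXT2 opcode replaces GLOBAL
assert pickle.loads(blob) is Message  # resolved via the same registry
```

Note this only shortens references to classes and functions; it does not change how bytes payloads are encoded.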
> It looks like this ticket killed Mitogen's use of pickle. Now we need to consider replacing it with a custom serializer with much higher priority than before.
AFAICT the options aren't great. Whatever is chosen needs to be implemented in C, for any hope of speed, and that means binary dependencies. That will be hard enough across 2.7, 3.5, 3.6. Adding 2.6 makes it harder, adding 2.4 or 2.5 verges on impossible.
I'll try an integration of pikl, but currently it will suffer the same problem as discussed here. If I can get the time and perseverance, I think a Rust-based implementation of pikl that implemented protocol 3 (or higher) on all supported Python versions would be a win. I already tried backporting the `_pickle` module from 3.6 and 3.7 to 3.5 or earlier, without success. Alternatively I might be able to get a Cython module, derived from `pickle.py`, within 10x the speed of `_pickle.c`.

Finally, I've been meaning to look at the CBOR format more closely (and the cbor2 package) as a possible basis to work on. However, IIRC it can't round trip tuples in its present form.
In case anyone needs some for comparison, I made pickles with various CPython versions https://github.com/moreati/pickle-fuzz/tree/master/pickles
> the extension registry could be used to implement something without throwing out all of pickle
I don't think this buys us anything.
However, zodbpickle may be good enough, at least as a stopgap. The latest release supports CPython 2.7 & 3.4-3.7, a release from 2015 supports CPython 2.6. Critically it supports Pickle protocol 3 on Python 2.x, and it has a mechanism for pickling Python 2.x `str` as a byte string, rather than a textual string. E.g.
```python
bpython version 0.18 on top of Python 2.7.15+ /home/alex/src/pikl/v27/bin/python2
>>> import zodbpickle, zodbpickle.fastpickle, zodbpickle.pickletools_2
>>> zodbpickle.pickletools_2.dis(b'\x80\x03C\x03abc.')
    0: \x80 PROTO      3
    2: C    SHORT_BINBYTES 'abc'
    7: .    STOP
highest protocol among opcodes = 3
>>> zodbpickle.fastpickle.loads(b'\x80\x03C\x03abc.')
'abc'
>>> zodbpickle.fastpickle.dumps('abc', protocol=3)
'\x80\x03U\x03abcq\x01.'
>>> zodbpickle.fastpickle.dumps(zodbpickle.binary('abc'), protocol=3)
'\x80\x03C\x03abc.'
```
Edit 1: pikl is currently forked from zodbpickle.
Edit 2: zodbpickle 1.0.4 has pre-compiled wheels for CPython {2.7, 3.4, 3.5, 3.6, 3.7} x {i686, x86_64} x {linux, windows}; and macOS build(s) covering the same CPython versions, but not every permutation of processor architecture.
Edit 3: zodbpickle also includes the intimate knowledge about module renames from Python 2.x -> 3.x that pickle has.
I'm having a play with your pickle-fuzz repo at the moment :) Feeling slightly more optimistic after a few hours to think about it: at least in the case of file transfer, it is possible to simply remove the use of pickle -- it is only there to allow the file sender to abort the transfer, which could simply be signalled some other way. That would fix the most egregious inefficiency, but only as a stopgap measure.
Regarding a replacement for pickle, losing e.g. tuples might not be the end of the world, we could simply bump the major version number. It's more important at least for me that whatever serialization used is subjectively "convenient" and does not regularly feature prominently in profile output. Really the bar isn't too high at this point -- we could potentially implement some serialization in bash and it'd still beat what we have on Python 3 right now :smiley:
I haven't profiled yet, but this transcoding madness is possibly a large part of why Mitogen for Ansible has always been slower on 3.x -- we rely on efficient binary serialization for transfer of all kinds of data (like Ansible modules) in the extension.
To be clear, pickle-fuzz is a dumping ground of experiments and half finished prototypes. The output of all that is pikl.
Hey @moreati, I pushed a branch with a first pass at this, if you fancy taking a look. The semantics change likely needs a version number bump, so fixing this could be the first patch to 0.3.x.
Random thoughts:
Although the draft is fairly efficient on CPU, on the wire it's pretty wasteful:
```python
>>> len(mitogen.core.encode((1, 2, 3, 4)))
25
>>> len(pickle.dumps((1, 2, 3, 4), protocol=2))
15
```
It's because all the container types (and string) use fixed 32-bit integers for lengths, and there is nothing smaller than a 32-bit integer. The more specializations that are added, the slower everything will get, but it is tempting (either now or later) to add e.g. smaller length encodings (see the varint sketch below).
That'll likely blow out the bootstrap a little more. I guess the question is use cases. For the Ansible extension, since SSH compression was always used, the fixed integer types will have basically no effect on the wire, but what about other cases?
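For illustration of what a smaller length encoding could look like (a sketch under my own assumptions, not the branch's actual format), a LEB128-style varint makes small container lengths cost one byte instead of a fixed four:

```python
def encode_varint(n):
    # 7 bits per byte; the high bit marks "more bytes follow".
    out = bytearray()
    while True:
        b = n & 0x7f
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(buf, pos=0):
    # Returns (value, new_position) so the caller can keep scanning.
    shift = result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7f) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

assert encode_varint(4) == b'\x04'             # 1 byte, not 4
assert encode_varint(300) == b'\xac\x02'       # 2 bytes
assert decode_varint(b'\xac\x02') == (300, 2)
```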
Looking
FWIW I think this draft is ugly as sin :) It's motivated as usual by file size, but I don't even think that's a pressing concern.. getting serialization right is important enough to spend bytes on. I think the size bar for core.py is probably something like 40kb (hopefully it never happens), then it's time to start really worrying, so there is plenty of room to play with if it produces a nicer solution.
Initial thoughts:

- `_codes.encode`. I'll do a mockup this weekend.
- Bike shed: `dumps`/`loads`, or `serialize`/`deserialize`? I think `encode`/`decode` would cause too much confusion with character encodings such as UTF-8.

+1 on dumps/loads.
I forgot about pencode! It preserves serialization of cyclical structures, handles long ints and is already the right size.
Could we overcome this LATIN1 transcoding problem with pikl? There would need to be some way to ship any solution to 3.x targets since they are the root of the problem, so I think it either needs to be part of core.py, or there must be a better way to ship additional modules alongside core.py during bootstrap.
> I forgot about pencode! It preserves serialization of cyclical structures, handles long ints and is already the right size.
pencode supports Python 2.7 (I mistakenly thought not), adding Python 2.4 support looks doable.
> Could we overcome this LATIN1 transcoding problem with pikl?
I need to investigate further. Until this moment I thought that would require upgrading pikl to pickle protocol 4. Looking again, it might be fine with the existing protocol 3 support. If protocol 4 were necessary it would require a pure Python implementation on Python 2.x, with the associated CPU hit.
Research progress:

- Rewrote `pencode.pdecode()` as `pencode_read.pdecode()`. Reduced from 145 ms to 85 ms.
- Tried the same for `mrna.encode()` (arbitrary name I gave to your format @dw), but only shaved a few milliseconds.
- Tried `marshal.dumps()` (discussion below).
| Benchmark | CPython 2.7 | CPython 3.7 |
|---|---|---|
| cpickle,proto=2,dumps | 21.5 ms +- 0.7 ms | 8.82 ms +- 0.20 ms |
| cpickle,proto=2,loads | 18.1 ms +- 0.5 ms | 8.94 ms +- 0.25 ms |
| cpickle,proto=3,dumps | 7.99 ms +- 0.17 ms | |
| cpickle,proto=3,loads | 8.94 ms +- 0.25 ms | |
| pencode,proto=None,dumps | 50.3 ms +- 1.3 ms | 51.1 ms +- 1.2 ms |
| pencode,proto=None,loads | 142 ms +- 7 ms | 132 ms +- 7 ms |
| pencode_read,proto=None,dumps | 50.2 ms +- 0.8 ms | 51.6 ms +- 0.7 ms |
| pencode_read,proto=None,loads | 78.9 ms +- 1.8 ms | 79.0 ms +- 1.2 ms |
| marshal,proto=1,dumps | 4.69 ms +- 0.05 ms | 4.97 ms +- 0.07 ms |
| marshal,proto=1,loads | 12.7 ms +- 0.3 ms | 6.84 ms +- 0.24 ms |
| mrna,proto=None,dumps | 85.0 ms +- 0.8 ms | 75.2 ms +- 1.3 ms |
| mrna,proto=None,loads | 138 ms +- 3 ms | 99.7 ms +- 1.9 ms |
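The "N ms +- M ms" formatting matches pyperf's output; a hypothetical reconstruction of the harness shape (the payload below is invented for illustration, not the corpus actually benchmarked):

```python
import pickle
import pyperf

# Invented payload; the real benchmark corpus is not shown in the thread.
payload = [{'key': b'x' * 512, 'n': i} for i in range(1000)]
blob = pickle.dumps(payload, protocol=2)

runner = pyperf.Runner()
runner.bench_func('cpickle,proto=2,dumps', pickle.dumps, payload, 2)
runner.bench_func('cpickle,proto=2,loads', pickle.loads, blob)
```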
marshal is very intriguing.

Pros:

- `bytes` clean: binary strings go in and binary strings go out, no LATIN1 transcoding AFAICT

Cons:
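A quick check of the "bytes clean" pro above (an illustration, not from the thread):

```python
import marshal

data = bytes(range(256))
blob = marshal.dumps(data, 1)       # version 1 format, as benchmarked above
assert marshal.loads(blob) == data  # bytes in, bytes out: no transcoding
print(len(blob) - len(data))        # small constant overhead (type + length)
```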
I love the name mrna :)
I'd like to do some archeology and see what kept execnet away from marshal. It's definitely had historical security issues like stack/integer overflows, the question is whether they're relevant here.
If we support talking to Python 2.4, and 2.4 has a marshal module with some kind of RCE in it, it's fine so long as any trusted parent is running a patched marshal, since parent->child RCE is the whole point of the library. But e.g. going host A -> B -> C where bastion B is running an ancient Python, this cannot work where any response from C is interpreted by B (e.g. something like REQUEST_MODULE).
It'd be great to use this opportunity to eliminate any fear surrounding the serializer entirely. I think the strongest contender so far is pencode -- but its support for cyclical structures opens trusted code to algorithmic attacks (infinite recursion or loops), so the icing on the cake (I think) would be pencode minus memoization.
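To illustrate the algorithmic risk (a sketch): memoization is exactly what lets pickle terminate on self-referential input, so a memo-free format must bound recursion instead.

```python
import pickle

evil = []
evil.append(evil)                      # a self-referential list
blob = pickle.dumps(evil, protocol=2)  # succeeds: the memo breaks the cycle

# A memo-free encoder would recurse forever on the same input, so its
# failure mode should be an explicit depth limit. Toy stand-in:
def check_depth(obj, depth=0, max_depth=100):
    if depth > max_depth:
        raise ValueError('structure too deep (possible cycle)')
    if isinstance(obj, (list, tuple)):
        for item in obj:
            check_depth(item, depth + 1, max_depth)

try:
    check_depth(evil)
except ValueError as exc:
    print(exc)  # structure too deep (possible cycle)
```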
An ominous message from the execnet sources (earliest commit in the current Git):
```python
# XXX marshal.dumps doesn't work for exchanging data across Python
# version :-((( There is no sane solution, short of a custom
# pure Python marshaller
```

This may be referring to pre-2.4 Pythons, though.
Very alpha pickle protocol 4 backport https://github.com/moreati/pickle4
> Very alpha pickle protocol 4 backport https://github.com/moreati/pickle4
Fixed up the pickle4 backport enough to benchmark on Python 2.7 and 3.7. It's slow. There's a lot of branching, in pure Python. Learned some ancient Python lore.
Next, I'm going to look at pencode some more.
pencode_read5 is a variant of pencode. It decodes between 2 and 3 times faster than plain pencode. This brings it within 10x of cpickle.
| Benchmark | CPython 2.7 | CPython 3.6 |
|---|---|---|
| cpickle,proto=2,dumps | 21.1 ms +- 0.5 ms | 9.09 ms +- 0.14 ms |
| cpickle,proto=2,loads | 19.0 ms +- 0.9 ms | 9.71 ms +- 0.22 ms |
| pencode,proto=None,dumps | 49.7 ms +- 0.5 ms | 64.0 ms +- 0.7 ms |
| pencode,proto=None,loads | 141 ms +- 7 ms | 194 ms +- 13 ms |
| pencode_read5,proto=None,dumps | 49.0 ms +- 0.8 ms | 64.1 ms +- 0.8 ms |
| pencode_read5,proto=None,loads | 59.9 ms +- 0.9 ms | 78.0 ms +- 8.1 ms |
To do this pencode_read5 removes the dedicated op codes for `None`, `True`, and `False`. The uncompressed wire size becomes slightly larger as a result.
Short to medium term ideas
Recorded using `tests/bench/throughput.py`, either running with `python_path=["perf", "record", "-g", "python"]` / `perf report`, or with `enable_profiling=True` to capture the cProfile output. All methods except `.local()` are commented out.

Some insane string encoding somewhere. Py2: [trace omitted]

Py3: [trace omitted]

Traces are from receiver only.