dw opened 5 years ago
Main thread (Py3): [profile output omitted]

Main thread (Py2): [profile output omitted]

Broker thread (Py3): [profile output omitted]

Broker thread (Py2): [profile output omitted]
Broker thread WTFs:
```
      ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
Py2:    6914    0.249    0.000    0.249    0.000  {posix.read}
Py3:   12186    1.482    0.000    1.482    0.000  {built-in method posix.read}
```
Main thread WTFs:
```
      ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
Py2:    4101    0.015    0.000    0.109    0.000  <stdin>:764(unpickle)
Py3:    4103    0.041    0.000    3.354    0.001  <stdin>:764(unpickle)
```
For the read() mess, the test below shows no obvious difference in perf output or runtime.
[19:58:22 eldil!17 mitogen] cat wtf.py

```python
import mitogen.core
import os

try:
    xrange
except NameError:
    xrange = range

fd = os.open('/dev/zero', os.O_RDONLY)
for x in xrange(200000):
    os.read(fd, mitogen.core.CHUNK_SIZE)
```
Could /dev/zero really be slow enough to hide the problem? I think not. I suspect this may be due to the GIL.

Python 3 slows down a little when an extra thread is present, but only by ~10%. Python 3's base overhead is significantly higher, which becomes obvious when changing the read size to 64/128 bytes, but that is still nowhere near enough to explain the profile output.
I have a feeling that this could be related: https://github.com/paramiko/paramiko/issues/1141
I updated the timings in that issue for the latest paramiko but there is little/no difference. I'd love to know if you guys get to the bottom of what this massive difference in performance is.
Great, thank you! I would like to find out if we have a non-linear performance curve here with mitogen just like Paramiko. I think that would be very telling. If this is only a linear decrease, it seems different (though maybe still partly related).
The Paramiko problem was separate, Mitogen doesn't actually use Paramiko at all.
The problem here is fundamental to a permanent compatibility hack in Python 3.x, that basically renders cPickle completely unsuitable for transporting binary data any more.
The problem is that when communicating using Pickle protocol v2 or older (i.e. with Python 2), Python 3 represents all 8-bit bytestrings in the pickle by transliterating them to Unicode using the LATIN-1 codec, and undoing the result on the remote side. Unicode in turn is represented in pickle as UTF-8, so every chunk of raw data is doubly transcoded (and copied) during serialization and deserialization.
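A minimal demonstration of the double transcoding on Python 3 (an illustration added here, using only the stdlib):

```python
import pickle
import pickletools

# bytes pickled at protocol 2 are LATIN-1 decoded to a str, and that
# str is then stored in the pickle stream as UTF-8.
data = bytes(range(256))
blob = pickle.dumps(data, protocol=2)

pickletools.dis(blob)  # shows a call to _codecs.encode(<str>, 'latin1')
# The 128 bytes 0x80-0xff each cost two UTF-8 bytes in the stream:
print(len(data), len(blob))  # 256 vs roughly 430
```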
Basically this can't be fixed -- the compatibility hack is needed to support yet more compatibility hacks elsewhere in pickle to make it appear like 2.x and 3.x pickles are compatible.
That means every 8-bit bytestring experiences on average 1.5x size expansion on the wire, and that size is variable depending on the actual data bytes being transmitted (thus making it unpredictable, and impossible to correctly align to frame boundaries), and every bytestring must roundtrip first through the UTF-8 decoder then the LATIN-1 encoder every time it is deserialized.
It looks like this ticket killed Mitogen's use of pickle. Now we need to consider replacing it with a custom serializer with much higher priority than before.
i recall that this was the reason why execnet eventually made its own serializer, pickle just turned out to be a liability

personally i would like to see some kind of layering, so each channel/transport endpoint can have its own serializer setup, allowing configuration for raw data, json, pickle and others instead of enforcing a specific serializer
I'm generally anti-modularity unless a use case makes absolute sense. One of the problems discovered here, broken frame alignment, is alone likely responsible for 25%+ of the slowdown. Optimizations like that require precise knowledge of lower layers, which can be hard to maintain through an abstraction. In other words, particularly in this case, I believe it is much better to have a single option that works very well for many cases, than many options catering poorly to every case.
Wow, thank you for digging into this so much. I appreciate it!
> The problem here is fundamental to a permanent compatibility hack in Python 3.x, that basically renders cPickle completely unsuitable for transporting binary data any more.
I stumbled across this too. As a shot in the dark, since Mitogen controls both the pickling and unpickling, the extension registry could be used to implement something without throwing out all of pickle. I need to think about that.
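For reference, the extension registry referred to here is `copyreg`'s (module, name) -> small-integer-code table; a hedged sketch of the mechanism, using a hypothetical `Message` class:

```python
import copyreg
import pickle
import pickletools

class Message(object):  # hypothetical example class
    pass

# Both ends must pre-register the same (module, name) -> code mapping;
# 0x1000 is an arbitrary code chosen for this sketch.
copyreg.add_extension('__main__', 'Message', 0x1000)

blob = pickle.dumps(Message, protocol=2)
pickletools.dis(blob)                 # an EXT2 opcode replaces GLOBAL
assert pickle.loads(blob) is Message  # resolved via the same registry
```

Note this only shortens references to classes and functions; it does not change how bytes payloads are encoded.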
> It looks like this ticket killed Mitogen's use of pickle. Now we need to consider replacing it with a custom serializer with much higher priority than before.
AFAICT the options aren't great. Whatever is chosen needs to be implemented in C, for any hope of speed, and that means binary dependencies. That will be hard enough across 2.7, 3.5, 3.6. Adding 2.6 makes it harder, adding 2.4 or 2.5 verges on impossible.
I'll try an integration of pikl, but currently it will suffer the same problem as discussed here. If I can get the time and perseverance, I think a Rust-based implementation of pikl that implemented protocol 3 (or higher) on all supported Python versions would be a win. I already tried backporting the `_pickle` module from 3.6 and 3.7 to 3.5 or earlier, without success. Alternatively I might be able to get a Cython module, derived from `pickle.py`, within 10x the speed of `_pickle.c`.

Finally, I've been meaning to look at the CBOR format more closely (and the cbor2 package) as a possible basis to work on. However, IIRC it can't round trip tuples in its present form.
In case anyone needs some for comparison, I made pickles with various CPython versions https://github.com/moreati/pickle-fuzz/tree/master/pickles
> the extension registry could be used to implement something without throwing out all of pickle
I don't think this buys us anything.
However, zodbpickle may be good enough, at least as a stopgap. The latest release supports CPython 2.7 & 3.4-3.7, a release from 2015 supports CPython 2.6. Critically it supports Pickle protocol 3 on Python 2.x, and it has a mechanism for pickling Python 2.x `str` as a byte string, rather than a textual string. E.g.
```python
bpython version 0.18 on top of Python 2.7.15+ /home/alex/src/pikl/v27/bin/python2
>>> import zodbpickle, zodbpickle.fastpickle, zodbpickle.pickletools_2
>>> zodbpickle.pickletools_2.dis(b'\x80\x03C\x03abc.')
    0: \x80 PROTO      3
    2: C    SHORT_BINBYTES 'abc'
    7: .    STOP
highest protocol among opcodes = 3
>>> zodbpickle.fastpickle.loads(b'\x80\x03C\x03abc.')
'abc'
>>> zodbpickle.fastpickle.dumps('abc', protocol=3)
'\x80\x03U\x03abcq\x01.'
>>> zodbpickle.fastpickle.dumps(zodbpickle.binary('abc'), protocol=3)
'\x80\x03C\x03abc.'
```
Edit 1: pikl is currently forked from zodbpickle.
Edit 2: zodbpickle 1.0.4 has pre-compiled wheels for CPython {2.7, 3.4, 3.5, 3.6, 3.7} x {i686, x86_64} x {linux, windows}; and macOS build(s) covering the same CPython versions, but not every permutation of processor architecture.
Edit 3: zodbpickle also includes the intimate knowledge about module renames from Python 2.x -> 3.x that pickle has.
I'm having a play with your pickle-fuzz repo at the moment :) Feeling slightly more optimistic after a few hours to think about it: at least in the case of file transfer, it is possible to simply remove the use of pickle -- it is only there to allow the file sender to abort the transfer, which could simply be signalled some other way. That would fix the most egregious inefficiency, but only as a stopgap measure.
Regarding a replacement for pickle, losing e.g. tuples might not be the end of the world, we could simply bump the major version number. It's more important at least for me that whatever serialization used is subjectively "convenient" and does not regularly feature prominently in profile output. Really the bar isn't too high at this point -- we could potentially implement some serialization in bash and it'd still beat what we have on Python 3 right now :smiley:
I haven't profiled yet, but this transcoding madness is possibly a large part of why Mitogen for Ansible has always been slower on 3.x -- we rely on efficient binary serialization for transfer of all kinds of data (like Ansible modules) in the extension.
To be clear, pickle-fuzz is a dumping ground of experiments and half finished prototypes. The output of all that is pikl.
Hey @moreati, I pushed a branch with a first pass at this, if you fancy taking a look. The semantics change likely needs a version number bump, so fixing this could be the first patch to 0.3.x.
Random thoughts:
Although the draft is fairly efficient on CPU, on the wire it's pretty wasteful:
```python
>>> len(mitogen.core.encode((1, 2, 3, 4)))
25
>>> len(pickle.dumps((1, 2, 3, 4), protocol=2))
15
```
It's because all the container types (and string) use fixed 32-bit integers for lengths, and there is nothing smaller than a 32-bit integer. The more specializations that are added, the slower everything will get, but it is tempting (either now or later) to add e.g. smaller length encodings (see the varint sketch below).
That'll likely blow out the bootstrap a little more. I guess the question is use cases. For the Ansible extension, since SSH compression was always used, the fixed integer types will have basically no effect on the wire, but what about other cases?
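For illustration of what a smaller length encoding could look like (a sketch under my own assumptions, not the branch's actual format), a LEB128-style varint makes small container lengths cost one byte instead of a fixed four:

```python
def encode_varint(n):
    # 7 bits per byte; the high bit marks "more bytes follow".
    out = bytearray()
    while True:
        b = n & 0x7f
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(buf, pos=0):
    # Returns (value, new_position) so the caller can keep scanning.
    shift = result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7f) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

assert encode_varint(4) == b'\x04'             # 1 byte, not 4
assert encode_varint(300) == b'\xac\x02'       # 2 bytes
assert decode_varint(b'\xac\x02') == (300, 2)
```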
Looking
FWIW I think this draft is ugly as sin :) It's motivated as usual by file size, but I don't even think that's a pressing concern.. getting serialization right is important enough to spend bytes on. I think the size bar for core.py is probably something like 40kb (hopefully it never happens), then it's time to start really worrying, so there is plenty of room to play with if it produces a nicer solution.
Initial thoughts:

- `_codes.encode`. I'll do a mockup this weekend.
- Bike shed: `dumps`/`loads`, or `serialize`/`deserialize`? I think `encode`/`decode` would cause too much confusion with character encodings such as UTF-8.

+1 on dumps/loads.
I forgot about pencode! It preserves serialization of cyclical structures, handles long ints and is already the right size.
Could we overcome this LATIN1 transcoding problem with pikl? There would need to be some way to ship any solution to 3.x targets since they are the root of the problem, so I think it either needs to be part of core.py, or there must be a better way to ship additional modules alongside core.py during bootstrap.
> I forgot about pencode! It preserves serialization of cyclical structures, handles long ints and is already the right size.
pencode supports Python 2.7 (I mistakenly thought not), adding Python 2.4 support looks doable.
> Could we overcome this LATIN1 transcoding problem with pikl?
I need to investigate further. Until this moment I thought that would require upgrading pikl to pickle protocol 4. Looking again, it might be fine with the existing protocol 3 support. If protocol 4 were necessary it would require a pure Python implementation on Python 2.x, with the associated CPU hit.
Research progress:

- Rewrote `pencode.pdecode()` as `pencode_read.pdecode()`. Reduced from 145 ms to 85 ms.
- Tried the same for `mrna.encode()` (arbitrary name I gave to your format @dw), but only shaved a few milliseconds.
- Tried `marshal.dumps()` (discussion below).
| Benchmark | CPython 2.7 | CPython 3.7 |
|---|---|---|
| cpickle,proto=2,dumps | 21.5 ms +- 0.7 ms | 8.82 ms +- 0.20 ms |
| cpickle,proto=2,loads | 18.1 ms +- 0.5 ms | 8.94 ms +- 0.25 ms |
| cpickle,proto=3,dumps | 7.99 ms +- 0.17 ms | |
| cpickle,proto=3,loads | 8.94 ms +- 0.25 ms | |
| pencode,proto=None,dumps | 50.3 ms +- 1.3 ms | 51.1 ms +- 1.2 ms |
| pencode,proto=None,loads | 142 ms +- 7 ms | 132 ms +- 7 ms |
| pencode_read,proto=None,dumps | 50.2 ms +- 0.8 ms | 51.6 ms +- 0.7 ms |
| pencode_read,proto=None,loads | 78.9 ms +- 1.8 ms | 79.0 ms +- 1.2 ms |
| marshal,proto=1,dumps | 4.69 ms +- 0.05 ms | 4.97 ms +- 0.07 ms |
| marshal,proto=1,loads | 12.7 ms +- 0.3 ms | 6.84 ms +- 0.24 ms |
| mrna,proto=None,dumps | 85.0 ms +- 0.8 ms | 75.2 ms +- 1.3 ms |
| mrna,proto=None,loads | 138 ms +- 3 ms | 99.7 ms +- 1.9 ms |
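The "N ms +- M ms" formatting matches pyperf's output; a hypothetical reconstruction of the harness shape (the payload below is invented for illustration, not the corpus actually benchmarked):

```python
import pickle
import pyperf

# Invented payload; the real benchmark corpus is not shown in the thread.
payload = [{'key': b'x' * 512, 'n': i} for i in range(1000)]
blob = pickle.dumps(payload, protocol=2)

runner = pyperf.Runner()
runner.bench_func('cpickle,proto=2,dumps', pickle.dumps, payload, 2)
runner.bench_func('cpickle,proto=2,loads', pickle.loads, blob)
```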
marshal is very intriguing.

Pros:

- `bytes` clean: binary strings go in and binary strings go out, no LATIN1 transcoding AFAICT

Cons:
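A quick check of the "bytes clean" pro above (an illustration, not from the thread):

```python
import marshal

data = bytes(range(256))
blob = marshal.dumps(data, 1)       # version 1 format, as benchmarked above
assert marshal.loads(blob) == data  # bytes in, bytes out: no transcoding
print(len(blob) - len(data))        # small constant overhead (type + length)
```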
I love the name mrna :)
I'd like to do some archeology and see what kept execnet away from marshal. It's definitely had historical security issues like stack/integer overflows, the question is whether they're relevant here.
If we support talking to Python 2.4, and 2.4 has a marshal module with some kind of RCE in it, it's fine so long as any trusted parent is running a patched marshal, since parent->child RCE is the whole point of the library. But e.g. going host A -> B -> C where bastion B is running an ancient Python, this cannot work where any response from C is interpreted by B (e.g. something like REQUEST_MODULE).
It'd be great to use this opportunity to eliminate any fear surrounding the serializer entirely. I think the strongest contender so far is pencode -- but its support for cyclical structures opens trusted code to algorithmic attacks (infinite recursion or loops), so the icing on the cake (I think) would be pencode minus memoization.
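To illustrate the algorithmic risk (a sketch): memoization is exactly what lets pickle terminate on self-referential input, so a memo-free format must bound recursion instead.

```python
import pickle

evil = []
evil.append(evil)                      # a self-referential list
blob = pickle.dumps(evil, protocol=2)  # succeeds: the memo breaks the cycle

# A memo-free encoder would recurse forever on the same input, so its
# failure mode should be an explicit depth limit. Toy stand-in:
def check_depth(obj, depth=0, max_depth=100):
    if depth > max_depth:
        raise ValueError('structure too deep (possible cycle)')
    if isinstance(obj, (list, tuple)):
        for item in obj:
            check_depth(item, depth + 1, max_depth)

try:
    check_depth(evil)
except ValueError as exc:
    print(exc)  # structure too deep (possible cycle)
```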
An ominous message from the execnet sources (earliest commit in the current Git):
```python
# XXX marshal.dumps doesn't work for exchanging data across Python
# version :-((( There is no sane solution, short of a custom
# pure Python marshaller
```

This may be referring to pre-2.4 Pythons, though.
Very alpha pickle protocol 4 backport https://github.com/moreati/pickle4
> Very alpha pickle protocol 4 backport https://github.com/moreati/pickle4
Fixed up the pickle4 backport enough to benchmark on Python 2.7 and 3.7. It's slow. There's a lot of branching, in pure Python. Learned some ancient Python lore.
Next, I'm going to look at pencode some more.
pencode_read5 is a variant of pencode. It decodes between 2 and 3 times faster than plain pencode. This brings it within 10x of cpickle.
| Benchmark | CPython 2.7 | CPython 3.6 |
|---|---|---|
| cpickle,proto=2,dumps | 21.1 ms +- 0.5 ms | 9.09 ms +- 0.14 ms |
| cpickle,proto=2,loads | 19.0 ms +- 0.9 ms | 9.71 ms +- 0.22 ms |
| pencode,proto=None,dumps | 49.7 ms +- 0.5 ms | 64.0 ms +- 0.7 ms |
| pencode,proto=None,loads | 141 ms +- 7 ms | 194 ms +- 13 ms |
| pencode_read5,proto=None,dumps | 49.0 ms +- 0.8 ms | 64.1 ms +- 0.8 ms |
| pencode_read5,proto=None,loads | 59.9 ms +- 0.9 ms | 78.0 ms +- 8.1 ms |
To do this pencode_read5 removes the dedicated op codes for `None`, `True`, and `False`. The uncompressed wire size becomes slightly larger as a result.
Short to medium term ideas
Recorded using `tests/bench/throughput.py`, either running with `python_path=["perf", "record", "-g", "python"]` / `perf report`, or with `enable_profiling=True` to capture the cProfile output. All methods except `.local()` are commented out.

Some insane string encoding somewhere. Py2: [trace omitted]

Py3: [trace omitted]

Traces are from receiver only.