Open jimcarreer opened 9 years ago
I think I'm going to take a look at this, this weekend.
An FYI, I've migrated to an essentially stateless Huffman decoding implementation in #35.
So this should be closed?
Nope, we still have a stateful encoder.
I can try to write it in C if that helps.
I would like to avoid us having any native extensions in this module at this time.
Even optionally?
Tentatively, yes, I'm opposed to having an optional native extension.
My thinking on this is a bit complex, so let me lay it out.
Generally speaking I have a policy regarding C code, which is that wherever possible I'd like to avoid having untrusted input reach C code. The risks of running C code on untrusted input are too high, and I find that generally speaking I and most software developers I work with are bad at spotting places where assumptions are made that data will be "reasonable". That makes C a great vector for bugs, and in particular for the kind of bug that does not affect a pure-Python program.
For that reason, if we were going to write a native extension I'd want it to be in Rust, not C. However,
The wheel ecosystem is not quite there yet. While it's now mostly possible to ship binary wheels for most platforms, there are a number of Linux distributions that cannot use manylinux1 wheels. This is a problem if you want to ship a Rust extension, because it's very difficult to indicate the need for a Rust compile chain on a random Linux system.
This problem also affects C extensions, by the way: one of the systems that doesn't support a manylinux1 wheel is Alpine, which uses musl libc. That means that if we shipped a C extension on that platform, we'd require the installation of a C toolchain for users to install the module: not great.
So the only option is, as you said, to make the module optional: we'd ideally want to check whether we could compile it and, if we couldn't, we'd simply fall back to pure-Python code.
I have extremely mixed feelings about this, most of them not good. In particular, it's a bit of a hairy debugging issue: what code the user is executing becomes conditional on difficult-to-introspect features of their platform; we double the number of code paths that need to be tested and debugged; and we double the workload for feature addition and API changes.
The TL;DR here, I think, is that right now I'm about -0.25 on having a native extension. If we could come up with a nice Rust extension then I'd reconsider, but I'd really rather not maintain a C extension if it can possibly be avoided.
In fact, let state an outline of what we'd need to make me happy here.
If any contributor can write a Rust extension that can be optionally compiled if a buildchain is present, that flags its presence or absence clearly in a module-scoped dunder variable, that uses CFFI, and that doesn't vomit a load of output into the terminal on install on systems that do not have a Rust toolchain present, I'd be willing to consider merging such an extension.
It shouldn't be much of a problem with https://github.com/novocaine/rust-python-ext
I've already written a Rust extension module in Python, that's not the hard bit. The harder bits are a) making it fail gracefully, b) keeping it maintained, and c) actually writing the extension. ;)
Right, this might take some effort since I'm not very familiar with Rust's lifelines and borrow semantics. I will give it a shot though.
As discussed with @Lukasa the Huffman encoder / decoder is currently initialized using two static structures defined in huffman_constants.py. We could potentially remove the need for this pseudo-stateful nature and instead have a completely stateless Huffman encoder / decoder with the addition of another static structure related to decoding (encoding is currently trivial to make stateless). This may also offer a minor performance enhancement.