google / magika

Detect file content types with deep learning
https://google.github.io/magika/
Apache License 2.0
7.79k stars 412 forks

Test python manylinux package #752

Open reyammer opened 4 days ago

reyammer commented 4 days ago

Creating a proper manylinux Python package is challenging for us because the Rust magika depends on ort, and ort's pre-built binaries require GLIBC >= 2.31 while the latest manylinux is 2_28. See https://github.com/google/magika/pull/747.

For now, we implemented a hack (see https://github.com/google/magika/pull/748) that should work well enough: we generate a generic "linux" Python package (with an old Ubuntu GitHub runner), and then we patch it so that it looks like a manylinux package. This should work because the Rust magika binary only depends on a few ubiquitous libraries, so we don't expect problems in most cases.
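To illustrate the renaming part of such a patch, here is a minimal sketch that swaps the platform tag in a wheel filename. It is not the code from the PR above; the function name and the example filename are hypothetical. A real retag also has to update the `Tag:` entry in the wheel's `WHEEL` metadata file and the `RECORD` hashes, which this sketch does not do.

```python
def retag_wheel_filename(filename: str, new_platform: str) -> str:
    """Return the wheel filename with its platform tag replaced.

    Wheel filenames follow name-version[-build]-python-abi-platform.whl,
    so the platform tag is always the last dash-separated field.
    """
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a wheel filename: {filename}")
    parts = filename[: -len(suffix)].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    parts[-1] = new_platform
    return "-".join(parts) + suffix


# Hypothetical example filename, not an actual magika release artifact.
print(retag_wheel_filename(
    "example-1.0.0-py3-none-linux_x86_64.whl",
    "manylinux_2_28_x86_64",
))
```

Renaming alone only changes what installers believe about the wheel; it does not make the binary inside compatible with older systems, which is why the extra testing mentioned below matters.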

That being said, more tests are needed to check how frequently this does not work.

reyammer commented 4 days ago

This approach seems to break the build on Ubuntu 20.04: https://github.com/google/magika/actions/runs/11274566487/job/31353998130#step:5:1898

ia0 commented 4 days ago

The actual requirement seems to be glibc >= 2.35 (ubuntu 22.04), see https://github.com/pykeio/ort/pull/293.
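As a quick way to see where a given machine stands against that requirement, here is a small sketch that compares the host's glibc version with a required minimum. The helper names are made up for illustration; `platform.libc_ver()` is from the Python standard library and may return empty strings on non-glibc systems, which the sketch treats as "not satisfied".

```python
import platform
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.35' into (2, 35) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def glibc_at_least(required: str, detected: Optional[str] = None) -> bool:
    """Return True if the detected glibc version is >= the required one.

    If no version is passed in, ask the running interpreter.
    """
    if detected is None:
        lib, detected = platform.libc_ver()
        if lib != "glibc" or not detected:
            return False
    return parse_version(detected) >= parse_version(required)


# Ubuntu 20.04 ships glibc 2.31, Ubuntu 22.04 ships glibc 2.35.
print(glibc_at_least("2.35", detected="2.31"))
print(glibc_at_least("2.35", detected="2.35"))
print("this host satisfies 2.35:", glibc_at_least("2.35"))
```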

reyammer commented 3 days ago

Can we link against an old version of ort crate so that we don't need such a high version of glibc?

ia0 commented 3 days ago

> Can we link against an old version of ort crate so that we don't need such a high version of glibc?

That could be an option, but that would mean:

reyammer commented 23 hours ago

Ok, good points, linking against an old crate could result in even more headaches. Let's put this plan aside for the moment.

Thought about two other things:

Thoughts? /cc @invernizzi

invernizzi commented 23 hours ago

I think we can follow the proper route of compiling rust binaries that are manylinux compatible, likely using https://github.com/rust-cross/manylinux-cross

This blog post can be our starting point: https://medium.com/@urschrei/building-manylinux-compatible-rust-binaries-for-use-in-python-wheels-d5d943619af2 (it's a bit old, so likely needs some updates)

I'll try to set up a minimal project to do that, and we can see what's possible

invernizzi commented 23 hours ago

Also, https://github.com/pypa/auditwheel seems very useful

reyammer commented 22 hours ago

@invernizzi my understanding from @ia0 is that we can't do that due to problems with the ort crate, which we depend on. As far as I understand, the ort crate depends on a version of glibc that is too high for the currently available manylinux options.

BTW, as of now I don't think we have any way to have a magika Rust binary that works on Ubuntu 20.04 (which is still supported until next year), so this seems to be a bigger problem than mere Python packaging.

invernizzi commented 21 hours ago

Understood.

I put together a repo anyway to test this, as it can be useful once manylinux is compatible. However, I haven't run into the same issue. Maybe I'm missing something.

Here it is. It depends on the same ort version (ort-sys v2.0.0-rc.6), though it doesn't actually use it for anything.

The guide on how to run it is on that repo. Salient bits:

Downloaded ort-sys v2.0.0-rc.6
[...]
🍹 Building a mixed python/rust project
🔗 Found bin bindings
📡 Using build options bindings from pyproject.toml
🎯 Found 1 Cargo targets in `Cargo.toml`: hello
    Finished `release` profile [optimized] target(s) in 0.09s
📦 Built wheel to /rust/hello/target/wheels/hello-0.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing python wheel
Processing ./rust/hello/target/wheels/hello-0.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing collected packages: hello
Successfully installed hello-0.1.0
Testing Python import
Hello from hello!  <- the python code works
Auditing the wheel

hello-0.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl is
consistent with the following platform tag: "manylinux_2_17_x86_64".

The wheel references external versioned symbols in these
system-provided shared libraries: libgcc_s.so.1 with versions
{'GCC_3.3', 'GCC_4.2.0', 'GCC_3.0'}, libpthread.so.0 with versions
{'GLIBC_2.2.5'}, libc.so.6 with versions {'GLIBC_2.2.5',
'GLIBC_2.3.4', 'GLIBC_2.16', 'GLIBC_2.14', 'GLIBC_2.3'}

This constrains the platform tag to "manylinux_2_17_x86_64". In order
to achieve a more compatible tag, you would need to recompile a new
wheel from source on a system with earlier versions of these
libraries, such as a recent manylinux image.
Running the executable
Hello, world!  <- the rust binary works.
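The tag in the auditwheel report follows from the highest versioned glibc symbol the binary references. The sketch below redoes that reasoning on the symbol set from the log; it is not auditwheel's actual implementation, the function names are made up, and the policy list is only a subset of the manylinux glibc baselines (manylinux1, manylinux2010, manylinux2014, manylinux_2_24, manylinux_2_28).

```python
import re
from typing import Iterable, Optional, Tuple

# glibc baselines of a subset of manylinux policies, oldest first.
MANYLINUX_GLIBC_BASELINES = [(2, 5), (2, 12), (2, 17), (2, 24), (2, 28)]


def max_glibc_symbol(symbols: Iterable[str]) -> Tuple[int, ...]:
    """Return the highest GLIBC_x.y[.z] version among the given symbols.

    Versions are compared as integer tuples; comparing the raw strings
    would wrongly rank 'GLIBC_2.3.4' above 'GLIBC_2.16'.
    """
    versions = []
    for symbol in symbols:
        match = re.fullmatch(r"GLIBC_(\d+(?:\.\d+)*)", symbol)
        if match:
            versions.append(tuple(int(p) for p in match.group(1).split(".")))
    if not versions:
        raise ValueError("no GLIBC symbol versions found")
    return max(versions)


def oldest_compatible_tag(required: Tuple[int, ...]) -> Optional[str]:
    """Return the oldest listed manylinux tag whose glibc covers `required`."""
    for baseline in MANYLINUX_GLIBC_BASELINES:
        if required <= baseline:
            return f"manylinux_{baseline[0]}_{baseline[1]}"
    return None  # newer than every policy in the list


symbols = {"GLIBC_2.2.5", "GLIBC_2.3.4", "GLIBC_2.16", "GLIBC_2.14", "GLIBC_2.3"}
needed = max_glibc_symbol(symbols)
print(needed, oldest_compatible_tag(needed))
```

With the symbols from the log, the highest version is 2.16, and the oldest listed baseline that covers it is 2.17, which matches the reported "manylinux_2_17_x86_64" tag. A binary needing glibc 2.35 would fall outside every policy in this list.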

After that, it can be installed on Ubuntu 20.04 (see the README.md in that repo). Salient bits:

Processing /workspace/rust/hello/target/wheels/hello-0.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing collected packages: hello
Successfully installed hello-0.1.0
Testing Python import
Hello from hello!
Running the executable
Hello, world!
Dropping into an interpreter
root@2913ecd93d74:/#

It seems it's working out fine

invernizzi commented 21 hours ago

The issue might only present itself when ort is actually used - I haven't tested that. I'll leave it to the great @ia0 to tell me what it is I'm not seeing here

ia0 commented 21 hours ago

The difference is probably that you need to add the download-binaries feature of ort to reproduce the problem. Compiling ONNX Runtime from source is the proper solution. So the problem is not the ort crate itself, just the prebuilt binaries it provides for convenience.

reyammer commented 21 hours ago

@invernizzi nice!

> So the problem is not the ort crate itself, just the prebuilt binaries it provides for convenience.

@ia0 I see... but is there any big challenge in compiling from source then? And it makes sense that the issue is with built-in binaries... I would expect it to be relatively rare to truly use glibc features that are so cutting edge...

ia0 commented 21 hours ago

I don't know, it just needs to be done, and confirm that there are no other problems on that new step.

reyammer commented 21 hours ago

> The issue might only present itself when ort is actually used - I haven't tested that. I'll leave it to the great @ia0 to tell me what it is I'm not seeing here

@invernizzi, interesting; this looks very promising. For context, we detected the problems when we tried to create the Python package on an Ubuntu 20.04 machine: something related to ort complains about "undefined reference to symbol XYZ in glibc"... which makes sense, as it seems some prebuilt binaries rely on a newer version of glibc. The difference from your setup seems to be: you built the package on a recent Ubuntu, and then you tried to install and run it on a 20.04 machine... which could work for us! The concern I have is: are we sure it would actually work? It seems that for now we are relying on prebuilt binaries... that apparently rely on symbols not present in old glibc... can it be that we don't see problems because this setup relies on lazy loading of the dynamically linked dependencies? Maybe we could try to run the binary with LD_BIND_NOW=1 to disable lazy loading (https://man7.org/linux/man-pages/man8/ld.so.8.html)...
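A minimal sketch of that check, assuming we drive it from Python: run a command with LD_BIND_NOW=1 in its environment so the dynamic linker resolves all symbols at startup instead of lazily. The helper name is made up, and the binary path in the comment is a placeholder, not the real location of the magika executable.

```python
import os
import subprocess
import sys
from typing import List


def run_with_eager_binding(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run `cmd` with LD_BIND_NOW=1 so symbol resolution happens at startup."""
    env = dict(os.environ, LD_BIND_NOW="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Stand-in command that just echoes the variable back; on the 20.04 machine
# this would be replaced by the binary under test, e.g. ["/path/to/binary"].
result = run_with_eager_binding(
    [sys.executable, "-c", "import os; print(os.environ['LD_BIND_NOW'])"]
)
print(result.returncode, result.stdout.strip())
```

A non-zero return code together with a symbol lookup error on stderr would suggest that the successful runs were only hiding the problem behind lazy binding.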

> I don't know, it just needs to be done, and confirm that there are no other problems on that new step.

ACK. So, it could really be that ort per se doesn't truly rely on a newer glibc, and this problem could go away. I think this is the most reasonable next step. @ia0 please go for it when you have some time and let us know whether there are unexpected challenges! Thank you!