google / magika

Detect file content types with deep learning
https://google.github.io/magika/
Apache License 2.0

Convert python API into a dynamic-link library for Linux #679

Open 7012xxx opened 1 month ago

7012xxx commented 1 month ago

How can I build this Python API into a dynamic-link library for Linux, generating a file with the suffix '.so'?

reyammer commented 1 month ago

I believe the rust library would be useful for this use case? /cc @ia0?

ia0 commented 1 month ago

I'm not sure I fully understand the initial question. There are at least 2 ways to interpret it for someone like me unfamiliar with Python:

  • Provide a dynamic library of the Python API to be used in Python. (That's the part I don't know is possible.)
  • Provide a dynamic library of a C API similar to the Python API to be used by any language that dynamically links to C. (I'm also not sure if it's possible to do this from the Python library. It can probably be done from the Rust library although that depends on ort support. The API might also not really match the Python one.)

So I guess the best would be to know what the problem is rather than what a possible solution could be (XY problem).

@7012xxx could you help us understand what you are trying to do for which you believe a dynamic library could help? Thanks!

7012xxx commented 1 month ago

Thank you for your reply. I am looking to use Magika in an environment that does not have Python >=3.8. Additionally, I want to call Magika using both Golang and Python. During my research, I discovered that dynamic link libraries might meet my needs, enabling seamless calls between Golang and Python.

ia0 commented 1 month ago

Thanks, so it looks like for the Golang use-case, we have something planned: Providing a C API to the Rust library (ideally as both a static and dynamic library). But this still needs to be done. For the Python use-case, either the same library could be used (although since Python is not a compiled language, the static library probably won't be an option), or we could provide Python bindings to the Rust library using PyO3 and Maturin. This too would need to be designed and implemented.

You can follow #96 for the Golang use-case. For Python, we'll have to decide if we go with C (thus #90) or with PyO3 (which would require a new issue).

reyammer commented 1 month ago

My take for the python part (feedback is welcome):

@7012xxx: would the ctypes route around an .so file work for you?
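
For readers weighing that question, here is a minimal sketch of what the ctypes route looks like in general. Magika does not provide a C API yet, so the example binds libc's `strlen` as a stand-in. The `bind` helper is hypothetical, and nothing here reflects an actual Magika function name or signature.

```python
import ctypes


def bind(lib, name, restype, argtypes):
    """Look up `name` in `lib` and declare its C signature (hypothetical helper)."""
    func = getattr(lib, name)
    func.restype = restype
    func.argtypes = argtypes
    return func


# CDLL(None) exposes the symbols already loaded into the current process
# (POSIX only). A real caller would pass the path of the shared object
# instead, for example ctypes.CDLL("./libmagika_api.so").
libc = ctypes.CDLL(None)

# C prototype being bound: size_t strlen(const char *s);
strlen = bind(libc, "strlen", ctypes.c_size_t, [ctypes.c_char_p])

print(strlen(b"magika"))
```

Declaring `restype` and `argtypes` explicitly matters because ctypes otherwise assumes every function returns a C `int`, which silently truncates pointers and `size_t` values on 64-bit platforms.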

ia0 commented 1 month ago

  • once one has a shared object .so with the main functionality (which we should be able to generate from the rust codebase, right @ia0?)

For Rust in general yes, but for this particular case depending on ONNX through the ort crate, I don't know. We would need to test it, but it's indeed part of the plan to try. I expect a static library to be simpler. I'm expecting to track this in #90.

7012xxx commented 1 month ago

I would like to build the cdylib (i.e., the .so file) from the Rust code, but I have no background in Rust. I cannot generate the .so file directly under the /rust directory with "cargo build --release"; it tells me that the toml file needs to be edited. After several attempts at editing it, the build still fails. Could you please provide an example toml file or help me compile magika into a .so file?

7012xxx commented 1 month ago

I want to use Magika without relying on any other libraries and language environments.

ia0 commented 1 month ago

Here's an example of how to build a dynamic library in Rust and use it from C. Ultimately we'll build such a magika-api crate and provide a .h and .so file. But this is not yet on the table. This small example should provide you with all the knowledge you need to build a C API that suits your needs.

ia0 commented 3 weeks ago

Hello, ia0. I have successfully compiled magika into a .so file according to the steps you provided. However, while using the C API, I encountered the following error:

     python pyMagikaDemo.py libmagika_api.so
    Traceback (most recent call last):
      File "pyMagikaDemo.py", line 6, in <module>
        libmagika = ctypes.CDLL('./libmagika_api.so')
      File "/usr/lib64/python2.7/ctypes/__init__.py", line 360, in __init__
        self._handle = _dlopen(self._name, mode)
    OSError: ./libmagika_api.so: undefined symbol: _ZNKSt19__codecvt_utf8_baseIwE6do_outER11__mbstate_tPKwS4_RS4_PcS6_RS6_.

On inspection, I found that this symbol is missing in both ./libmagika_api.so and libstdc++.so.6. Is this problem caused by not importing all the correct dependencies?

I'm following up here instead of by email because others might be able to help you much better than I can. _ZNKSt19__codecvt_utf8_baseIwE6do_outER11__mbstate_tPKwS4_RS4_PcS6_RS6_ should be in libstdc++.so.6. Does your magika depend on libstdc++.so.6? Mine does:

% readelf -d ../target/release/libmagika_api.so
[...]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
[...]

In particular, did you manage to successfully run make ARG=src/lib.rs on your machine in my branch?
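
As a complement to the `readelf` check, the following sketch asks the dynamic loader directly whether a library exports a given symbol. The `exports_symbol` helper is hypothetical, and the library name `libstdc++.so.6` is resolved through the loader's normal search path, which may differ from the copy under `/lib64` on some systems.

```python
import ctypes


def exports_symbol(lib_path, symbol):
    """Return True if the loader can resolve `symbol` through `lib_path`.

    lib_path=None inspects the symbols already loaded in this process.
    Raises OSError if the library itself cannot be loaded.
    """
    lib = ctypes.CDLL(lib_path)
    try:
        # ctypes raises AttributeError when dlsym() cannot find the name.
        getattr(lib, symbol)
    except AttributeError:
        return False
    return True


# Mangled name copied from the traceback in this thread.
SYMBOL = "_ZNKSt19__codecvt_utf8_baseIwE6do_outER11__mbstate_tPKwS4_RS4_PcS6_RS6_"

if __name__ == "__main__":
    try:
        print(exports_symbol("libstdc++.so.6", SYMBOL))
    except OSError as err:
        print("could not load libstdc++.so.6: %s" % err)
```

A result of `False` would mean the installed libstdc++ itself lacks the symbol, which points at the platform's C++ runtime, not at the way libmagika_api.so was built.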

7012xxx commented 3 weeks ago

 [. . .]
    Finished `release` profile [optimized] target(s) in 1m 29s
gcc example.c -o example -lmagika_api -L../target/release
../target/release/libmagika_api.so: undefined reference to `std::out_of_range::out_of_range(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
../target/release/libmagika_api.so: undefined reference to `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned long, char const*, unsigned long)'

The .so file itself builds successfully, but linking example.c against it with gcc fails with the undefined references shown above, so libmagika_api.so does not seem to link correctly at that step. I assumed this would not affect my use of the .so file, so I did not pay much attention to it.

ia0 commented 3 weeks ago

It looks to me like a problem with your platform. Maybe it's too old or it's missing some library. If you can't resolve it on your own, your best bet is to wait until we support this use-case. This is currently not supported.

7012xxx commented 3 weeks ago

The following are my Linux platform parameters.

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

The following is the dependency relationship of the libmagika_api.so that I compiled. As can be seen, libstdc++.so.6 has been installed.

    linux-vdso.so.1 =>  (0x00007ffec5b85000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f2a70ec5000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f2a70caf000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f2a70aa7000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f2a7088b000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f2a70589000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f2a70385000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f2a6ffb7000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f2a72a0c000)

Everything appears to be in order, so I cannot tell whether any dependent library is missing. Do you have any suggestions?

[root@sz-platform-operation-1 lib64]# strings /lib64/libstdc++.so.6 | grep LIBCXX
GLIBCXX_3.4
GLIBCXX_3.4.1
GLIBCXX_3.4.2
GLIBCXX_3.4.3
GLIBCXX_3.4.4
GLIBCXX_3.4.5
GLIBCXX_3.4.6
GLIBCXX_3.4.7
GLIBCXX_3.4.8
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.11
GLIBCXX_3.4.12
GLIBCXX_3.4.13
GLIBCXX_3.4.14
GLIBCXX_3.4.15
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_DEBUG_MESSAGE_LENGTH
[root@sz-platform-operation-1 lib64]# g++ --version
g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Is there a compatibility issue between the C++ standard library and GCC here?
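
One plausible reading, offered as a sketch and not as a confirmed diagnosis: the `std::__cxx11::basic_string` references in the linker errors belong to the libstdc++ ABI that GCC 5.1 introduced under the symbol version GLIBCXX_3.4.21, while the listing above stops at GLIBCXX_3.4.19. The snippet below compares the newest version tag found in `strings` output against that threshold. The `glibcxx_versions` helper is hypothetical, and treating 3.4.21 as the requirement assumes the C++ code linked into the library was built with GCC 5 or newer.

```python
import re


def glibcxx_versions(strings_output):
    """Extract GLIBCXX_x.y.z version tags as a sorted list of integer tuples."""
    found = set()
    for line in strings_output.splitlines():
        match = re.match(r"^GLIBCXX_(\d+(?:\.\d+)*)$", line.strip())
        if match:
            found.add(tuple(int(part) for part in match.group(1).split(".")))
    return sorted(found)


# Symbol version that GCC 5.1 introduced together with the std::__cxx11 ABI.
# Using it as the threshold is an assumption about how the library was built.
REQUIRED = (3, 4, 21)

# Abbreviated copy of the listing posted in this thread.
sample = """GLIBCXX_3.4
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.19
GLIBCXX_DEBUG_MESSAGE_LENGTH"""

newest = glibcxx_versions(sample)[-1]
print(newest, newest >= REQUIRED)
```

Versions are compared as integer tuples because a plain string comparison would rank GLIBCXX_3.4.9 above GLIBCXX_3.4.19.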

7012xxx commented 3 weeks ago

@ia0 Could you provide the platform parameters when you compile with cargo? For example, the C++ version and Linux version. Also, which libraries does the compiled libmagika_api.so depend on?