google / magika

Detect file content types with deep learning
https://google.github.io/magika/
Apache License 2.0

Make application / bindings for Rust #93

Open reyammer opened 4 months ago

reyammer commented 4 months ago

When using magika as a Python command-line tool, the biggest overhead comes from starting the Python interpreter itself and loading all the libraries.

ONNX model loading itself is relatively fast (~30ms) and the content type inference takes ~5ms, but the overall Python CLI takes about 300ms.

A client written in Rust could make the CLI roughly 8-10x faster.

This is not a concern for large-scale automated pipelines (as they would bootstrap the library and load the model only once), but it is annoying for one-off CLI use cases.

If this works out, we could make this new Rust client the one installed via pip install magika (together with the Python API).

We may also want a proper way to make magika available to Rust applications.

Note for external contributors: this may become the future de-facto magika client, so its design will need extra care. We are reaching out internally as well, so please let us know if you start working on this so that we can coordinate and/or avoid duplicated work. Thank you!

reyammer commented 4 months ago

Update: we have a working POC and someone working on it. Placeholder in #180.

maekawatoshiki commented 3 months ago

I'm aware that the Rust implementation of magika is still very much in progress, but using the onnxruntime crate (last updated two years ago) does not seem reasonable. How about using the ort crate instead?

reyammer commented 3 months ago

Thanks for raising this! @ia0 thoughts?

ia0 commented 3 months ago

This looks really good. I'm unhappy with onnxruntime, but it was the only thing I found on crates.io (the search is really bad). I'll migrate to ort.

ia0 commented 3 months ago

The PR switching from onnxruntime to ort is:

cr-itay-baranes commented 1 month ago

@reyammer With a few adjustments, this Rust library could also be made compatible with Windows, e.g. by using the appropriate platform-specific library for Windows vs. Unix:

https://github.com/google/magika/blob/d5dbd03a9319643a60be9343abcededdcd47f813/rust/lib/src/input.rs#L18

The following should be updated as well, using attributes for conditional compilation such as #[cfg(target_family = "windows")]:

https://github.com/google/magika/blob/d5dbd03a9319643a60be9343abcededdcd47f813/rust/lib/src/input.rs#L69
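For illustration, a minimal sketch of the kind of conditional compilation being suggested; the gated helper below is a placeholder, not magika's actual code:

```rust
// Platform-specific metadata access, selected at compile time.
// `file_size` is a hypothetical helper used only to illustrate the pattern.

#[cfg(target_family = "unix")]
fn file_size(metadata: &std::fs::Metadata) -> u64 {
    use std::os::unix::fs::MetadataExt;
    metadata.size()
}

#[cfg(target_family = "windows")]
fn file_size(metadata: &std::fs::Metadata) -> u64 {
    use std::os::windows::fs::MetadataExt;
    metadata.file_size()
}
```

Only the variant matching the target family is compiled, so callers can use file_size() unconditionally on both platforms.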

cr-itay-baranes commented 1 month ago

@ia0 What should be the Rust equivalent of the Python Result.output.ct_label? It seems that only the deep learning model result is reflected in the Rust implementation, right?

ia0 commented 1 month ago

> It seems that only the deep learning model result is reflected in the Rust implementation, right?

Yes, that's correct. At the moment the Rust library only provides access to the deep learning model. It is planned (soon) to bring it closer to the Python library, including computing ct_label without deep learning for small files.

> What should be the Rust equivalent of the Python Result.output.ct_label?

This should be magika::Label. It already contains the Txt and Unknown labels, but they are not yet used when deep learning is bypassed (because that bypass is not yet implemented, see above).
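For illustration, a hedged sketch of what such a bypass might look like once implemented. Only the Label variants Txt and Unknown mentioned above come from the crate; the function and its heuristic are hypothetical:

```rust
use magika::Label;

// Hypothetical helper sketching the planned bypass: assign a label to very
// small files without running the deep learning model. The printable-ASCII
// heuristic below is illustrative, not magika's actual rule.
fn label_without_model(content: &[u8]) -> Label {
    let looks_like_text = content
        .iter()
        .all(|&b| b.is_ascii_graphic() || b.is_ascii_whitespace());
    if !content.is_empty() && looks_like_text {
        Label::Txt
    } else {
        Label::Unknown
    }
}
```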

ia0 commented 3 days ago

Once #555 is merged, the FileType enum will provide details about the InferredType (from the model) and the RuledType (based on rules). In the second case, the overruled field will contain the InferredType if the rule overruled one. Those details can be abstracted away by calling FileType::info() (which returns file type information such as description, group, MIME type, and extensions) or FileType::content_type() (which returns the content type if the file has content).
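To make that shape concrete, here is a rough sketch of the structure described in this comment. Only the names FileType, InferredType, RuledType, overruled, info(), and content_type() come from the comment; all fields and signatures are placeholders, not the actual post-#555 API:

```rust
/// Result produced by the deep learning model (placeholder).
pub struct InferredType;

/// Result decided by a rule.
pub struct RuledType {
    /// The model result that the rule overruled, if any.
    pub overruled: Option<InferredType>,
}

/// Either the model's result or a rule-based result.
pub enum FileType {
    Inferred(InferredType),
    Ruled(RuledType),
}

/// Placeholder for the file type information mentioned above.
pub struct FileTypeInfo {
    pub description: &'static str,
    pub group: &'static str,
    pub mime_type: &'static str,
    pub extensions: &'static [&'static str],
}

impl FileType {
    /// Abstracts over both variants: description, group, MIME type, extensions, etc.
    pub fn info(&self) -> FileTypeInfo {
        unimplemented!("sketch only")
    }

    /// Returns the content type if the file has content.
    pub fn content_type(&self) -> Option<FileTypeInfo> {
        unimplemented!("sketch only")
    }
}
```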