reyammer opened 4 months ago
Update: We have some POC working and someone working on it. Placeholder in #180.
I'm aware that the Rust implementation of magika is still very much in progress, but depending on the `onnxruntime` crate (last updated two years ago) does not seem reasonable. How about using the `ort` crate instead?
Thanks for raising this! @ia0 thoughts?
This looks really good. I'm unhappy with `onnxruntime`, but that's the only thing I found on crates.io (the search is really bad). I'll migrate to `ort`.
The PR switching from `onnxruntime` to `ort` is:
@reyammer
With a few adjustments, this Rust library could also be made compatible with Windows, for example by selecting the appropriate library for the platform, whether Windows or Unix:
https://github.com/google/magika/blob/d5dbd03a9319643a60be9343abcededdcd47f813/rust/lib/src/input.rs#L18
This should be updated as well, using conditional-compilation attributes such as `#[cfg(target_family = "windows")]`:
https://github.com/google/magika/blob/d5dbd03a9319643a60be9343abcededdcd47f813/rust/lib/src/input.rs#L69
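As a rough sketch of the suggestion (the constant and function below are hypothetical illustrations, not taken from `input.rs`), conditional compilation can select the platform-specific piece at build time:

```rust
// Hypothetical sketch: use cfg attributes to pick platform-specific
// behavior; the actual items needed in input.rs may differ.
#[cfg(target_family = "windows")]
const PATH_SEPARATOR: char = '\\';
#[cfg(not(target_family = "windows"))]
const PATH_SEPARATOR: char = '/';

// Join a directory and a file name with the platform's separator.
fn join(dir: &str, file: &str) -> String {
    format!("{dir}{PATH_SEPARATOR}{file}")
}

fn main() {
    // On Unix this prints "models/standard_v1"; on Windows, "models\standard_v1".
    println!("{}", join("models", "standard_v1"));
}
```

Only the matching `const` is compiled for a given target, so there is no runtime cost to the dispatch.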
@ia0 What should be the Rust equivalent of the Python `Result.output.ct_label`? It seems that only the deep learning model's result is reflected in the Rust implementation, right?
> It seems that only the deep learning model's result is reflected in the Rust implementation, right?
Yes, that's correct. At the moment the Rust library only provides access to the deep learning model. It is planned (soon) to bring it closer to the Python library (including computing `ct_label` without deep learning for small files).
> What should be the Rust equivalent of the Python `Result.output.ct_label`?
This should be `magika::Label`. It already contains the `Txt` and `Unknown` labels, but they are not yet used when deep learning is bypassed (because that bypass is not yet implemented, see above).
Once #555 is merged, the `FileType` enum will provide the details regarding the `InferredType` (from the model) and the `RuledType` (based on rules). In the second case, the `overruled` field will contain the `InferredType` if the rule overruled an `InferredType`. Those details can be abstracted away by calling `FileType::info()` (which returns file type information such as description, group, MIME type, extensions, etc.) or `FileType::content_type()` (which returns the content type if the file has content).
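To make the two-variant shape concrete, here is a self-contained sketch that models the description above; the variant and field names mirror this comment but remain assumptions until #555 actually lands:

```rust
// Self-contained model of the described FileType shape (hypothetical;
// the real types in #555 may differ).
#[derive(Debug, Clone, PartialEq)]
enum FileType {
    // Type inferred by the deep learning model.
    Inferred(String),
    // Type decided by a rule; `overruled` holds the inferred type that
    // the rule overrode, if any.
    Ruled { label: String, overruled: Option<String> },
}

impl FileType {
    // Abstract over both variants: return the effective content type,
    // analogous to the FileType::content_type() described above.
    fn content_type(&self) -> &str {
        match self {
            FileType::Inferred(label) => label,
            FileType::Ruled { label, .. } => label,
        }
    }
}

fn main() {
    let by_model = FileType::Inferred("python".to_string());
    let by_rule = FileType::Ruled {
        label: "txt".to_string(),
        overruled: Some("python".to_string()),
    };
    // Callers that only need the final answer never inspect `overruled`.
    println!("{} {}", by_model.content_type(), by_rule.content_type());
}
```

The point of the `overruled` field is that rule-based decisions stay auditable: a caller can still see what the model would have said while treating the ruled label as authoritative.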
When using magika as a Python command-line tool, the biggest overhead comes from starting the Python interpreter itself and loading all the libraries. Loading the ONNX model is relatively fast (~30ms) and content type inference takes ~5ms, yet the overall Python CLI takes about ~300ms.
Having a client written in Rust could make the CLI roughly 8-10x faster.
This is not a concern for large-scale automated pipelines (as they would bootstrap the library and load the model only once), but it is annoying for one-off CLI use cases.
If this works out, then we could make this new Rust client the one installed via `pip install magika` (together with the Python API). We may also want a proper way to make magika available to Rust applications.
Note for external contributors: this may become the future de facto magika client, so its design will need extra care. We are reaching out internally as well, so please let us know if you start working on this so that we can coordinate better and/or avoid duplicated work. Thank you!