google / magika

Detect file content types with deep learning
https://google.github.io/magika/
Apache License 2.0

Keep support for model v1 within the rust client #593

Open yaniv5678 opened 3 months ago

yaniv5678 commented 3 months ago

Hi, is there any possibility to add support for running the v1 version of the model? Maybe behind a Cargo.toml feature flag.

Thanks!

reyammer commented 1 month ago

Hello,

What would the use case for this be? Ensuring backward compatibility? Or did you notice specific issues with v2? Additional context would help!

yaniv5678 commented 1 month ago

As far as I've seen, v2 is slower than v1 since it is a larger model, so we prefer using v1 when working with data at scale :)

reyammer commented 1 month ago

thanks for the feedback, this is very useful! /cc @ia0 @invernizzi

We do have a smaller version of our v2 models (https://github.com/google/magika/tree/main/assets/models/fast_v2_1). It can easily be used from the python module (by pointing model_dir to it), but it's not integrated with the rust codebase yet (and doing so is currently not trivial).
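
For reference, here is a minimal sketch (not from this thread) of pointing the python module at the smaller model. The local directory path is an assumption about where fast_v2_1 was downloaded, and the exact shape of the result object may differ across magika versions.

```python
from pathlib import Path

from magika import Magika

# Assumption: the fast_v2_1 model directory was downloaded locally from
# https://github.com/google/magika/tree/main/assets/models/fast_v2_1
magika = Magika(model_dir=Path("assets/models/fast_v2_1"))

# Identify the content type of some in-memory bytes using the smaller model.
result = magika.identify_bytes(b"#!/usr/bin/env python3\nprint('hello')\n")

# The predicted content-type info; attribute names may vary by magika version.
print(result.output)
```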

We'll discuss internally how to approach this. In the meantime, please let us know if you have additional context to share. For example: it seems you are integrating the magika rust cli within an existing pipeline... in which language is this pipeline written? If, for example, the pipeline is written in python, the python module would be the way to go: most of rust's performance advantage comes from avoiding the one-off startup cost, and once everything is loaded in python, inference time in rust vs python should be roughly the same.