huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Rust: How to handle models with `precompiled_charsmap = null` #1627

Open kallebysantos opened 2 months ago

kallebysantos commented 2 months ago

Hi guys, I'm currently working on https://github.com/supabase/edge-runtime/pull/368, which aims to add a Rust implementation of pipeline().

While implementing the translation task, I found that I can't load the Tokenizer instance for the Xenova/opus-mt-en-fr ONNX model and the other opus-mt-* variants.

I got the following:

```rust
let tokenizer_path = Path::new("opus-mt-en-fr/tokenizer.json");
let tokenizer = Tokenizer::from_file(tokenizer_path).unwrap();
```

```
thread 'main' panicked at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/normalizers/mod.rs:143:26:
Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
stack backtrace:
   0: rust_begin_unwind
         at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/std/src/panicking.rs:662:5
   1: core::panicking::panic_fmt
         at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/panicking.rs:74:14
   2: core::result::unwrap_failed
         at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/result.rs:1679:5
   3: core::result::Result::expect
         at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/result.rs:1059:23
   4: ::deserialize
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/normalizers/mod.rs:139:25
   5: as serde::de::Visitor>::visit_some
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/impls.rs:916:9
   6: <&mut serde_json::de::Deserializer as serde::de::Deserializer>::deserialize_option
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1672:18
   7: serde::de::impls::>::deserialize
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/impls.rs:935:9
   8: as serde::de::DeserializeSeed>::deserialize
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/mod.rs:801:9
   9: as serde::de::MapAccess>::next_value_seed
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2008:9
  10: serde::de::MapAccess::next_value
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.207/src/de/mod.rs:1874:9
  11: as serde::de::Visitor>::visit_map
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/serialization.rs:132:55
  12: <&mut serde_json::de::Deserializer as serde::de::Deserializer>::deserialize_struct
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1840:31
  13: tokenizers::tokenizer::serialization::>::deserialize
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/serialization.rs:62:9
  14: ::deserialize::__Visitor as serde::de::Visitor>::visit_newtype_struct
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:408:21
  15: <&mut serde_json::de::Deserializer as serde::de::Deserializer>::deserialize_newtype_struct
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:1723:9
  16: tokenizers::tokenizer::_::::deserialize
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:408:21
  17: serde_json::de::from_trait
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2478:22
  18: serde_json::de::from_str
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde_json-1.0.124/src/de.rs:2679:5
  19: tokenizers::tokenizer::Tokenizer::from_file
         at /home/kalleby/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.20.0/src/tokenizer/mod.rs:439:25
  20: transformers_rs::pipeline::tasks::seq_to_seq::seq_to_seq
         at ./src/pipeline/tasks/seq_to_seq.rs:51:21
  21: app::main
         at ./examples/app/src/main.rs:78:5
  22: core::ops::function::FnOnce::call_once
         at /rustc/80eb5a8e910e5185d47cdefe3732d839c78a5e7e/library/core/src/ops/function.rs:250:5
```

I know it occurs because their tokenizer.json file contains the following:

opus-mt-en-fr:

```json
"normalizer": {
    "type": "Precompiled",
    "precompiled_charsmap": null
}
```

while the expected format looks something like this:

nllb-200-distilled-600M:

```json
"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "Precompiled",
      "precompiled_charsmap": "ALQCAACEAAA..."
    }
  ]
}
```

Looking at the original Helsinki-NLP/opus-mt-en-fr repository, I noticed that there is no tokenizer.json file for it.

I would like to know: is precompiled_charsmap required to be non-null?

Maybe it could be handled as Option<_>?

Is there some workaround to run these models without changing the internal model files? And how can I handle an exported ONNX model that doesn't have a tokenizer.json file?
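For what it's worth, one possible workaround (a sketch only, not an official recommendation from the maintainers): load tokenizer.json with Python's standard json module, drop a top-level Precompiled normalizer whose precompiled_charsmap is null, and pass the patched string to Tokenizer.from_str. The assumption that a null charsmap makes the normalizer a no-op is mine, not confirmed in this thread.

```python
import json

def patch_null_charsmap(raw: str) -> str:
    """Remove a top-level Precompiled normalizer whose charsmap is null."""
    config = json.loads(raw)
    norm = config.get("normalizer")
    if (norm is not None
            and norm.get("type") == "Precompiled"
            and norm.get("precompiled_charsmap") is None):
        # Assumption: a null charsmap means the normalizer does nothing,
        # so dropping it should be safe.
        config["normalizer"] = None
    return json.dumps(config)

# Usage (assumes the tokenizers package is installed):
# from tokenizers import Tokenizer
# with open("opus-mt-en-fr/tokenizer.json") as f:
#     tokenizer = Tokenizer.from_str(patch_null_charsmap(f.read()))
```

This avoids editing the model files on disk, at the cost of patching the config in memory on every load.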

ankane commented 1 month ago

I'm seeing the same error with Python when trying to read the tokenizer from Xenova/speecht5_tts.

```sh
wget https://huggingface.co/Xenova/speecht5_tts/resolve/main/tokenizer.json
```

```python
from tokenizers import Tokenizer

Tokenizer.from_file("tokenizer.json")
```

```
thread '<unnamed>' panicked at /Users/runner/work/tokenizers/tokenizers/tokenizers/src/normalizers/mod.rs:143:26:
Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
...
pyo3_runtime.PanicException: Precompiled: Error("invalid type: null, expected a borrowed string", line: 1, column: 28)
```

With Tokenizers 0.19.0, this raised an error which could be handled rather than a panic. It looks like this may be related to #1604.

vicantwin commented 1 month ago

I'm also facing the same issue (#1645) with speecht5_tts.

ArthurZucker commented 1 month ago

I think passing a "" might work. cc @xenova, not sure why you end up with nulls there, but we can probably sync, and I can add support for Option!

kallebysantos commented 1 month ago

> I think passing a "" might work. cc @xenova, not sure why you end up with nulls there, but we can probably sync, and I can add support for Option!

The Xenova implementation doesn't read the value directly; instead it iterates over the configured normalizers, and I think it simply ignores the null values.
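That skip-the-null behavior could be sketched like this (a hypothetical helper illustrating the idea, not the actual transformers.js code):

```python
def usable_normalizers(config: dict) -> list:
    """Collect normalizer entries from a tokenizer config, skipping
    Precompiled ones whose precompiled_charsmap is null."""
    norm = config.get("normalizer")
    if norm is None:
        return []
    # A normalizer is either a single entry or a Sequence of entries.
    entries = norm["normalizers"] if norm.get("type") == "Sequence" else [norm]
    return [
        n for n in entries
        if not (n.get("type") == "Precompiled"
                and n.get("precompiled_charsmap") is None)
    ]
```

With the opus-mt-en-fr config above, this yields an empty list, so no normalization step is applied and the null charsmap is never deserialized.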

I agree with you: adding support for Option<_> may solve it.

vicantwin commented 1 month ago

I've implemented null support for spm_precompiled at vicantwin/spm_precompiled, including a test for the null case, and all tests pass successfully.

However, I need some help making the corresponding change in this repository, as I'm not entirely familiar with the codebase and unsure how to implement the necessary changes. Any help would be greatly appreciated.