huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

error: casting `&T` to `&mut T` is undefined behavior #1485

Closed — Jipok closed this 1 month ago

Jipok commented 3 months ago

ERROR: Failed building wheel for tokenizers:

Building wheels for collected packages: tokenizers
  Building wheel for tokenizers (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for tokenizers (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [592 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-312
      creating build/lib.linux-x86_64-cpython-312/tokenizers
      copying py_src/tokenizers/__init__.py -> build/lib.linux-x86_64-cpython-312/tokenizers
      creating build/lib.linux-x86_64-cpython-312/tokenizers/models
      copying py_src/tokenizers/models/__init__.py -> build/lib.linux-x86_64-cpython-312/tokenizers/models
      creating build/lib.linux-x86_64-cpython-312/tokenizers/decoders
      copying py_src/tokenizers/decoders/__init__.py -> build/lib.linux-x86_64-cpython-312/tokenizers/decoders
      creating build/lib.linux-x86_64-cpython-312/tokenizers/normalizers
      copying py_src/tokenizers/normalizers/__init__.py -> build/lib.linux-x86_64-cpython-312/tokenizers/normalizers
      creating build/lib.linux-x86_64-cpython-312/tokenizers/pre_tokenizers
      copying py_src/tokenizers/pre_tokenizers/__init__.py -> build/lib.linux-x86_64-cpython-312/tokenizers/pre_tokenizers
      creating build/lib.linux-x86_64-cpython-312/tokenizers/processors
      copying py_src/tokenizers/processors/__init__.py -> build/lib.linux-x86_64-cpython-312/tokenizers/processors
      creating build/lib.linux-x86_64-cpython-312/tokenizers/trainers
      copying py_src/tokenizers/trainers/__init__.py -> build/lib.linux-x86_64-cpython-312/tokenizers/trainers
      creating build/lib.linux-x86_64-cpython-312/tokenizers/implementations
      copying py_src/tokenizers/implementations/__init__.py -> build/lib.linux-x86_64-cpython-312/tokenizers/implementations
      copying py_src/tokenizers/implementations/base_tokenizer.py -> build/lib.linux-x86_64-cpython-312/tokenizers/implementations
      copying py_src/tokenizers/implementations/bert_wordpiece.py -> build/lib.linux-x86_64-cpython-312/tokenizers/implementations
      copying py_src/tokenizers/implementations/byte_level_bpe.py -> build/lib.linux-x86_64-cpython-312/tokenizers/implementations
      copying py_src/tokenizers/implementations/char_level_bpe.py -> build/lib.linux-x86_64-cpython-312/tokenizers/implementations
      copying py_src/tokenizers/implementations/sentencepiece_bpe.py -> build/lib.linux-x86_64-cpython-312/tokenizers/implementations
      copying py_src/tokenizers/implementations/sentencepiece_unigram.py -> build/lib.linux-x86_64-cpython-312/tokenizers/implementations
      creating build/lib.linux-x86_64-cpython-312/tokenizers/tools
      copying py_src/tokenizers/tools/__init__.py -> build/lib.linux-x86_64-cpython-312/tokenizers/tools
      copying py_src/tokenizers/tools/visualizer.py -> build/lib.linux-x86_64-cpython-312/tokenizers/tools
      copying py_src/tokenizers/__init__.pyi -> build/lib.linux-x86_64-cpython-312/tokenizers
      copying py_src/tokenizers/models/__init__.pyi -> build/lib.linux-x86_64-cpython-312/tokenizers/models
      copying py_src/tokenizers/decoders/__init__.pyi -> build/lib.linux-x86_64-cpython-312/tokenizers/decoders
      copying py_src/tokenizers/normalizers/__init__.pyi -> build/lib.linux-x86_64-cpython-312/tokenizers/normalizers
      copying py_src/tokenizers/pre_tokenizers/__init__.pyi -> build/lib.linux-x86_64-cpython-312/tokenizers/pre_tokenizers
      copying py_src/tokenizers/processors/__init__.pyi -> build/lib.linux-x86_64-cpython-312/tokenizers/processors
      copying py_src/tokenizers/trainers/__init__.pyi -> build/lib.linux-x86_64-cpython-312/tokenizers/trainers
      copying py_src/tokenizers/tools/visualizer-styles.css -> build/lib.linux-x86_64-cpython-312/tokenizers/tools
      running build_ext
      running build_rust
          Updating crates.io index
      cargo rustc --lib --message-format=json-render-diagnostics --manifest-path Cargo.toml --release -v --features pyo3/extension-module --crate-type cdylib --
         Compiling libc v0.2.153
         Compiling proc-macro2 v1.0.79
         Compiling unicode-ident v1.0.12
         Compiling autocfg v1.2.0
         Compiling pkg-config v0.3.30
         Compiling cfg-if v1.0.0
         Compiling typenum v1.17.0
         Compiling memchr v2.7.2
         Compiling version_check v0.9.4
         Compiling once_cell v1.19.0
         Compiling syn v1.0.109
         Compiling pin-project-lite v0.2.14
         Compiling target-lexicon v0.12.14
         Compiling vcpkg v0.2.15
         Compiling bitflags v2.5.0
         Compiling bytes v1.6.0
         Compiling itoa v1.0.11
         Compiling subtle v2.5.0
         Compiling futures-core v0.3.30
         Compiling serde v1.0.197
         Compiling crossbeam-utils v0.8.19
         Compiling openssl v0.10.64
         Compiling fnv v1.0.7
           Running `rustc --crate-name build_script_build --edition=2021 /home/sd/.cargo/registry/src/index.crates.io-6f17d22bba15001f/proc-macro2-1.0.79/build.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type bin --emit=dep-info,link -C embed-bitcode=no -C debug-assertions=off --cfg 'feature="default"' --cfg 'feature="proc-macro"' -C metadata=b4ee986c80539004 -C extra-filename=-b4ee986c80539004 --out-dir /tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/build/proc-macro2-b4ee986c80539004 -L dependency=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps --cap-lints allow`
  ...
         Compiling tokenizers v0.13.3 (/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/tokenizers-lib)
           Running `rustc --crate-name tokenizers --edition=2018 tokenizers-lib/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --cfg 'feature="cached-path"' --cfg 'feature="clap"' --cfg 'feature="cli"' --cfg 'feature="default"' --cfg 'feature="dirs"' --cfg 'feature="esaxx_fast"' --cfg 'feature="http"' --cfg 'feature="indicatif"' --cfg 'feature="onig"' --cfg 'feature="progressbar"' --cfg 'feature="reqwest"' -C metadata=7328a86746abf437 -C extra-filename=-7328a86746abf437 --out-dir /tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps -L dependency=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps --extern aho_corasick=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libaho_corasick-4b3322f33dd90c4d.rmeta --extern cached_path=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libcached_path-5ed5128f026500fa.rmeta --extern clap=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libclap-75fa0696f6a35286.rmeta --extern derive_builder=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libderive_builder-8a346deffc2ebe1e.rmeta --extern dirs=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libdirs-b91e2b12ef7b3a26.rmeta --extern esaxx_rs=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libesaxx_rs-ec5f02997062ab07.rmeta --extern getrandom=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libgetrandom-93dfba45fc2b8e30.rmeta --extern indicatif=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libindicatif-3730620e4a8e6215.rmeta --extern 
itertools=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libitertools-2aa7fd4d247f314a.rmeta --extern lazy_static=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/liblazy_static-d60e4dd36b567e7f.rmeta --extern log=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/liblog-e7360a68c8f9fdb6.rmeta --extern macro_rules_attribute=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libmacro_rules_attribute-0c5e18dae1223f79.rmeta --extern monostate=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libmonostate-cd0f5691d941ce54.rmeta --extern onig=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libonig-18382c6494a03e83.rmeta --extern paste=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libpaste-24a5791047389e1c.so --extern rand=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/librand-c6f59ec8eb990809.rmeta --extern rayon=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/librayon-d58498cddcaf228e.rmeta --extern rayon_cond=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/librayon_cond-81749514621b9292.rmeta --extern regex=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libregex-0296cd254892a4e0.rmeta --extern regex_syntax=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libregex_syntax-899dc8a5500cf19b.rmeta --extern reqwest=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libreqwest-e23a1f2e52abfda1.rmeta --extern serde=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libserde-86a61dd17283abc2.rmeta --extern 
serde_json=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libserde_json-4224df968acf122a.rmeta --extern spm_precompiled=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libspm_precompiled-a430935e8998b536.rmeta --extern thiserror=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libthiserror-8c7345053709316a.rmeta --extern unicode_normalization_alignments=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libunicode_normalization_alignments-da6b9466ba182084.rmeta --extern unicode_segmentation=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libunicode_segmentation-71eb6260cf3865ac.rmeta --extern unicode_categories=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libunicode_categories-166b45fdd025eb04.rmeta -L native=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/build/bzip2-sys-503e67e92eb0af78/out/lib -L native=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/build/zstd-sys-70e1cad5ab897179/out -L native=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/build/esaxx-rs-f91c3d0a3966aca1/out -L native=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/build/onig_sys-20fbd3f96135b358/out`
      warning: variable does not need to be mutable
         --> tokenizers-lib/src/models/unigram/model.rs:265:21
          |
      265 |                 let mut target_node = &mut best_path_ends_at[key_pos];
          |                     ----^^^^^^^^^^^
          |                     |
          |                     help: remove this `mut`
          |
          = note: `#[warn(unused_mut)]` on by default

      warning: variable does not need to be mutable
         --> tokenizers-lib/src/models/unigram/model.rs:282:21
          |
      282 |                 let mut target_node = &mut best_path_ends_at[starts_at + mblen];
          |                     ----^^^^^^^^^^^
          |                     |
          |                     help: remove this `mut`

      warning: variable does not need to be mutable
         --> tokenizers-lib/src/pre_tokenizers/byte_level.rs:200:59
          |
      200 |     encoding.process_tokens_with_offsets_mut(|(i, (token, mut offsets))| {
          |                                                           ----^^^^^^^
          |                                                           |
          |                                                           help: remove this `mut`

      error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
         --> tokenizers-lib/src/models/bpe/trainer.rs:526:47
          |
      522 |                     let w = &words[*i] as *const _ as *mut _;
          |                             -------------------------------- casting happend here
      ...
      526 |                         let word: &mut Word = &mut (*w);
          |                                               ^^^^^^^^^
          |
          = note: for more information, visit <https://doc.rust-lang.org/book/ch15-05-interior-mutability.html>
          = note: `#[deny(invalid_reference_casting)]` on by default

      warning: `tokenizers` (lib) generated 3 warnings
      error: could not compile `tokenizers` (lib) due to 1 previous error; 3 warnings emitted

      Caused by:
        process didn't exit successfully: `rustc --crate-name tokenizers --edition=2018 tokenizers-lib/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --cfg 'feature="cached-path"' --cfg 'feature="clap"' --cfg 'feature="cli"' --cfg 'feature="default"' --cfg 'feature="dirs"' --cfg 'feature="esaxx_fast"' --cfg 'feature="http"' --cfg 'feature="indicatif"' --cfg 'feature="onig"' --cfg 'feature="progressbar"' --cfg 'feature="reqwest"' -C metadata=7328a86746abf437 -C extra-filename=-7328a86746abf437 --out-dir /tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps -L dependency=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps --extern aho_corasick=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libaho_corasick-4b3322f33dd90c4d.rmeta --extern cached_path=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libcached_path-5ed5128f026500fa.rmeta --extern clap=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libclap-75fa0696f6a35286.rmeta --extern derive_builder=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libderive_builder-8a346deffc2ebe1e.rmeta --extern dirs=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libdirs-b91e2b12ef7b3a26.rmeta --extern esaxx_rs=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libesaxx_rs-ec5f02997062ab07.rmeta --extern getrandom=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libgetrandom-93dfba45fc2b8e30.rmeta --extern indicatif=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libindicatif-3730620e4a8e6215.rmeta --extern 
itertools=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libitertools-2aa7fd4d247f314a.rmeta --extern lazy_static=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/liblazy_static-d60e4dd36b567e7f.rmeta --extern log=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/liblog-e7360a68c8f9fdb6.rmeta --extern macro_rules_attribute=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libmacro_rules_attribute-0c5e18dae1223f79.rmeta --extern monostate=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libmonostate-cd0f5691d941ce54.rmeta --extern onig=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libonig-18382c6494a03e83.rmeta --extern paste=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libpaste-24a5791047389e1c.so --extern rand=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/librand-c6f59ec8eb990809.rmeta --extern rayon=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/librayon-d58498cddcaf228e.rmeta --extern rayon_cond=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/librayon_cond-81749514621b9292.rmeta --extern regex=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libregex-0296cd254892a4e0.rmeta --extern regex_syntax=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libregex_syntax-899dc8a5500cf19b.rmeta --extern reqwest=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libreqwest-e23a1f2e52abfda1.rmeta --extern serde=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libserde-86a61dd17283abc2.rmeta --extern 
serde_json=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libserde_json-4224df968acf122a.rmeta --extern spm_precompiled=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libspm_precompiled-a430935e8998b536.rmeta --extern thiserror=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libthiserror-8c7345053709316a.rmeta --extern unicode_normalization_alignments=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libunicode_normalization_alignments-da6b9466ba182084.rmeta --extern unicode_segmentation=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libunicode_segmentation-71eb6260cf3865ac.rmeta --extern unicode_categories=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/deps/libunicode_categories-166b45fdd025eb04.rmeta -L native=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/build/bzip2-sys-503e67e92eb0af78/out/lib -L native=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/build/zstd-sys-70e1cad5ab897179/out -L native=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/build/esaxx-rs-f91c3d0a3966aca1/out -L native=/tmp/pip-install-bvuco56u/tokenizers_8fea27572d074ce5977e58f1408074ea/target/release/build/onig_sys-20fbd3f96135b358/out` (exit status: 1)
      error: `cargo rustc --lib --message-format=json-render-diagnostics --manifest-path Cargo.toml --release -v --features pyo3/extension-module --crate-type cdylib --` failed with code 101
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tokenizers

Full log: https://snips.sh/f/XPSNFeHd9Q
Python 3.12.2, rustc 1.76.0 (07dca489a 2024-02-04), Void Linux

austinleroy commented 2 months ago

While the code in this repo should be fixed, a temporary workaround is to use an older version of the rust toolchain (I had success with rust 1.72.0, installing version 0.13.2):

RUSTUP_TOOLCHAIN=1.72.0 pip install tokenizers==0.13.2

Originally I was trying to install 0.13.3, but ran into issues because the clap dependency requires rust 1.74 or newer.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

fcolecumberri commented 1 month ago

Having github-actions close an obvious bug just because no one commented on it doesn't make the bug go away.

ArthurZucker commented 1 month ago

Pretty sure this was fixed

Arondight commented 1 month ago

rust 1:1.78.0-1

         Compiling tokenizers v0.13.3 (/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/tokenizers-lib)
           Running `rustc --crate-name tokenizers --edition=2018 tokenizers-lib/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --cfg 'feature="cached-path"' --cfg 'feature="clap"' --cfg 'feature="cli"' --cfg 'feature="default"' --cfg 'feature="dirs"' --cfg 'feature="esaxx_fast"' --cfg 'feature="http"' --cfg 'feature="indicatif"' --cfg 'feature="onig"' --cfg 'feature="progressbar"' --cfg 'feature="reqwest"' -C metadata=8900dc2403a2c8dd -C extra-filename=-8900dc2403a2c8dd --out-dir /tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps -C strip=debuginfo -L dependency=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps --extern aho_corasick=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libaho_corasick-6aa983f83cc1d860.rmeta --extern cached_path=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libcached_path-3ec46145dd1d130e.rmeta --extern clap=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libclap-b6f996e8d27659fd.rmeta --extern derive_builder=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libderive_builder-9b96e240da4197d9.rmeta --extern dirs=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libdirs-5779e67b580d982d.rmeta --extern esaxx_rs=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libesaxx_rs-a4589fe58879f69e.rmeta --extern getrandom=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libgetrandom-a1725e4f12011643.rmeta --extern indicatif=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libindicatif-5a25b637c223a512.rmeta --extern 
itertools=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libitertools-25bf26bb9d7012e3.rmeta --extern lazy_static=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/liblazy_static-efe629d64d1e110a.rmeta --extern log=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/liblog-aefa68b3bb6aa74b.rmeta --extern macro_rules_attribute=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libmacro_rules_attribute-905f7969e6855dc7.rmeta --extern monostate=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libmonostate-979232758d229ae8.rmeta --extern onig=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libonig-d57fa18c6b270e69.rmeta --extern paste=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libpaste-dcd1fc4ea32404f5.so --extern rand=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/librand-14a9ba308db49e20.rmeta --extern rayon=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/librayon-bbce6394af2ecdb4.rmeta --extern rayon_cond=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/librayon_cond-1a99da87a6ad378d.rmeta --extern regex=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libregex-abe854aac7680929.rmeta --extern regex_syntax=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libregex_syntax-bf2c82fdea1a20c9.rmeta --extern reqwest=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libreqwest-08741808824a1069.rmeta --extern serde=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libserde-48f9ebe75a8f3233.rmeta --extern 
serde_json=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libserde_json-722fc36ce3c5e169.rmeta --extern spm_precompiled=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libspm_precompiled-4b735c268352039e.rmeta --extern thiserror=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libthiserror-0a045910d95e7f7c.rmeta --extern unicode_normalization_alignments=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libunicode_normalization_alignments-72a662c4885161d8.rmeta --extern unicode_segmentation=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libunicode_segmentation-d45cbfa0bdea00fb.rmeta --extern unicode_categories=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libunicode_categories-c386831dea5a0d6e.rmeta -L native=/usr/lib -L native=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/build/zstd-sys-5958720fa03c9e44/out -L native=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/build/esaxx-rs-cd4e20ef7e068fc7/out -L native=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/build/onig_sys-2153c850ad2e752d/out`
      warning: variable does not need to be mutable
         --> tokenizers-lib/src/models/unigram/model.rs:265:21
          |
      265 |                 let mut target_node = &mut best_path_ends_at[key_pos];
          |                     ----^^^^^^^^^^^
          |                     |
          |                     help: remove this `mut`
          |
          = note: `#[warn(unused_mut)]` on by default

      warning: variable does not need to be mutable
         --> tokenizers-lib/src/models/unigram/model.rs:282:21
          |
      282 |                 let mut target_node = &mut best_path_ends_at[starts_at + mblen];
          |                     ----^^^^^^^^^^^
          |                     |
          |                     help: remove this `mut`

      warning: variable does not need to be mutable
         --> tokenizers-lib/src/pre_tokenizers/byte_level.rs:200:59
          |
      200 |     encoding.process_tokens_with_offsets_mut(|(i, (token, mut offsets))| {
          |                                                           ----^^^^^^^
          |                                                           |
          |                                                           help: remove this `mut`

      error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
         --> tokenizers-lib/src/models/bpe/trainer.rs:526:47
          |
      522 |                     let w = &words[*i] as *const _ as *mut _;
          |                             -------------------------------- casting happend here
      ...
      526 |                         let word: &mut Word = &mut (*w);
          |                                               ^^^^^^^^^
          |
          = note: for more information, visit <https://doc.rust-lang.org/book/ch15-05-interior-mutability.html>
          = note: `#[deny(invalid_reference_casting)]` on by default

      warning: `tokenizers` (lib) generated 3 warnings
      error: could not compile `tokenizers` (lib) due to 1 previous error; 3 warnings emitted

      Caused by:
        process didn't exit successfully: `rustc --crate-name tokenizers --edition=2018 tokenizers-lib/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --cfg 'feature="cached-path"' --cfg 'feature="clap"' --cfg 'feature="cli"' --cfg 'feature="default"' --cfg 'feature="dirs"' --cfg 'feature="esaxx_fast"' --cfg 'feature="http"' --cfg 'feature="indicatif"' --cfg 'feature="onig"' --cfg 'feature="progressbar"' --cfg 'feature="reqwest"' -C metadata=8900dc2403a2c8dd -C extra-filename=-8900dc2403a2c8dd --out-dir /tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps -C strip=debuginfo -L dependency=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps --extern aho_corasick=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libaho_corasick-6aa983f83cc1d860.rmeta --extern cached_path=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libcached_path-3ec46145dd1d130e.rmeta --extern clap=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libclap-b6f996e8d27659fd.rmeta --extern derive_builder=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libderive_builder-9b96e240da4197d9.rmeta --extern dirs=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libdirs-5779e67b580d982d.rmeta --extern esaxx_rs=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libesaxx_rs-a4589fe58879f69e.rmeta --extern getrandom=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libgetrandom-a1725e4f12011643.rmeta --extern 
indicatif=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libindicatif-5a25b637c223a512.rmeta --extern itertools=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libitertools-25bf26bb9d7012e3.rmeta --extern lazy_static=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/liblazy_static-efe629d64d1e110a.rmeta --extern log=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/liblog-aefa68b3bb6aa74b.rmeta --extern macro_rules_attribute=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libmacro_rules_attribute-905f7969e6855dc7.rmeta --extern monostate=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libmonostate-979232758d229ae8.rmeta --extern onig=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libonig-d57fa18c6b270e69.rmeta --extern paste=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libpaste-dcd1fc4ea32404f5.so --extern rand=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/librand-14a9ba308db49e20.rmeta --extern rayon=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/librayon-bbce6394af2ecdb4.rmeta --extern rayon_cond=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/librayon_cond-1a99da87a6ad378d.rmeta --extern regex=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libregex-abe854aac7680929.rmeta --extern regex_syntax=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libregex_syntax-bf2c82fdea1a20c9.rmeta --extern reqwest=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libreqwest-08741808824a1069.rmeta 
--extern serde=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libserde-48f9ebe75a8f3233.rmeta --extern serde_json=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libserde_json-722fc36ce3c5e169.rmeta --extern spm_precompiled=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libspm_precompiled-4b735c268352039e.rmeta --extern thiserror=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libthiserror-0a045910d95e7f7c.rmeta --extern unicode_normalization_alignments=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libunicode_normalization_alignments-72a662c4885161d8.rmeta --extern unicode_segmentation=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libunicode_segmentation-d45cbfa0bdea00fb.rmeta --extern unicode_categories=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/deps/libunicode_categories-c386831dea5a0d6e.rmeta -L native=/usr/lib -L native=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/build/zstd-sys-5958720fa03c9e44/out -L native=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/build/esaxx-rs-cd4e20ef7e068fc7/out -L native=/tmp/pip-install-_9lczfk8/tokenizers_a626b57540ed48ed8ef6ce337e9f06c5/target/release/build/onig_sys-2153c850ad2e752d/out` (exit status: 1)
      error: `cargo rustc --lib --message-format=json-render-diagnostics --manifest-path Cargo.toml --release -v --features pyo3/extension-module --crate-type cdylib --` failed with code 101
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

Arondight commented 1 month ago

Oh, I see this fix in the latest code.

robert-irelan-tiktokusds commented 1 month ago

For future reference, I think this was fixed in commit 4322056e6e434e4b49dc1d02dac3a51ccf6bcf21