huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

bc3ec39d breaks the compilation (as noted in #1355) #1359

Closed baptisterajaut closed 7 months ago

baptisterajaut commented 12 months ago

As stated, this commit breaks building tokenizers on modern toolchains, even stable:

error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
         --> tokenizers-lib/src/models/bpe/trainer.rs:526:47
          |
      522 |                     let w = &words[*i] as *const _ as *mut _;
          |                             -------------------------------- casting happend here
      ...
      526 |                         let word: &mut Word = &mut (*w);
          |                                               ^^^^^^^^^
          |

% rustc -V
rustc 1.73.0 (cc66ad468 2023-10-03)
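For readers who have not met this lint before, the rejected pattern can be reduced to a few lines. The sketch below is illustrative only: `Word`, `merge`, and `merge_at` are hypothetical stand-ins, not the real types or functions in trainer.rs. The rejected cast appears only in a comment; the live code takes the mutable borrow from a mutable slice, which needs no pointer casts at all.

```rust
// Hypothetical stand-in for the trainer's Word type (not the real definition).
#[derive(Debug, PartialEq)]
struct Word {
    symbols: Vec<u32>,
}

impl Word {
    // Hypothetical mutation: replace each adjacent (a, b) pair with `new_id`.
    fn merge(&mut self, a: u32, b: u32, new_id: u32) {
        let mut out = Vec::with_capacity(self.symbols.len());
        let mut i = 0;
        while i < self.symbols.len() {
            if i + 1 < self.symbols.len() && self.symbols[i] == a && self.symbols[i + 1] == b {
                out.push(new_id);
                i += 2;
            } else {
                out.push(self.symbols[i]);
                i += 1;
            }
        }
        self.symbols = out;
    }
}

// Rejected by the deny-by-default `invalid_reference_casting` lint
// (reported in this thread with rustc 1.73.0):
//     let w = &words[i] as *const _ as *mut _;
//     let word: &mut Word = unsafe { &mut *w };
//
// Accepted: borrow mutably from a mutable slice instead.
fn merge_at(words: &mut [Word], positions: &[usize], a: u32, b: u32, new_id: u32) {
    for &i in positions {
        words[i].merge(a, b, new_id);
    }
}

fn main() {
    let mut words = vec![
        Word { symbols: vec![1, 2, 3] },
        Word { symbols: vec![4, 1, 2] },
    ];
    merge_at(&mut words, &[0, 1], 1, 2, 9);
    assert_eq!(words[0].symbols, vec![9, 3]);
    assert_eq!(words[1].symbols, vec![4, 9]);
}
```

This sequential version sidesteps the question of parallel access. Code that mutates entries from several threads would need a different structure, which this sketch does not attempt to show.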

adwaraki commented 11 months ago

I can't install Tokenizers either. It gets installed as a dependency of the Allen-NLP package, and the new version of the Rust compiler breaks the build.

Installing Rust from the Rust site with their shell script installs 1.73.0, I presume, and that breaks the Tokenizers compilation. Installing it via Homebrew installs 1.72.1, which works.

Narsil commented 11 months ago

Which version are you using?

This was already fixed on main and in 0.14.1.

https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/bpe/trainer.rs#L541-L546

Songcheng-Xie commented 11 months ago

To get around this error, I installed transformers with conda, using the command 'conda install -c huggingface transformers'. Then it works.

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

DavidAdamczyk commented 10 months ago

I have the same problem with Python 3.11. Do you need more information about this issue?

Narsil commented 10 months ago

@DavidAdamczyk Use a more recent tokenizers version, or an older Rust compiler version.

DavidAdamczyk commented 9 months ago

I use the latest version of tokenizers and the most recent stable version of the Rust compiler. Additionally, I follow the installation instructions available here. Could someone update the installation instructions and include information about the supported versions of all dependencies?

Mr-AniP commented 9 months ago

Hi, I hit the same error while trying to install transformers v4.6.1 on a Pyng z2 board (v2.5 {arm7l}) with Rust v1.74.1.

Edit: The strategy to solve this error is to use an older Rust version. What I did:

1) Install Rust v1.72.1:
   rustup default 1.72.1
2) Remove Rust stable, or set an environment variable to make sure that compilation does not use Rust stable:
   rustup toolchain remove stable
   or
   export RUSTUP_TOOLCHAIN=1.72.1

After this, it should work properly.

davehorner commented 8 months ago

pip3 install transformers==4.15.0 timm==0.4.12 fairscale==0.4.4

  error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
     --> tokenizers-lib\src\models\bpe\trainer.rs:517:47
      |
  513 |                     let w = &words[*i] as *const _ as *mut _;
      |                             -------------------------------- casting happend here
  ...
  517 |                         let word: &mut Word = &mut (*w);
      |                                               ^^^^^^^^^
      |
      = note: for more information, visit <https://doc.rust-lang.org/book/ch15-05-interior-mutability.html>
      = note: `#[deny(invalid_reference_casting)]` on by default

running into this tonight too.

Requirement already satisfied: requests in c:\users\dhorner\anaconda3\envs\hotz\lib\site-packages (from transformers==4.15.0->-r requirements.txt (line 2)) (2.31.0)
Collecting sacremoses (from transformers==4.15.0->-r requirements.txt (line 2))
  Using cached sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting tokenizers<0.11,>=0.10.1 (from transformers==4.15.0->-r requirements.txt (line 2))
  Using cached tokenizers-0.10.3.tar.gz (212 kB)

The solution for me was to set RUSTFLAGS=-A invalid_reference_casting. That worked for me with Rust 1.75.0.

athewsey commented 7 months ago

Also ran into this issue last week, installing transformers==4.22.1 pinned by a different project. tokenizers resolved to v0.12.1. Platform was macOS Sonoma, M2 chip.

I also worked around it by running:

export RUSTFLAGS="-A invalid_reference_casting"

...before installing, but it'd be great if the problem could be tackled at source!

davehorner commented 7 months ago

I would love to be the one to help resolve this beyond an environment flag.

tokenizers-lib/src/models/bpe/trainer.rs:526

I do not see tokenizers-lib in the tree. rg "let w = &words[*i] as *const _ as *mut _;" finds nothing.

The error guidance is not clear. GPT says: This error message indicates that you're attempting to cast a shared reference (&T) into a mutable reference (&mut T), which is considered undefined behavior in Rust, even if the mutable reference is not actually used. Rust's safety guarantees rely on preventing such unsound operations.

To resolve this issue, you should use appropriate safe patterns for mutable access, such as Cell, RefCell, or UnsafeCell for interior mutability, depending on your specific use case.

In your case, since you're dealing with mutable access to data through raw pointers, you should consider using UnsafeCell. Here's how you can adjust your code:

use std::cell::UnsafeCell;

// Assuming Word is some struct or type you're working with
struct Word {
    // fields of Word
}

// Assuming words is some collection of Word, each stored in an UnsafeCell
// from the start (casting an existing &Word to an UnsafeCell pointer is still UB)
let words: Vec<UnsafeCell<Word>> = /* initialization of words */;

// Assuming i is some index into the words vector
let i = /* index */;

// Accessing the word at index i in a mutable way
let word: &UnsafeCell<Word> = &words[i];
// Sound only while no other reference to this Word is alive
let word_mut: &mut Word = unsafe { &mut *word.get() };

However, using UnsafeCell requires careful handling as it bypasses Rust's safety checks. Make sure you understand the implications of using UnsafeCell and ensure that your code is correct and safe.

Alternatively, consider restructuring your code to avoid mutable raw pointer access if possible, as raw pointer manipulation can be error-prone and harder to reason about compared to safe Rust constructs.
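One point the advice above leaves implicit: interior mutability only helps when the value lives inside the cell from the start. Reinterpreting an existing `&Word` as a pointer to a cell does not make the mutation sound. As a minimal sketch of the safe route, the example below stores each entry in a `RefCell`, which checks borrows at run time. `Word` and `push_symbol` are hypothetical names for illustration, not the tokenizers types.

```rust
use std::cell::RefCell;

// Hypothetical stand-in for the trainer's Word type.
struct Word {
    symbols: Vec<u32>,
}

// Mutates one entry through a shared slice. The RefCell panics on
// conflicting borrows instead of allowing undefined behavior.
fn push_symbol(words: &[RefCell<Word>], i: usize, symbol: u32) {
    words[i].borrow_mut().symbols.push(symbol);
}

fn main() {
    let words = vec![
        RefCell::new(Word { symbols: vec![1] }),
        RefCell::new(Word { symbols: vec![2] }),
    ];

    push_symbol(&words, 1, 7);
    assert_eq!(words[1].borrow().symbols, vec![2, 7]);

    // While a shared borrow is alive, a mutable borrow is refused.
    let guard = words[0].borrow();
    assert!(words[0].try_borrow_mut().is_err());
    drop(guard);
    assert!(words[0].try_borrow_mut().is_ok());
}
```

`RefCell` is single-threaded (it is not `Sync`), so this sketch does not carry over directly to code that mutates entries from several threads. There, a `Mutex` per entry or handing out disjoint `&mut` borrows would be the usual starting points.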

so Rustonomicon.

Can someone point me to where the code is? I don't know where it lives.

ArthurZucker commented 7 months ago

I'll close this, as I believe the latest releases no longer have this issue.