huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Cannot load tokenizer from_pretrained through http_proxy since 0.14.0 #1373

Closed jtsai-quid closed 8 months ago

jtsai-quid commented 11 months ago

Hi hf,

I encountered an issue where I couldn't load the tokenizer using from_pretrained via the http_proxy in version 0.14.0, while it worked successfully in version 0.13.3. This caused the fast tokenizer initialization issue in TGI 1.1.0. https://github.com/huggingface/text-generation-inference/issues/1108

Here is the code snippet I used for testing.

//# tokenizers = { version = "0.14.0", features = ["http"] }

use tokenizers::tokenizer::{Result, Tokenizer};
use tokenizers::FromPretrainedParameters;

fn main() -> Result<()> {
    // Optional Hub token, used for gated/private repositories.
    let authorization_token = std::env::var("HUGGING_FACE_HUB_TOKEN").ok();
    let params = FromPretrainedParameters {
        revision: "main".to_string(),
        auth_token: authorization_token,
        ..Default::default()
    };

    // This call downloads tokenizer.json from the Hub; behind the proxy it times out.
    let tokenizer = Tokenizer::from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ", Some(params))?;

    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());
    Ok(())
}

Error output

> http_proxy=http://squid:3128 https_proxy=http://squid:3128 cargo play run.rs
   Compiling p4u7iybabtwyzvxf2zdtkustjgod2 v0.1.0 (/tmp/cargo-play.4U7iybABTwyZVxF2ZDTKUstjgod2)
    Finished dev [unoptimized + debuginfo] target(s) in 3.14s
     Running `/tmp/cargo-play.4U7iybABTwyZVxF2ZDTKUstjgod2/target/debug/p4u7iybabtwyzvxf2zdtkustjgod2`
Error: RequestError(Transport(Transport { kind: Io, message: None, url: Some(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("huggingface.co")), port: None, path: "/TheBloke/Llama-2-13B-chat-GPTQ/resolve/main/tokenizer.json", query: None, fragment: None }), source: Some(Custom { kind: TimedOut, error: "timed out reading response" }) }))

I suspect this is related to the client refactoring here.

Thanks, and I appreciate any help!

ArthurZucker commented 11 months ago

Indeed. Could you try with the latest release? Otherwise I'll have a look at what I can do!

jtsai-quid commented 11 months ago

Just tried version 0.14.1 and the error still occurs. 😞

jtsai-quid commented 11 months ago

Hi @ArthurZucker, would this PR fix the issue? https://github.com/huggingface/hf-hub/pull/34

ArthurZucker commented 11 months ago

Ah! Yeah, most probably: we now use the hf-hub API to load files, so if the proxy is an issue there, it will affect us.

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

jtsai-quid commented 10 months ago

Hi @ArthurZucker, I noticed hf-hub has fixed this issue: https://github.com/huggingface/hf-hub/pull/34. Would it be possible to use the latest version of hf-hub in tokenizers? Thanks~

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.