Hi @KaifAhmad1, thanks for raising this issue!
Hm, that's weird. I'm able to run the snippet without issue after getting access.
In what environment are you running this code e.g. python session, jupyter notebook?
For Python sessions, I'd recommend logging in through the CLI first using `huggingface-cli login` to make sure your token is available in your environment (you shouldn't need to pass it in with `use_auth_token`), or logging in within the session with:

```python
from huggingface_hub import login
login()
```

In a Jupyter notebook you can try:

```python
from huggingface_hub import notebook_login
notebook_login()
```
Let me know if any of these helped or if there's still an issue.
Hey @amyeroberts @younesbelkada, after running this script I am now getting another exception.
I'm using the latest versions of bitsandbytes and accelerate but still hit this exception:
bitsandbytes = 0.42.0, accelerate = 0.27.2
```python
!pip install -qU transformers
!pip install -qU langchain
!pip install -qU huggingface_hub
!pip install -qU tiktoken
!pip install -qU neo4j
!pip install -qU python-dotenv
!pip install -qU sentence_transformers
!pip install -qU optimum
!pip install -qU unstructured unstructured[pdf]
!pip install -qU bitsandbytes
!pip install -qU accelerate
```
```python
import torch
from torch import cuda, bfloat16
import transformers

model_id = 'google/gemma-7b'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)

# BnB Configuration
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16,
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=model_config,
    device_map='auto',
    attn_implementation="flash_attention_2",
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)
```
```
ImportError                               Traceback (most recent call last)
2 frames
/usr/local/lib/python3.10/dist-packages/transformers/quantizers/quantizer_bnb_4bit.py in validate_environment(self, *args, **kwargs)
     60 def validate_environment(self, *args, **kwargs):
     61     if not (is_accelerate_available() and is_bitsandbytes_available()):
---> 62         raise ImportError(
     63             "Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` "
     64             "and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes`"

ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes`

NOTE: If your import is failing due to a missing package, you can manually install dependencies using either !pip or !apt.
```
Hi @KaifAhmad1,
Huh, that's funny. The code being run is for 4bit, so it's weird the error is about 8bit quantization. Two questions:
Hi @amyeroberts, alright, the error with the flash attention attribute is fixed. Closing the issue now.
Thanks!
Hey @amyeroberts @younesbelkada, now I'm getting this error.
flash-attn = 2.5.5, transformers = 4.38.1
```python
# Set up text generation pipeline
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    task='text-generation',
    stopping_criteria=stopping_criteria,
    temperature=0.3,
    max_new_tokens=512,
    repetition_penalty=1.1,
)

result = generate_text("What are the primary mechanisms underlying antibiotic resistance, and how can we develop strategies to combat it?")
print(result)
```
```
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:410: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.3` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-19-cab67dc592cd> in <cell line: 1>()
----> 1 result = generate_text("What are the primary mechanisms underlying antibiotic resistance, and how can we develop strategies to combat it?")
      2 print(result)

28 frames
/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py in _flash_attn_forward(q, k, v, dropout_p, softmax_scale, causal, window_size, alibi_slopes, return_softmax)
     49 maybe_contiguous = lambda x: x.contiguous() if x.stride(-1) != 1 else x
     50 q, k, v = [maybe_contiguous(x) for x in (q, k, v)]
---> 51 out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
     52     q,
     53     k,

RuntimeError: FlashAttention only supports Ampere GPUs or newer.
```
Hi @KaifAhmad1! FlashAttention only supports Ampere GPUs (A10, A100, etc.) or newer. What GPU are you using?
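If you're not sure, a quick check you can run (a small sketch; Ampere and newer cards report a CUDA compute capability of 8.0 or higher):

```python
import torch

# Print the GPU name and its CUDA compute capability.
# FlashAttention 2 needs Ampere or newer, i.e. compute capability >= (8, 0).
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # e.g. (7, 5) for a Tesla T4
```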
Hey @younesbelkada, I'm using a Tesla T4. Is there any other alternative you can suggest?
The Tesla T4 is unfortunately not supported by FlashAttention. Please consider using SDPA instead, by passing `attn_implementation="sdpa"` in `from_pretrained`, for more memory-efficient training or inference.
Hey @younesbelkada,
Is there any other inference optimization technique you can suggest for low GPU memory usage? I have tried `optimum` and BetterTransformer, but they don't support this model.
Thanks @KaifAhmad1!
For BetterTransformer it is not supported because BetterTransformer is SDPA itself - so both are the same :)
You can combine quantization + SDPA: `load_in_4bit=True` + `attn_implementation="sdpa"` - more optimizations are coming soon, e.g. https://github.com/huggingface/transformers/pull/29023
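For reference, a minimal sketch of that combination (not the exact code from this thread; it just reuses `google/gemma-7b` and the NF4 settings from your earlier snippet):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization combined with SDPA attention, which works on a T4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",
    quantization_config=bnb_config,
    attn_implementation="sdpa",
    device_map="auto",
)
```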
Thanks @younesbelkada for helping me out.
Thanks @KaifAhmad1 !
Hey @younesbelkada, now I'm getting another error. torch = 2.1.0+cu121, transformers = 4.38.1
```python
# BnB Configuration
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16,
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=model_config,
    device_map='auto',
    attn_implementation="sdpa",
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)
```
```
model.safetensors.index.json: 100% 20.9k/20.9k [00:00<00:00, 1.03MB/s]
Downloading shards: 100% 4/4 [02:35<00:00, 35.31s/it]
model-00001-of-00004.safetensors: 100% 5.00G/5.00G [00:47<00:00, 77.9MB/s]
model-00002-of-00004.safetensors: 100% 4.98G/4.98G [00:46<00:00, 198MB/s]
model-00003-of-00004.safetensors: 100% 4.98G/4.98G [00:37<00:00, 52.8MB/s]
model-00004-of-00004.safetensors: 100% 2.11G/2.11G [00:23<00:00, 63.7MB/s]
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-8-2ca992991bb4> in <cell line: 1>()
----> 1 model = transformers.AutoModelForCausalLM.from_pretrained(
      2     model_id,
      3     config=model_config,
      4     device_map='auto',
      5     attn_implementation="sdpa",

3 frames
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py in _check_and_enable_sdpa(cls, config, hard_check_only)
   1529     )
   1530 if not is_torch_sdpa_available():
-> 1531     raise ImportError(
   1532         "PyTorch SDPA requirements in Transformers are not met. Please install torch>=2.1.1."
   1533     )

ImportError: PyTorch SDPA requirements in Transformers are not met. Please install torch>=2.1.1.
```
PyTorch SDPA requirements in Transformers are not met. Please install `torch>=2.1.1` if you want to use SDPA :)
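On Colab/Jupyter that would be something like the following (a sketch; restart the runtime afterwards so the upgraded torch is actually picked up):

```python
# Upgrade torch in the notebook, then restart the runtime before re-running the loading cell.
!pip install -qU "torch>=2.1.1"
```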
Hi @amyeroberts, alright, the error with the flash attention attribute is fixed. Closing the issue now.
Thanks!
Hi, I'm facing the same issue. Please let me know how you solved it. Thanks in advance!
What's wrong in my code? I'm not getting where to place my token.

Code:

```python
origin_model_path = "mistralai/Mistral-7B-Instruct-v0.1"
model_path = "filipealmeida/Mistral-7B-Instruct-v0.1-sharded"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained(origin_model_path, token="
```
Error:

```
OSError: You are trying to access a gated repo. Make sure to have access to it at https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1. 403 Client Error. (Request ID: Root=1-6639cc90-7c11e22d3241ff0d5ed97f20;03dae708-49e4-4e2d-8619-4c820cfa51c0)
Cannot access gated repo for url https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/resolve/main/config.json. Access to model mistralai/Mistral-7B-Instruct-v0.1 is restricted and you are not in the authorized list. Visit https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 to ask for access.
```
Hi @sona-16 - please see this guide on how to authenticate when using the Hub: https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication
You can also pass the token directly in the `from_pretrained` call: https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.token
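For example, a minimal sketch (the `hf_xxx` token is a placeholder for your own access token, and you still need to have been granted access to the gated repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # the gated repo you were granted access to
hf_token = "hf_xxx"                              # placeholder: your personal Hugging Face access token

tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_id, token=hf_token)
```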
```python
from huggingface_hub import login
login('your_token_key_here')
```

This fixed the error for me!
@amyeroberts @KaifAhmad1 @ArthurZucker I get the error for Llama 3 in my Jupyter notebook even though I can successfully log in, either with the CLI command `!huggingface-cli login --token "MyToken"` or with:

```python
from huggingface_hub import notebook_login
notebook_login()
```

The error is:

```
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like meta-llama/Meta-Llama-3-8B is not the path to a directory containing a file named config.json. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'
```

for this code:

```python
model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    num_labels=3,
    device_map='auto',
)
```
@SaraAmd Are you able to load other models, other than `meta-llama/Meta-Llama-3-8B`?
> You can also pass the token directly in the `from_pretrained` call: https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.token

This is not working; you still get the "not authorised" response. What worked is, as mentioned above:

```python
from huggingface_hub import login
login('hf_SECRET')
```
Hi, I have run into some problems when trying to use the Llama model from HF. The error is:

```
OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Meta-Llama-3-8B.
403 Client Error. (Request ID: Root=1-66adafd2-55928817165ad3fe73c38472;1b39ba82-98e0-4524-835c-e63c5009fb2b)
Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3-8B/resolve/main/config.json.
Access to model meta-llama/Meta-Llama-3-8B is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Meta-Llama-3-8B to ask for access.
```

I have already imported `login` from `huggingface_hub` and logged in successfully using my token. This is my code:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login

model_path = "meta-llama/Meta-Llama-3-8B"
login(token="my token")

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    use_auth_token=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    use_auth_token=True,
)
print("success")
```
How can I fix this bug?
> you are not in the authorized list

Are you sure you have access to it? 🤗

> Are you sure you have access to it? 🤗

🙏🙏

Are you perhaps in China / using a firewall?

> Are you perhaps in China / using a firewall?

Yes, but I use a new network node to download the model, which is not restricted by the firewall...
cc @Wauplin - sorry, I forgot what the usual solution for this is!
@Killerofthecard, have you set your proxy as environment variables? (like this).
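For example, a sketch of setting them from Python (the proxy address below is just a placeholder):

```python
import os

# Placeholder proxy address -- replace with your actual proxy endpoint.
os.environ["HTTP_PROXY"] = "http://127.0.0.1:7890"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:7890"
```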
Also, are you able to download a model that is non-gated? For example, BAAI/bge-reranker-v2-m3:
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-v2-m3")
```
(I'm asking to check whether the problem is really about authentication or not.)
Model description
I have submitted an access request through Hugging Face and was granted access, but I am not able to run the model for inference.
```
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py:1096: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(

HTTPError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
    285 try:
--> 286     response.raise_for_status()
    287 except HTTPError as e:

14 frames
HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/google/gemma-7b/resolve/main/config.json

The above exception was the direct cause of the following exception:

GatedRepoError                            Traceback (most recent call last)
GatedRepoError: 403 Client Error. (Request ID: Root=1-65d60dc7-2ab7a6ca2c4e9a5a5719a779;7cd21b46-4ebb-4ad6-b147-4eb110a4f7e0)
Cannot access gated repo for url https://huggingface.co/google/gemma-7b/resolve/main/config.json. Access to model google/gemma-7b is restricted and you are not in the authorized list. Visit https://huggingface.co/google/gemma-7b to ask for access.

The above exception was the direct cause of the following exception:

OSError                                   Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, subfolder, repo_type, user_agent, _raise_exceptions_for_gated_repo, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash, **deprecated_kwargs)
    414 if resolved_file is not None or not _raise_exceptions_for_gated_repo:
    415     return resolved_file
--> 416 raise EnvironmentError(
    417     "You are trying to access a gated repo.\nMake sure to have access to it at "
    418     f"https://huggingface.co/{path_or_repo_id}.\n{str(e)}"

OSError: You are trying to access a gated repo. Make sure to have access to it at https://huggingface.co/google/gemma-7b. 403 Client Error. (Request ID: Root=1-65d60dc7-2ab7a6ca2c4e9a5a5719a779;7cd21b46-4ebb-4ad6-b147-4eb110a4f7e0)
Cannot access gated repo for url https://huggingface.co/google/gemma-7b/resolve/main/config.json. Access to model google/gemma-7b is restricted and you are not in the authorized list. Visit https://huggingface.co/google/gemma-7b to ask for access.
```