Hi @KaifAhmad1, thanks for raising this issue!
Hm, that's weird. I'm able to run the snippet without issue after getting access.
In what environment are you running this code e.g. python session, jupyter notebook?
For Python sessions, I'd recommend logging in through the CLI first using `huggingface-cli login` to make sure your token is available in your environment (you shouldn't need to pass it in with `use_auth_token`), or logging in within the session with:

```python
from huggingface_hub import login
login()
```

In a Jupyter notebook you can try:

```python
from huggingface_hub import notebook_login
notebook_login()
```
Let me know if any of these helped or if there's still an issue.
Hey @amyeroberts @younesbelkada, after running this script I am now getting another exception.
I'm using the latest versions of bitsandbytes and accelerate but still hit this exception:
bitsandbytes = 0.42.0, accelerate = 0.27.2
```python
!pip install -qU transformers
!pip install -qU langchain
!pip install -qU huggingface_hub
!pip install -qU tiktoken
!pip install -qU neo4j
!pip install -qU python-dotenv
!pip install -qU sentence_transformers
!pip install -qU optimum
!pip install -qU unstructured unstructured[pdf]
!pip install -qU bitsandbytes
!pip install -qU accelerate
```
```python
import torch
from torch import cuda, bfloat16
import transformers

model_id = 'google/gemma-7b'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)

# BnB Configuration
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16,
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=model_config,
    device_map='auto',
    attn_implementation="flash_attention_2",
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)
```
```
ImportError                               Traceback (most recent call last)
2 frames
/usr/local/lib/python3.10/dist-packages/transformers/quantizers/quantizer_bnb_4bit.py in validate_environment(self, *args, **kwargs)
     60 def validate_environment(self, *args, **kwargs):
     61     if not (is_accelerate_available() and is_bitsandbytes_available()):
---> 62         raise ImportError(
     63             "Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` "
     64             "and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes`"

ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes`

NOTE: If your import is failing due to a missing package, you can manually install dependencies using either !pip or !apt.
```
Hi @KaifAhmad1,
Huh, that's funny. The code being run is for 4bit, so it's weird the error is about 8bit quantization. Two questions:
Hi @amyeroberts, alright, the error with the flash attention attribute is fixed. Closing the issue now.
Thanks!
Hey @amyeroberts @younesbelkada, now I'm getting this error.
flash-attn = 2.5.5, transformers = 4.38.1
```python
# Set up text generation pipeline
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    task='text-generation',
    stopping_criteria=stopping_criteria,
    temperature=0.3,
    max_new_tokens=512,
    repetition_penalty=1.1,
)

result = generate_text("What are the primary mechanisms underlying antibiotic resistance, and how can we develop strategies to combat it?")
print(result)
```
```
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:410: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.3` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-19-cab67dc592cd> in <cell line: 1>()
----> 1 result = generate_text("What are the primary mechanisms underlying antibiotic resistance, and how can we develop strategies to combat it?")
      2 print(result)

28 frames
/usr/local/lib/python3.10/dist-packages/flash_attn/flash_attn_interface.py in _flash_attn_forward(q, k, v, dropout_p, softmax_scale, causal, window_size, alibi_slopes, return_softmax)
     49 maybe_contiguous = lambda x: x.contiguous() if x.stride(-1) != 1 else x
     50 q, k, v = [maybe_contiguous(x) for x in (q, k, v)]
---> 51 out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
     52     q,
     53     k,

RuntimeError: FlashAttention only supports Ampere GPUs or newer.
```
Hi @KaifAhmad1! FlashAttention only supports Ampere GPUs (A10, A100, etc.) or newer. What GPU are you using?
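If you're not sure, a quick check you can run (a small sketch; Ampere and newer cards report a CUDA compute capability of 8.0 or higher):

```python
import torch

# Print the GPU name and its CUDA compute capability.
# FlashAttention 2 needs Ampere or newer, i.e. compute capability >= (8, 0).
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # e.g. (7, 5) for a Tesla T4
```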
Hey @younesbelkada, I'm using a Tesla T4. Is there any other alternative you can suggest?
The Tesla T4 is unfortunately not supported by FlashAttention. Please consider using SDPA instead, by passing `attn_implementation="sdpa"` in `from_pretrained`, for more memory-efficient training or inference.
Hey @younesbelkada,
Is there any other inference optimization technique you can suggest for low GPU memory usage? I have tried `optimum` and BetterTransformer, but they don't support this model.
Thanks @KaifAhmad1!
For BetterTransformer it is not supported because BetterTransformer is SDPA itself - so both are the same :)
You can combine quantization + SDPA: `load_in_4bit=True` + `attn_implementation="sdpa"` - more optimizations are coming soon, e.g. https://github.com/huggingface/transformers/pull/29023
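For reference, a minimal sketch of that combination (not the exact code from this thread; it just reuses `google/gemma-7b` and the NF4 settings from your earlier snippet):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization combined with SDPA attention, which works on a T4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b",
    quantization_config=bnb_config,
    attn_implementation="sdpa",
    device_map="auto",
)
```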
Thanks @younesbelkada for helping me out.
Thanks @KaifAhmad1 !
Hey @younesbelkada, now I'm getting another error. torch = 2.1.0+cu121, transformers = 4.38.1
```python
# BnB Configuration
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16,
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=model_config,
    device_map='auto',
    attn_implementation="sdpa",
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)
```
```
model.safetensors.index.json: 100% 20.9k/20.9k [00:00<00:00, 1.03MB/s]
Downloading shards: 100% 4/4 [02:35<00:00, 35.31s/it]
model-00001-of-00004.safetensors: 100% 5.00G/5.00G [00:47<00:00, 77.9MB/s]
model-00002-of-00004.safetensors: 100% 4.98G/4.98G [00:46<00:00, 198MB/s]
model-00003-of-00004.safetensors: 100% 4.98G/4.98G [00:37<00:00, 52.8MB/s]
model-00004-of-00004.safetensors: 100% 2.11G/2.11G [00:23<00:00, 63.7MB/s]
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-8-2ca992991bb4> in <cell line: 1>()
----> 1 model = transformers.AutoModelForCausalLM.from_pretrained(
      2     model_id,
      3     config=model_config,
      4     device_map='auto',
      5     attn_implementation="sdpa",

3 frames
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py in _check_and_enable_sdpa(cls, config, hard_check_only)
   1529     )
   1530 if not is_torch_sdpa_available():
-> 1531     raise ImportError(
   1532         "PyTorch SDPA requirements in Transformers are not met. Please install torch>=2.1.1."
   1533     )

ImportError: PyTorch SDPA requirements in Transformers are not met. Please install torch>=2.1.1.
```
PyTorch SDPA requirements in Transformers are not met. Please install `torch>=2.1.1` if you want to use SDPA :)
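On Colab/Jupyter that would be something like the following (a sketch; restart the runtime afterwards so the upgraded torch is actually picked up):

```python
# Upgrade torch in the notebook, then restart the runtime before re-running the loading cell.
!pip install -qU "torch>=2.1.1"
```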
Hi @amyeroberts, alright, the error with the flash attention attribute is fixed. Closing the issue now.
Thanks!
Hi, I'm facing the same issue. Please let me know how you solved it. Thanks in advance!
What's wrong in my code? I'm not getting where to place my token.

Code:

```python
origin_model_path = "mistralai/Mistral-7B-Instruct-v0.1"
model_path = "filipealmeida/Mistral-7B-Instruct-v0.1-sharded"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained(origin_model_path, token="
```
Error:

```
OSError: You are trying to access a gated repo. Make sure to have access to it at https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1. 403 Client Error. (Request ID: Root=1-6639cc90-7c11e22d3241ff0d5ed97f20;03dae708-49e4-4e2d-8619-4c820cfa51c0)
Cannot access gated repo for url https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1/resolve/main/config.json. Access to model mistralai/Mistral-7B-Instruct-v0.1 is restricted and you are not in the authorized list. Visit https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 to ask for access.
```
Hi @sona-16 - please see this guide on how to authenticate when using the Hub: https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication
You can also pass the token directly in the `from_pretrained` call: https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.token
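For example, a minimal sketch (the `hf_xxx` token is a placeholder for your own access token, and you still need to have been granted access to the gated repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # the gated repo you were granted access to
hf_token = "hf_xxx"                              # placeholder: your personal Hugging Face access token

tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_id, token=hf_token)
```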
```python
from huggingface_hub import login
login('your_token_key_here')
```

This fixed the error for me!
@amyeroberts @KaifAhmad1 @ArthurZucker I get the error for Llama 3 in my Jupyter notebook even though I can successfully log in, either with the CLI command `!huggingface-cli login --token "MyToken"` or with:

```python
from huggingface_hub import notebook_login
notebook_login()
```

The error is:

```
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like meta-llama/Meta-Llama-3-8B is not the path to a directory containing a file named config.json. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'
```

for this code:

```python
model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    num_labels=3,
    device_map='auto',
)
```
@SaraAmd Are you able to load other models, other than `meta-llama/Meta-Llama-3-8B`?
> You can also pass the token directly in the `from_pretrained` call: https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.token

This is not working; you still get the "not authorised" response. What worked is, as mentioned above:

```python
from huggingface_hub import login
login('hf_SECRET')
```
Hi, I have run into some problems when trying to use the Llama model from HF. The error is:

```
OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Meta-Llama-3-8B.
403 Client Error. (Request ID: Root=1-66adafd2-55928817165ad3fe73c38472;1b39ba82-98e0-4524-835c-e63c5009fb2b)
Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3-8B/resolve/main/config.json.
Access to model meta-llama/Meta-Llama-3-8B is restricted and you are not in the authorized list. Visit https://huggingface.co/meta-llama/Meta-Llama-3-8B to ask for access.
```

I have already imported `login` from `huggingface_hub` and logged in successfully using my token. This is my code:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import login

model_path = "meta-llama/Meta-Llama-3-8B"
login(token="my token")

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    use_auth_token=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    use_auth_token=True,
)
print("success")
```
How can I fix this bug?
> you are not in the authorized list

Are you sure you have access to it? 🤗

> Are you sure you have access to it? 🤗

🙏🙏

Are you perhaps in China / using a firewall?

> Are you perhaps in China / using a firewall?

Yes, but I use a new network node to download the model, which is not restricted by the firewall...
cc @Wauplin - sorry, I forgot what the usual solution for this is!
@Killerofthecard, have you set your proxy as environment variables? (like this).
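For example, a sketch of setting them from Python (the proxy address below is just a placeholder):

```python
import os

# Placeholder proxy address -- replace with your actual proxy endpoint.
os.environ["HTTP_PROXY"] = "http://127.0.0.1:7890"
os.environ["HTTPS_PROXY"] = "http://127.0.0.1:7890"
```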
Also, are you able to download a model that is non-gated? For example, BAAI/bge-reranker-v2-m3:
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-v2-m3")
```
(I'm asking to check whether the problem is really about authentication or not.)
Model description
I have submitted an access request through Hugging Face and was granted access, but I am not able to run the model for inference.
```
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py:1096: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(

HTTPError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
    285 try:
--> 286     response.raise_for_status()
    287 except HTTPError as e:

14 frames
HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/google/gemma-7b/resolve/main/config.json

The above exception was the direct cause of the following exception:

GatedRepoError                            Traceback (most recent call last)
GatedRepoError: 403 Client Error. (Request ID: Root=1-65d60dc7-2ab7a6ca2c4e9a5a5719a779;7cd21b46-4ebb-4ad6-b147-4eb110a4f7e0)
Cannot access gated repo for url https://huggingface.co/google/gemma-7b/resolve/main/config.json. Access to model google/gemma-7b is restricted and you are not in the authorized list. Visit https://huggingface.co/google/gemma-7b to ask for access.

The above exception was the direct cause of the following exception:

OSError                                   Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py in cached_file(path_or_repo_id, filename, cache_dir, force_download, resume_download, proxies, token, revision, local_files_only, subfolder, repo_type, user_agent, _raise_exceptions_for_gated_repo, _raise_exceptions_for_missing_entries, _raise_exceptions_for_connection_errors, _commit_hash, **deprecated_kwargs)
    414 if resolved_file is not None or not _raise_exceptions_for_gated_repo:
    415     return resolved_file
--> 416 raise EnvironmentError(
    417     "You are trying to access a gated repo.\nMake sure to have access to it at "
    418     f"https://huggingface.co/{path_or_repo_id}.\n{str(e)}"

OSError: You are trying to access a gated repo. Make sure to have access to it at https://huggingface.co/google/gemma-7b. 403 Client Error. (Request ID: Root=1-65d60dc7-2ab7a6ca2c4e9a5a5719a779;7cd21b46-4ebb-4ad6-b147-4eb110a4f7e0)
Cannot access gated repo for url https://huggingface.co/google/gemma-7b/resolve/main/config.json. Access to model google/gemma-7b is restricted and you are not in the authorized list. Visit https://huggingface.co/google/gemma-7b to ask for access.
```