huggingface / optimum-nvidia


Error for gated model access despite valid HF_TOKEN #142

Open laikhtewari opened 2 weeks ago

I am trying to run a gated model through the pipeline API, but I get a gated model access error despite having the HF_TOKEN environment variable set.

>>> from optimum.nvidia.pipelines import pipeline as optimum_pipeline
>>> fast_pipe = optimum_pipeline('text-generation', 'meta-llama/Llama-2-70b-chat-hf', tp=2, use_fp8=True)
[...]
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/pipelines/__init__.py", line 93, in pipeline
    raise RuntimeError(
RuntimeError: Failed to instantiate the pipeline inferring the task for model meta-llama/Llama-2-70b-chat-hf: 401 Client Error. (Request ID: Root=1-66859d97-068675f635fb5d9f4e0b36b6;10764411-e918-4f91-8a97-ca114e65ea79)

Cannot access gated repo for url https://huggingface.co/api/models/meta-llama/Llama-2-70b-chat-hf.
Access to model meta-llama/Llama-2-70b-chat-hf is restricted. You must be authenticated to access it.
>>> import os
>>> os.environ["HF_TOKEN"]
'***' # this is my access token that has access to the model
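As a sanity check, it might be worth logging in explicitly rather than relying on the env var being read implicitly. A minimal sketch using huggingface_hub (login and whoami are standard hub APIs; whether optimum-nvidia picks up the persisted token is exactly what seems broken here):

import os
from huggingface_hub import login, whoami

# Authenticate explicitly instead of relying on HF_TOKEN being read implicitly;
# login() stores the token so subsequent hub requests send it.
login(token=os.environ["HF_TOKEN"])

# Confirm the hub recognizes the token before building the pipeline.
print(whoami()["name"])  # prints the account name tied to the token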

Running the same model through the transformers pipeline succeeds in downloading the checkpoint, so the token itself appears to be valid.
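For reference, roughly the transformers call that succeeds (a minimal sketch, assuming the same model id and the same HF_TOKEN in the environment):

from transformers import pipeline

# The vanilla transformers pipeline resolves the same gated repo and
# downloads the checkpoint using HF_TOKEN from the environment.
slow_pipe = pipeline("text-generation", "meta-llama/Llama-2-70b-chat-hf")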