I am trying to run a gated model through the pipeline API, but it fails with a gated-repo access error even though the HF_TOKEN environment variable is set.
>>> from optimum.nvidia.pipelines import pipeline as optimum_pipeline
>>> fast_pipe = optimum_pipeline('text-generation', 'meta-llama/Llama-2-70b-chat-hf', tp=2, use_fp8=True)
[...]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/pipelines/__init__.py", line 93, in pipeline
raise RuntimeError(
RuntimeError: Failed to instantiate the pipeline inferring the task for model meta-llama/Llama-2-70b-chat-hf: 401 Client Error. (Request ID: Root=1-66859d97-068675f635fb5d9f4e0b36b6;10764411-e918-4f91-8a97-ca114e65ea79)
Cannot access gated repo for url https://huggingface.co/api/models/meta-llama/Llama-2-70b-chat-hf.
Access to model meta-llama/Llama-2-70b-chat-hf is restricted. You must be authenticated to access it.
>>> import os
>>> os.environ["HF_TOKEN"]
'***' # this is my access token that has access to the model
Running the same model with the transformers pipeline succeeds in downloading the checkpoint, so the token itself is valid and has access.
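As a quick sanity check, the snippet below resolves the token the same way huggingface_hub does from the environment (HF_TOKEN, then the legacy HUGGING_FACE_HUB_TOKEN name), which helps confirm which value downstream code would actually see. The lookup order is an assumption based on current huggingface_hub behavior, and the login() alternative in the comment is only a suggested workaround, not something I have verified fixes the optimum-nvidia path:

```python
import os
from typing import Optional

def resolve_hf_token() -> Optional[str]:
    """Return the HF token visible to this process, preferring HF_TOKEN
    over the legacy HUGGING_FACE_HUB_TOKEN variable (assumed order)."""
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        token = os.environ.get(name)
        if token:
            return token
    return None

# Possible workaround (untested here): register the token explicitly so
# all hub calls pick it up, instead of relying on the env var:
#   from huggingface_hub import login
#   login(token=resolve_hf_token())
```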