huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

TGI fails with local LORA adapters #2253

Closed: p-davidk closed this issue 3 weeks ago

p-davidk commented 3 months ago

Information

Tasks

Reproduction

Error overview

I am using TGI 2.1.1 via a Docker container. When I try to run it with local LoRA adapters, the model fails to load. I am launching with the following command:

Launch command

docker run --gpus '"device=2,3"' --shm-size 1g -p 8000:80 -v /opt/:/data ghcr.io/huggingface/text-generation-inference:2.1.1 --model-id /data/Mixtral-8x7B-v0.1 --num-shard 2 --max-input-length 30000 --max-total-tokens 32000 --max-batch-total-tokens 1024000  --dtype bfloat16 --lora-adapters /data/pfizer2b-Mixtral8x7-07-16-24-1959-david-07-16-24-v1/checkpoint-1975,/data/pfizer2b-Mixtral8x7-07-16-24-1959-david-07-16-24-v1/checkpoint-1693

Error trace

When I do this, I see the following error trace:

2024-07-19T00:21:56.330730Z  INFO text_generation_launcher: Trying to load a Peft model. It might take a while without feedback
Error: DownloadError
2024-07-19T00:21:56.997596Z ERROR download: text_generation_launcher: Download encountered an error: 
Traceback (most recent call last):

  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 399, in cached_file
    resolved_file = hf_hub_download(

  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)

  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(

huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/opt/Mixtral-8x7B-v0.1/'. Use `repo_type` argument if needed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/peft.py", line 15, in download_and_unload_peft
    model = AutoPeftModelForCausalLM.from_pretrained(

  File "/opt/conda/lib/python3.10/site-packages/peft/auto.py", line 104, in from_pretrained
    base_model = target_class.from_pretrained(base_model_path, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
    resolved_config_file = cached_file(

  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 463, in cached_file
    raise EnvironmentError(

OSError: Incorrect path_or_model_id: '/opt/Mixtral-8x7B-v0.1/'. Please provide either the path to a local folder or the repo_id of a model on the Hub.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 226, in download_weights
    utils.download_and_unload_peft(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/peft.py", line 23, in download_and_unload_peft
    model = AutoPeftModelForSeq2SeqLM.from_pretrained(

  File "/opt/conda/lib/python3.10/site-packages/peft/auto.py", line 88, in from_pretrained
    raise ValueError(

ValueError: Expected target PEFT class: PeftModelForCausalLM, but you have asked for: PeftModelForSeq2SeqLM make sure that you are loading the correct model for your task type.

Additional info

It appears that there are two errors here: 1) TGI is trying to load my local adapter as a Hub repo, which fails; 2) TGI thinks one of the models is Seq2Seq instead of CausalLM. From the traceback, (2) looks like a fallback that is only triggered after (1): TGI tries AutoPeftModelForCausalLM first and only attempts AutoPeftModelForSeq2SeqLM after that fails.

Issue (2) also doesn't make sense on its own, because the configs of the LoRAs and the original model all show "task_type": "CAUSAL_LM". An example config from an adapter is below:

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "/opt/Mixtral-8x7B-v0.1/",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layers_pattern": null,
  "layers_to_transform": null,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 256,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "v_proj",
    "o_proj",
    "k_proj",
    "w3",
    "w2",
    "q_proj",
    "w1",
    "gate"
  ],
  "task_type": "CAUSAL_LM"
}

All configs have this same format since they are from different checkpoints of the same finetuned model.

Expected behavior

I expect this command to launch a model endpoint on port 8000 (the host port mapped in the launch command). I then expect to be able to switch between adapters with the "adapter_id" keyword argument in the text-generation Python client.

ErikKaum commented 2 months ago

Hi @p-davidk 👋

Thanks for reporting this. I think we unfortunately won't be able to jump in and debug this right now. If you find any more clues about what could be going on, please feel free to update us here in the issue.

I'll also tag @drbh since he probably knows this part better than I do 👍

imran3180 commented 2 months ago

Support for local LoRA adapters has not been released in TGI yet. It is being added in pull request #2193.

Once that change is released, you should be able to use it as described in the PR, e.g. LORA_ADAPTERS=predibase/dbpedia,myadapter=/path/to/dir/

or

--lora-adapters predibase/dbpedia,myadapter=/path/to/dir/
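
Applied to the launch command from the issue description, that would presumably look something like the following once the feature ships (untested sketch; ckpt1975 and ckpt1693 are arbitrary adapter names chosen here, and the image tag has to be a build that includes #2193):

docker run --gpus '"device=2,3"' --shm-size 1g -p 8000:80 -v /opt/:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /data/Mixtral-8x7B-v0.1 --num-shard 2 \
  --max-input-length 30000 --max-total-tokens 32000 --max-batch-total-tokens 1024000 \
  --dtype bfloat16 \
  --lora-adapters ckpt1975=/data/pfizer2b-Mixtral8x7-07-16-24-1959-david-07-16-24-v1/checkpoint-1975,ckpt1693=/data/pfizer2b-Mixtral8x7-07-16-24-1959-david-07-16-24-v1/checkpoint-1693

Each adapter could then be selected per request by its name (ckpt1975 or ckpt1693).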

imran3180 commented 2 months ago

Similar issue: #2143

ErikKaum commented 2 months ago

Thanks for adding the context @imran3180 💪

nbroad1881 commented 3 weeks ago

I tried this today using sha-9263817, which is newer than 2.3.0; it still didn't work. It failed with Repository Not Found for url: https://huggingface.co/api/models/data/phi3-adapter.

huggingface-cli download microsoft/Phi-3-mini-4k-instruct --local-dir phi3
huggingface-cli download grounded-ai/phi3-hallucination-judge --local-dir phi3-adapter

model=/data/phi3
adapter=/data/phi3-adapter
volume=$PWD

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:sha-9263817 --model-id $model --lora-adapters $adapter

Running on A100 80GB

nbroad1881 commented 3 weeks ago

To anyone arriving here looking for a solution, here is the proper way to use local LoRA adapters (each local adapter has to be given a name, in the form name=/path/to/adapter):

LORA_ADAPTERS=myadapter=/some/path/to/adapter,myadapter2=/another/path/to/adapter
curl 127.0.0.1:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
  "inputs": "Hello who are you?",
  "parameters": {
    "max_new_tokens": 40,
    "adapter_id": "myadapter"
  }
}'
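
To wire that into a Docker launch like the one in my earlier comment, the environment variable can presumably be passed with -e (untested sketch; paths and image tag reuse the ones above, and the host port is mapped to 3000 so the curl example works as written):

model=/data/phi3
adapter=/data/phi3-adapter
volume=$PWD

docker run --gpus all --shm-size 1g -p 3000:80 -v $volume:/data \
    -e LORA_ADAPTERS=myadapter=$adapter \
    ghcr.io/huggingface/text-generation-inference:sha-9263817 \
    --model-id $model

Requests can then pick an adapter with "adapter_id": "myadapter", as in the curl example above.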