h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
http://h2o.ai
Apache License 2.0

FileNotFoundError following Linux Docker GPU instructions #893

Closed. rahimnathwani closed this issue 1 year ago.

rahimnathwani commented 1 year ago

The Linux GPU instructions say:

An example of running h2oGPT via docker using AutoGPTQ (4-bit, so using less GPU memory) with the LLaMa2 7B model is:

mkdir -p $HOME/.cache
mkdir -p $HOME/save
export CUDA_VISIBLE_DEVICES=0
docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       -p 7860:7860 \
       --rm --init \
       --network host \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Llama-2-7b-Chat-GPTQ \
          --load_gptq="gptq_model-4bit-128g" \
          --use_safetensors=True \
          --prompt_type=llama2 \
          --save_dir='/workspace/save/' \
          --use_gpu_id=False \
          --score_model=None \
          --max_max_new_tokens=2048 \
          --max_new_tokens=1024

When I run this I get the following error:

WARNING: Published ports are discarded when using host network mode
Using Model thebloke/llama-2-7b-chat-gptq
Starting get_model: TheBloke/Llama-2-7b-Chat-GPTQ 
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/h2ogpt_conda/lib/python3.10/site-packages/transformers/utils/hub.py:374: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
Traceback (most recent call last):
  File "/workspace/generate.py", line 16, in <module>
    entrypoint_main()
  File "/workspace/generate.py", line 12, in entrypoint_main
    H2O_Fire(main)
  File "/workspace/src/utils.py", line 59, in H2O_Fire
    fire.Fire(component=component, command=args)
  File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/src/gen.py", line 1162, in main
    model0, tokenizer0, device = get_model(reward_type=False,
  File "/workspace/src/gen.py", line 1639, in get_model
    return get_hf_model(load_8bit=load_8bit,
  File "/workspace/src/gen.py", line 1820, in get_hf_model
    model = model_loader(
  File "/h2ogpt_conda/lib/python3.10/site-packages/auto_gptq/modeling/auto.py", line 108, in from_quantized
    return quant_func(
  File "/h2ogpt_conda/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 791, in from_quantized
    raise FileNotFoundError(f"Could not find model in {model_name_or_path}")
FileNotFoundError: Could not find model in TheBloke/Llama-2-7b-Chat-GPTQ

Does the docker container automatically download the relevant model, or am I meant to place it somewhere? I looked around the cache folder, but did not find a model weights file:

# ls -R ~/.cache/huggingface/hub/
/root/.cache/huggingface/hub/:
models--TheBloke--Llama-2-7b-Chat-GPTQ  version.txt

/root/.cache/huggingface/hub/models--TheBloke--Llama-2-7b-Chat-GPTQ:
blobs  refs  snapshots

/root/.cache/huggingface/hub/models--TheBloke--Llama-2-7b-Chat-GPTQ/blobs:
400e3de6ffc3884ec3c158a046f6a04da00ef3ca
470e93138611b5efe37f4dd512a3fa14aff4bdc7
67a2e09f9d8b5e85eca24e88aa5c0fb465bbafd6
9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
d85ba6cb6820b01226ef8bd40b46bb489041c6a8
fd41f9992eeb3e7737447750c7a43c1d941834e4

/root/.cache/huggingface/hub/models--TheBloke--Llama-2-7b-Chat-GPTQ/refs:
main

/root/.cache/huggingface/hub/models--TheBloke--Llama-2-7b-Chat-GPTQ/snapshots:
52f2f87caf57c6f42037f82b405e4f3bac3154d4

/root/.cache/huggingface/hub/models--TheBloke--Llama-2-7b-Chat-GPTQ/snapshots/52f2f87caf57c6f42037f82b405e4f3bac3154d4:
config.json           special_tokens_map.json  tokenizer.json
quantize_config.json  tokenizer_config.json    tokenizer.model
# du -sh ~/.cache/huggingface/hub/
2.3M    /root/.cache/huggingface/hub/

I also tried changing one of the parameters to --load_gptq=model as I saw that mentioned in a similar issue, but that didn't work either.

Thanks for reading this far.

Any ideas?
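One thing I considered but have not tried: pre-fetching the full model snapshot (weights included) on the host, so the cache mounted at /workspace/.cache already contains the safetensors file when the container starts. A minimal sketch, assuming the container resolves its Hugging Face cache under that mount:

# Untested sketch: download the whole TheBloke/Llama-2-7b-Chat-GPTQ snapshot,
# including the .safetensors weights, into the default HF hub cache on the host
# (~/.cache/huggingface/hub), which the docker command above mounts into the container.
pip install -U huggingface_hub
python3 -c "
from huggingface_hub import snapshot_download
path = snapshot_download('TheBloke/Llama-2-7b-Chat-GPTQ')
print('snapshot at:', path)
"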

pseudotensor commented 1 year ago

Try https://github.com/h2oai/h2ogpt/issues/886#issuecomment-1732707072

Closing, since this is most likely the same issue. Feel free to respond and I'll still notice.

rahimnathwani commented 1 year ago

Thanks for the fast response. I already tried using --load_gptq=model but it didn't work:

root@abs:~# rm -rf ~/.cache/
root@abs:~# rm -rf ~/save
root@abs:~# mkdir -p $HOME/.cache
mkdir -p $HOME/save
export CUDA_VISIBLE_DEVICES=0
docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       -p 7860:7860 \
       --rm --init \
       --network host \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Llama-2-7b-Chat-GPTQ \
          --load_gptq=model \
          --use_safetensors=True \
          --prompt_type=llama2 \
          --save_dir='/workspace/save/' \
          --use_gpu_id=False \
          --score_model=None \
          --max_max_new_tokens=2048 \
          --max_new_tokens=1024
WARNING: Published ports are discarded when using host network mode
Using Model thebloke/llama-2-7b-chat-gptq
Starting get_model: TheBloke/Llama-2-7b-Chat-GPTQ 
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
Downloading (…)lve/main/config.json: 100%|██████████| 789/789 [00:00<00:00, 5.17MB/s]
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 8.94MB/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 1.86MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.84M/1.84M [00:00<00:00, 4.38MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 411/411 [00:00<00:00, 4.45MB/s]
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/h2ogpt_conda/lib/python3.10/site-packages/transformers/utils/hub.py:374: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
Downloading (…)quantize_config.json: 100%|██████████| 188/188 [00:00<00:00, 2.36MB/s]
Traceback (most recent call last):
  File "/workspace/generate.py", line 16, in <module>
    entrypoint_main()
  File "/workspace/generate.py", line 12, in entrypoint_main
    H2O_Fire(main)
  File "/workspace/src/utils.py", line 59, in H2O_Fire
    fire.Fire(component=component, command=args)
  File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/src/gen.py", line 1162, in main
    model0, tokenizer0, device = get_model(reward_type=False,
  File "/workspace/src/gen.py", line 1639, in get_model
    return get_hf_model(load_8bit=load_8bit,
  File "/workspace/src/gen.py", line 1820, in get_hf_model
    model = model_loader(
  File "/h2ogpt_conda/lib/python3.10/site-packages/auto_gptq/modeling/auto.py", line 108, in from_quantized
    return quant_func(
  File "/h2ogpt_conda/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 791, in from_quantized
    raise FileNotFoundError(f"Could not find model in {model_name_or_path}")
FileNotFoundError: Could not find model in TheBloke/Llama-2-7b-Chat-GPTQ
root@abs:~# 
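For reference, one way to sanity-check which safetensors basename --load_gptq should name is to list the files in the Hub repo. A hedged sketch, assuming huggingface_hub is installed on the host and the repo's file layout has not changed since this thread:

# Print the .safetensors filenames in the repo; the basename (without extension)
# is what --load_gptq is expected to match.
python3 -c "
from huggingface_hub import list_repo_files
files = list_repo_files('TheBloke/Llama-2-7b-Chat-GPTQ')
print([f for f in files if f.endswith('.safetensors')])
"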
rahimnathwani commented 1 year ago

In the log above, I don't see the code ever trying to download the model.

In contrast, if I try the first suggested command in the Docker instructions (using --base_model=h2oai/h2ogpt-4096-llama2-7b-chat), then it downloads a safetensors file:

# docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       -p 7860:7860 \
       --rm --init \
       --network host \
       -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
       -v /etc/passwd:/etc/passwd:ro \
       -v /etc/group:/etc/group:ro \
       -u `id -u`:`id -g` \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=h2oai/h2ogpt-4096-llama2-7b-chat \
          --use_safetensors=True \
          --prompt_type=llama2 \
          --save_dir='/workspace/save/' \
          --use_gpu_id=False \
          --score_model=None \
          --max_max_new_tokens=2048 \
          --max_new_tokens=1024
WARNING: Published ports are discarded when using host network mode
Using Model h2oai/h2ogpt-4096-llama2-7b-chat
Starting get_model: h2oai/h2ogpt-4096-llama2-7b-chat 
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:479: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]
Downloading (…)of-00002.safetensors
pseudotensor commented 1 year ago

Ok, FYI if I do:

python generate.py --base_model=TheBloke/Llama-2-7B-chat-GPTQ --load_gptq=model --use_safetensors=True --prompt_type=llama2

it works:

Using Model thebloke/llama-2-7b-chat-gptq
Starting get_model: TheBloke/Llama-2-7B-chat-GPTQ 
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
device_map: {'': 0}
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/utils/hub.py:374: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
WARNING:auto_gptq.nn_modules.fused_llama_mlp:skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.                                                                                                            
Model {'base_model': 'TheBloke/Llama-2-7B-chat-GPTQ', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': ''}, 'load_8bit': False, 'load_4bit': False, 'low_bit_mode': 1, 'load_half': True, 'load_gptq': 'model', 'load_exllama': False, 'use_safetensors': True, 'revision': None, 'use_gpu_id': True, 'gpu_id': 0, 'compile_model': True, 'use_cache': None, 'llamacpp_dict': {'n_gpu_layers': 100, 'use_mlock': True, 'n_batch': 1024, 'n_gqa': 0, 'model_path_llama': 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}, 'model_path_llama': 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}
Starting get_model: OpenAssistant/reward-model-deberta-v3-large-v2 
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
device_map: {'': 1}
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:479: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
load INSTRUCTOR_Transformer
max_seq_length  512
Running on local URL:  http://0.0.0.0:7860

My guess is that GPTQ is using some non-standard path that needs to be mapped inside docker.
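If so, one thing to try is pinning the cache location inside the container to the mounted volume. This is an untested sketch: HF_HOME and TRANSFORMERS_CACHE are the standard Hugging Face cache environment variables, but whether AutoGPTQ honors them here is an assumption.

# Untested sketch: force the Hugging Face cache inside the container onto the
# volume that is already mounted, so whatever path the loader resolves ends up
# under "${HOME}"/.cache on the host as well.
docker run \
       --gpus all \
       --runtime=nvidia \
       --shm-size=2g \
       --rm --init \
       --network host \
       -v "${HOME}"/.cache:/workspace/.cache \
       -v "${HOME}"/save:/workspace/save \
       -e HF_HOME=/workspace/.cache/huggingface \
       -e TRANSFORMERS_CACHE=/workspace/.cache/huggingface/hub \
       -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
       gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
          --base_model=TheBloke/Llama-2-7b-Chat-GPTQ \
          --load_gptq=model \
          --use_safetensors=True \
          --prompt_type=llama2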

rahimnathwani commented 1 year ago

Got it. I commented out --save_dir and all the other flags you didn't use:

#          --save_dir='/workspace/save/' \
#          --use_gpu_id=False \
#          --score_model=None \
#          --max_max_new_tokens=2048 \
#          --max_new_tokens=1024

Now it's downloading the TheBloke/Llama-2-7b-Chat-GPTQ safetensors file:

Downloading model.safetensors:   1%|          | 21.0M/3.90G [01:13<3:11:18, 338kB/s]
pseudotensor commented 1 year ago

OK, that's odd. --save_dir etc. shouldn't matter. It must be some issue with spaces or quotes somewhere in the command.
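For example (a hypothetical illustration, not taken from the logs above), a stray space after a line-continuation backslash ends the command early, and bash then tries to run the remaining flags as their own commands, so they never reach generate.py:

# Hypothetical example: note the space after the backslash at the end of the next line.
echo --base_model=TheBloke/Llama-2-7b-Chat-GPTQ \ 
     --load_gptq=model
# The echo runs without --load_gptq, and bash then reports:
# bash: --load_gptq=model: command not found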

harishkumarbalaji commented 1 year ago

Raised a PR with a fix for this: https://github.com/h2oai/h2ogpt/pull/933#issue-1931386876