Closed: rahimnathwani closed this issue 1 year ago.
Try https://github.com/h2oai/h2ogpt/issues/886#issuecomment-1732707072
Closing since this is likely the same issue. Feel free to respond and I'll still notice.
Thanks for the fast response. I already tried using --load_gptq=model
but it didn't work:
root@abs:~# rm -rf ~/.cache/
root@abs:~# rm -rf ~/save
root@abs:~# mkdir -p $HOME/.cache
mkdir -p $HOME/save
export CUDA_VISIBLE_DEVICES=0
docker run \
--gpus all \
--runtime=nvidia \
--shm-size=2g \
-p 7860:7860 \
--rm --init \
--network host \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:/workspace/.cache \
-v "${HOME}"/save:/workspace/save \
-e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py \
--base_model=TheBloke/Llama-2-7b-Chat-GPTQ \
--load_gptq=model \
--use_safetensors=True \
--prompt_type=llama2 \
--save_dir='/workspace/save/' \
--use_gpu_id=False \
--score_model=None \
--max_max_new_tokens=2048 \
--max_new_tokens=1024
WARNING: Published ports are discarded when using host network mode
Using Model thebloke/llama-2-7b-chat-gptq
Starting get_model: TheBloke/Llama-2-7b-Chat-GPTQ
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Downloading (…)lve/main/config.json: 100%|██████████| 789/789 [00:00<00:00, 5.17MB/s]
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Downloading (…)okenizer_config.json: 100%|██████████| 727/727 [00:00<00:00, 8.94MB/s]
Downloading tokenizer.model: 100%|██████████| 500k/500k [00:00<00:00, 1.86MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.84M/1.84M [00:00<00:00, 4.38MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 411/411 [00:00<00:00, 4.45MB/s]
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/h2ogpt_conda/lib/python3.10/site-packages/transformers/utils/hub.py:374: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Downloading (…)quantize_config.json: 100%|██████████| 188/188 [00:00<00:00, 2.36MB/s]
Traceback (most recent call last):
File "/workspace/generate.py", line 16, in <module>
entrypoint_main()
File "/workspace/generate.py", line 12, in entrypoint_main
H2O_Fire(main)
File "/workspace/src/utils.py", line 59, in H2O_Fire
fire.Fire(component=component, command=args)
File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/h2ogpt_conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/workspace/src/gen.py", line 1162, in main
model0, tokenizer0, device = get_model(reward_type=False,
File "/workspace/src/gen.py", line 1639, in get_model
return get_hf_model(load_8bit=load_8bit,
File "/workspace/src/gen.py", line 1820, in get_hf_model
model = model_loader(
File "/h2ogpt_conda/lib/python3.10/site-packages/auto_gptq/modeling/auto.py", line 108, in from_quantized
return quant_func(
File "/h2ogpt_conda/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 791, in from_quantized
raise FileNotFoundError(f"Could not find model in {model_name_or_path}")
FileNotFoundError: Could not find model in TheBloke/Llama-2-7b-Chat-GPTQ
root@abs:~#
In the log above, I don't see the code ever attempting to download the model.
In contrast, if I try the first suggested command in the Docker instructions (using --base_model=h2oai/h2ogpt-4096-llama2-7b-chat), then it does download a safetensors file:
# docker run --gpus all --runtime=nvidia --shm-size=2g -p 7860:7860 --rm --init --network host -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -u `id -u`:`id -g` -v "${HOME}"/.cache:/workspace/.cache -v "${HOME}"/save:/workspace/save gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py --base_model=h2oai/h2ogpt-4096-llama2-7b-chat --use_safetensors=True --prompt_type=llama2 --save_dir='/workspace/save/' --use_gpu_id=False --score_model=None --max_max_new_tokens=2048 --max_new_tokens=1024
WARNING: Published ports are discarded when using host network mode
Using Model h2oai/h2ogpt-4096-llama2-7b-chat
Starting get_model: h2oai/h2ogpt-4096-llama2-7b-chat
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:479: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Downloading shards: 0%| | 0/2 [00:00<?, ?it/s]
Downloading (…)of-00002.safetensors
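For what it's worth, this is roughly how I checked on the host whether any model weights had landed in the mounted cache after the failing GPTQ run (the path below assumes the standard Hugging Face hub cache layout, so treat it as a guess):
ls -R "${HOME}"/.cache/huggingface/hub/models--TheBloke--Llama-2-7b-Chat-GPTQ/
It matches the log above: only the config, tokenizer and quantize_config files are present, no .safetensors weights.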
Ok, FYI if I do:
python generate.py --base_model=TheBloke/Llama-2-7B-chat-GPTQ --load_gptq=model --use_safetensors=True --prompt_type=llama2
it works:
Using Model thebloke/llama-2-7b-chat-gptq
Starting get_model: TheBloke/Llama-2-7B-chat-GPTQ
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
device_map: {'': 0}
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/utils/hub.py:374: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
WARNING:auto_gptq.nn_modules.fused_llama_mlp:skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
Model {'base_model': 'TheBloke/Llama-2-7B-chat-GPTQ', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': ''}, 'load_8bit': False, 'load_4bit': False, 'low_bit_mode': 1, 'load_half': True, 'load_gptq': 'model', 'load_exllama': False, 'use_safetensors': True, 'revision': None, 'use_gpu_id': True, 'gpu_id': 0, 'compile_model': True, 'use_cache': None, 'llamacpp_dict': {'n_gpu_layers': 100, 'use_mlock': True, 'n_batch': 1024, 'n_gqa': 0, 'model_path_llama': 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}, 'model_path_llama': 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}
Starting get_model: OpenAssistant/reward-model-deberta-v3-large-v2
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1006: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:640: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
device_map: {'': 1}
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:479: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
load INSTRUCTOR_Transformer
max_seq_length 512
Running on local URL: http://0.0.0.0:7860
My guess is that GPTQ is using some non-standard path that needs to be mapped inside the docker container.
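If that is the cause, one thing worth trying (just a guess on my part, not verified) is pointing the Hugging Face cache explicitly at the mounted volume, so that transformers and AutoGPTQ resolve to the same location, e.g. adding these to the docker run line alongside the existing -v "${HOME}"/.cache:/workspace/.cache mount:
-e HF_HOME=/workspace/.cache/huggingface \
-e TRANSFORMERS_CACHE=/workspace/.cache/huggingface/hub \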
Got it. I commented out --save_dir
and all the other flags you didn't use:
# --save_dir='/workspace/save/' \
# --use_gpu_id=False \
# --score_model=None \
# --max_max_new_tokens=2048 \
# --max_new_tokens=1024
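So the command I'm running now is effectively the same docker run as before, minus those flags:
docker run --gpus all --runtime=nvidia --shm-size=2g -p 7860:7860 --rm --init --network host -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -u `id -u`:`id -g` -v "${HOME}"/.cache:/workspace/.cache -v "${HOME}"/save:/workspace/save -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0 /workspace/generate.py --base_model=TheBloke/Llama-2-7b-Chat-GPTQ --load_gptq=model --use_safetensors=True --prompt_type=llama2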
Now it's downloading the TheBloke/Llama-2-7b-Chat-GPTQ safetensors file:
Downloading model.safetensors: 1%| | 21.0M/3.90G [01:13<3:11:18, 338kB/s]
Ok, that's odd. --save_dir etc. shouldn't matter. Must be some issue with spaces or quotes somewhere.
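If it is a quoting issue, one quick way to narrow it down (just a suggestion) would be to add the removed flags back one at a time and see which one reintroduces the FileNotFoundError, starting with an unquoted save dir:
--save_dir=/workspace/save \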
Raised a PR with a fix for this: https://github.com/h2oai/h2ogpt/pull/933#issue-1931386876
The Linux GPU instructions say:
An example of running h2oGPT via docker using AutoGPTQ (4-bit, so using less GPU memory) with LLaMa2 7B model is:
When I run this I get the following error:
Does the docker container automatically download the relevant model, or am I meant to place it somewhere? I looked around the cache folder, but did not find a model file:
I also tried changing one of the parameters to
--load_gptq=model
as I saw it mentioned in a similar issue, but that didn't work either. Thanks for reading this far.
Any ideas?