h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/

pytorch_model.bin 1.34G download hangs forever on Linux #1615

Closed han-sogawa closed 5 months ago

han-sogawa commented 5 months ago

Hello, I've just done a fresh manual install of h2ogpt on Linux. My OS is Rocky Linux 9.3, and I have CUDA 12.4 installed and available.

I ran the docs/linux_install.sh file (replacing apt-get with Rocky Linux's dnf package manager) and I believe I installed all of the required components. I had issues with a few packages, but they were marked optional and I don't think they are relevant to my problem.

Whenever I run generate.py, no matter what parameters I pass, it produces the following output:

soundfile, librosa, and wavio not installed, disabling STT
soundfile, librosa, and wavio not installed, disabling TTS
Using Model [whatever I passed in as --base_model]
pytorch_model.bin:       0%|                                                 | 0.00/1.34G [00:00<?, ?B/s]

...and then it just hangs there forever, never downloading anything. It is always the same file name and size (1.34G) no matter what model I set as base_model. I even downloaded a model locally and pointed model_path at it, but I get exactly the same output. I have google-chrome-stable installed.

Any ideas why this is happening, or how I can dig deeper to see what pytorch_model.bin file it is trying so hard to download? Is there some kind of permission I need to grant so that Python can reach the endpoint it's trying to access?

Thank you

han-sogawa commented 5 months ago

https://huggingface.co/api/models/hkunlp/instructor-large is what it cannot download, although I can access it in the browser.
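
A quick way to check whether the same Python environment can even reach that endpoint (just a minimal sketch, assuming requests is installed in the h2ogpt env):

import requests

# Probe the Hugging Face API endpoint the download appears to be stuck on;
# a non-200 status or a timeout here would point at a network/proxy problem.
resp = requests.get("https://huggingface.co/api/models/hkunlp/instructor-large", timeout=10)
print(resp.status_code, len(resp.content))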

pseudotensor commented 5 months ago

Are you using that as the base model? What is your actual generate.py line?

han-sogawa commented 5 months ago

No, it looks like it is another dependency, which it attempts to download regardless of which base model I use.

One example of a generate.py line that I have tried: python generate.py --base_model=meta-llama/llama-2-7b-chat-hf --score_model=None --langchain_mode='UserData' --user_path=user_path --use_auth_token=True --max_seq_len=4096 --max_max_new_tokens=2048

han-sogawa commented 5 months ago

I think this may be where it tries to download the file: https://github.com/h2oai/h2ogpt/blob/e0f5ab9eeac64d60e394180b5bf3e7be9876a649/src/gpt_langchain.py#L530
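
For context, here is roughly what I think that code path amounts to (a sketch based on my reading, not the actual h2ogpt source; the LangChain import path may differ by version):

# Sketch: LangChain's instructor-embeddings wrapper pulls hkunlp/instructor-large
# from the Hugging Face Hub on first use, which matches the ~1.34G pytorch_model.bin
# that hangs for me regardless of --base_model.
from langchain_community.embeddings import HuggingFaceInstructEmbeddings

embedding = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large",
    model_kwargs={"device": "cuda"},
)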

pseudotensor commented 5 months ago

What if you try a different embedding model? E.g., add this to your generate.py line:

--hf_embedding_model=sentence-transformers/all-MiniLM-L12-v2

Also, you can try disabling hf_transfer by setting this environment variable:

export HF_HUB_ENABLE_HF_TRANSFER=0
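
For example, combining both suggestions with the command you gave would look something like this (just showing where the flag and the env var go; same arguments as yours otherwise):

export HF_HUB_ENABLE_HF_TRANSFER=0
python generate.py --base_model=meta-llama/llama-2-7b-chat-hf --score_model=None --langchain_mode='UserData' --user_path=user_path --use_auth_token=True --max_seq_len=4096 --max_max_new_tokens=2048 --hf_embedding_model=sentence-transformers/all-MiniLM-L12-v2
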
pseudotensor commented 5 months ago

FYI, this is what it looks like when running the command you gave:

(h2ogpt) jon@pseudotensor:~/h2ogpt$ python generate.py --base_model=meta-llama/llama-2-7b-chat-hf --score_model=None --langchain_mode='UserData' --user_path=user_path --use_auth_token=True --max_seq_len=4096 --max_max_new_tokens=2048
Using Model meta-llama/llama-2-7b-chat-hf
load INSTRUCTOR_Transformer
max_seq_length  512
Starting get_model: meta-llama/llama-2-7b-chat-hf 
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 614/614 [00:00<00:00, 1.45MB/s]
Overriding max_seq_len -> 4096
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.62k/1.62k [00:00<00:00, 3.93MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 8.89MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 5.95MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 414/414 [00:00<00:00, 876kB/s]
Overriding max_seq_len -> 4096
Overriding max_seq_len -> 4096
device_map: {'': 0}
pytorch_model.bin.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.8k/26.8k [00:00<00:00, 85.6MB/s]
pytorch_model-00001-of-00002.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 9.98G/9.98G [01:29<00:00, 112MB/s]
pytorch_model-00002-of-00002.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 3.50G/3.50G [00:31<00:00, 110MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [02:01<00:00, 60.77s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.30s/it]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 188/188 [00:00<00:00, 530kB/s]
Model {'base_model': 'meta-llama/llama-2-7b-chat-hf', 'base_model0': 'meta-llama/llama-2-7b-chat-hf', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': "<s>[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n", 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.", 'can_handle_system_prompt': True}, 'display_name': 'meta-llama/llama-2-7b-chat-hf', 'visible_models': None, 'h2ogpt_key': None, 'load_8bit': False, 'load_4bit': False, 'low_bit_mode': 1, 'load_half': True, 'use_flash_attention_2': False, 'load_gptq': '', 'load_awq': '', 'load_exllama': False, 'use_safetensors': False, 'revision': None, 'use_gpu_id': True, 'gpu_id': 0, 'compile_model': None, 'use_cache': None, 'llamacpp_dict': {'n_gpu_layers': 100, 'use_mlock': True, 'n_batch': 1024, 'n_gqa': 0, 'model_path_llama': '', 'model_name_gptj': '', 'model_name_gpt4all_llama': '', 'model_name_exllama_if_no_config': ''}, 'rope_scaling': {}, 'max_seq_len': 4096, 'max_output_seq_len': None, 'exllama_dict': {}, 'gptq_dict': {}, 'attention_sinks': False, 'sink_dict': {}, 'truncation_generation': False, 'hf_model_dict': {}, 'force_seq2seq_type': False, 'force_t5_type': False, 'trust_remote_code': True}
Begin auto-detect HF cache text generation models
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
No loading model philschmid/bart-large-cnn-samsum because is_encoder_decoder=True
/home/jon/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-30b-instruct/68deee8b69383b30826ea2fc642ba170b89e4edd/configuration_mpt.py:114: UserWarning: alibi or rope is turned on, setting `learned_pos_emb` to `False.`
  warnings.warn(f'alibi or rope is turned on, setting `learned_pos_emb` to `False.`')
/home/jon/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-30b-instruct/68deee8b69383b30826ea2fc642ba170b89e4edd/configuration_mpt.py:141: UserWarning: If not using a Prefix Language Model, we recommend setting "attn_impl" to "flash" instead of "triton".
  warnings.warn(UserWarning('If not using a Prefix Language Model, we recommend setting "attn_impl" to "flash" instead of "triton".'))
WARNING:transformers_modules.tiiuae.falcon-40b-instruct.ecb78d97ac356d098e79f0db222c9ce7c5d9ee5f.configuration_falcon:
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.

No loading model openai/whisper-large-v3 because is_encoder_decoder=True
No loading model openai/whisper-base.en because is_encoder_decoder=True
No loading model h2oai/ggml because h2oai/ggml does not appear to have a file named config.json. Checkout 'https://huggingface.co/h2oai/ggml/main' for available files.
No loading model Systran/faster-whisper-large-v3 because is_encoder_decoder=True
No loading model openai/whisper-medium because is_encoder_decoder=True
No loading model philschmid/flan-t5-base-samsum because is_encoder_decoder=True
No loading model stabilityai/stable-diffusion-xl-refiner-1.0 because stabilityai/stable-diffusion-xl-refiner-1.0 does not appear to have a file named config.json. Checkout 'https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/main' for available files.
No loading model distil-whisper/distil-large-v2 because is_encoder_decoder=True
No loading model tloen/alpaca-lora-7b because tloen/alpaca-lora-7b does not appear to have a file named config.json. Checkout 'https://huggingface.co/tloen/alpaca-lora-7b/main' for available files.
/home/jon/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/039e37745f00858f0e01e988383a8c4393b1a4f5/configuration_mpt.py:114: UserWarning: alibi or rope is turned on, setting `learned_pos_emb` to `False.`
  warnings.warn(f'alibi or rope is turned on, setting `learned_pos_emb` to `False.`')
/home/jon/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/039e37745f00858f0e01e988383a8c4393b1a4f5/configuration_mpt.py:141: UserWarning: If not using a Prefix Language Model, we recommend setting "attn_impl" to "flash" instead of "triton".
  warnings.warn(UserWarning('If not using a Prefix Language Model, we recommend setting "attn_impl" to "flash" instead of "triton".'))
No loading model distil-whisper/distil-large-v3 because is_encoder_decoder=True
No loading model microsoft/speecht5_hifigan because The checkpoint you are trying to load has model type `hifigan` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
No loading model unstructuredio/detectron2_faster_rcnn_R_50_FPN_3x because unstructuredio/detectron2_faster_rcnn_R_50_FPN_3x does not appear to have a file named config.json. Checkout 'https://huggingface.co/unstructuredio/detectron2_faster_rcnn_R_50_FPN_3x/main' for available files.
No loading model stabilityai/stable-diffusion-xl-base-1.0 because stabilityai/stable-diffusion-xl-base-1.0 does not appear to have a file named config.json. Checkout 'https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/main' for available files.
No loading model Salesforce/blip2-flan-t5-xl because is_encoder_decoder=True
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:103: FutureWarning: The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect
  warnings.warn(
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/llava/configuration_llava.py:143: FutureWarning: The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.
  warnings.warn(
No loading model google/pix2struct-textcaps-base because is_encoder_decoder=True
No loading model Salesforce/blip2-flan-t5-xxl because is_encoder_decoder=True
/home/jon/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-30b-chat/28fc475f7b73a5631fbbc6419645c27177f275d4/configuration_mpt.py:114: UserWarning: alibi or rope is turned on, setting `learned_pos_emb` to `False.`
  warnings.warn(f'alibi or rope is turned on, setting `learned_pos_emb` to `False.`')
/home/jon/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-30b-chat/28fc475f7b73a5631fbbc6419645c27177f275d4/configuration_mpt.py:141: UserWarning: If not using a Prefix Language Model, we recommend setting "attn_impl" to "flash" instead of "triton".
  warnings.warn(UserWarning('If not using a Prefix Language Model, we recommend setting "attn_impl" to "flash" instead of "triton".'))
No loading model microsoft/speecht5_vc because is_encoder_decoder=True
No loading model microsoft/speecht5_tts because is_encoder_decoder=True
End auto-detect HF cache text generation models
Begin auto-detect llama.cpp models
End auto-detect llama.cpp models
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Started Gradio Server and/or GUI: server_name: localhost port: 7860
Use local URL: http://localhost:7860/
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/pydantic/_internal/_fields.py:160: UserWarning: Field "model_info" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/pydantic/_internal/_fields.py:160: UserWarning: Field "model_names" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
OpenAI API URL: http://0.0.0.0:5000
INFO:__name__:OpenAI API URL: http://0.0.0.0:5000
OpenAI API key: EMPTY
INFO:__name__:OpenAI API key: EMPTY

All fine here.

If I remove the cached instructor-large model and try again:

(h2ogpt) jon@pseudotensor:~/h2ogpt$ rm -rf ~/.cache/torch/sentence_transformers/hkunlp_instructor-large/
(h2ogpt) jon@pseudotensor:~/h2ogpt$ python generate.py --base_model=meta-llama/llama-2-7b-chat-hf --score_model=None --langchain_mode='UserData' --user_path=user_path --use_auth_token=True --max_seq_len=4096 --max_max_new_tokens=2048
Using Model meta-llama/llama-2-7b-chat-hf
.gitattributes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.48k/1.48k [00:00<00:00, 3.83MB/s]
1_Pooling/config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 270/270 [00:00<00:00, 792kB/s]
2_Dense/config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 1.52MB/s]
pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.15M/3.15M [00:00<00:00, 31.9MB/s]
README.md: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 66.3k/66.3k [00:00<00:00, 1.16MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.53k/1.53k [00:00<00:00, 3.43MB/s]
config_sentence_transformers.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 122/122 [00:00<00:00, 1.59MB/s]
pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 1.34G/1.34G [00:13<00:00, 100MB/s]
sentence_bert_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 53.0/53.0 [00:00<00:00, 116kB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.20k/2.20k [00:00<00:00, 6.28MB/s]
spiece.model: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 792k/792k [00:00<00:00, 13.6MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.42M/2.42M [00:00<00:00, 12.8MB/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.41k/2.41k [00:00<00:00, 7.18MB/s]
modules.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 461/461 [00:00<00:00, 5.86MB/s]
load INSTRUCTOR_Transformer
... same as before

It downloads fine. So I guess you have some network complication.
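
If there is a proxy or firewall between your machine and huggingface.co (just a guess, not something confirmed here), make sure the Python process actually sees it, e.g.:

# hypothetical proxy address, adjust for your network
export HTTPS_PROXY=http://your-proxy-host:3128
export HTTP_PROXY=http://your-proxy-host:3128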

han-sogawa commented 5 months ago

The workaround of adding --hf_embedding_model=sentence-transformers/all-MiniLM-L12-v2 worked for me, thank you! I still don't know why the instructor-large embedding file wouldn't download; I'll update if I find out more, but for now my issue is resolved. Thank you very much!