bsudhanva closed this issue 1 year ago.
Please try GGML llama models instead
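For example, with the default Llama-2 7B GGML download (this is the same invocation that appears later in this thread):

```
python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path
```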
What exactly is the issue, can you please explain?
You are running out of GPU memory with the 7B falcon model even in 4-bit mode. Perhaps you have other things on the GPU. Can you run nvidia-smi and share the output before you run h2oGPT?
```
Mon Aug 7 02:16:12 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67                 Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 ...  WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P3               9W / 45W  |     0MiB / 8188MiB   |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```
Also, another interesting observation: when I run this
python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path
and this
python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path --load_4bit=True
and the above command with --load_8bit=True,
they all produce the same error:
"python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path --load_4bit=True Using Model llama Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents 0it [00:00, ?it/s] 0it [00:00, ?it/s] Loaded 0 sources for potentially adding to UserData Starting get_model: llama Could not determine --max_seq_len, setting to 2048. Pass if not correct ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 4060 Laptop GPU llama.cpp: loading model from llama-2-7b-chat.ggmlv3.q8_0.bin llama_model_load_internal: format = ggjt v3 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 512 llama_model_load_internal: n_embd = 4096 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 128 llama_model_load_internal: ftype = 7 (mostly Q8_0) llama_model_load_internal: n_ff = 11008 llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 0.08 MB llama_model_load_internal: using CUDA for GPU acceleration llama_model_load_internal: mem required = 8620.72 MB (+ 1026.00 MB per state) llama_model_load_internal: offloading 0 repeating layers to GPU llama_model_load_internal: offloaded 0/35 layers to GPU llama_model_load_internal: total VRAM used: 288 MB warning: failed to VirtualLock 17825792-byte buffer (after previously locking 1407303680 bytes): The paging file is too small for this operation to complete.
WARNING: failed to allocate 258.00 MB of pinned memory: out of memory
Traceback (most recent call last):
File "C:\Users\username\h2ogpt\generate.py", line 16, in
It's almost like the bitsandbytes parameters have no impact whatsoever.
I'm curious to know why this is happening.
Can you ensure you run in low-memory mode: https://github.com/h2oai/h2ogpt/blob/main/docs/FAQ.md#low-memory-mode
i.e. also add:
--hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2 --score_model=None
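Putting those together, a low-memory launch might look like this (a sketch that only combines flags already used in this thread; swap --base_model for whichever model you are loading):

```
python generate.py --base_model=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 --load_4bit=True --score_model=None --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2 --langchain_mode=UserData --user_path=user_path
```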
produces this error:

```
(h2ogpt) C:\Users\username\h2ogpt>python generate.py --base_model=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2 --score_model=None --load_8bit=True --langchain_mode='UserData'
Using Model h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
Prep: persist_directory=db_dir_UserData exists, using
Starting get_model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
Could not determine --max_seq_len, setting to 2048. Pass if not correct
Could not determine --max_seq_len, setting to 2048. Pass if not correct
Could not determine --max_seq_len, setting to 2048. Pass if not correct
device_map: {'': 0}
bin C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 463, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 1172, in _load
    result = unpickler.load()
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 1142, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 1112, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 590938112 bytes.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 467, in load_state_dict
    if f.read(7) == "version":
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1833: character maps to <undefined>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\username\h2ogpt\generate.py", line 16, in <module>
```
I'm experiencing exactly the same issue. Did you solve it?
Nope, not able to find any solution so far; waiting for the author's help.
The primary error I see is this:
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 590938112 bytes.
or with GGML:
warning: failed to VirtualLock 17825792-byte buffer (after previously locking 1407303680 bytes): The paging file is too small for this operation to complete.
WARNING: failed to allocate 258.00 MB of pinned memory: out of memory
Traceback (most recent call last):
See: https://www.reddit.com/r/LocalLLaMA/comments/142rm0m/llamacpp_multi_gpu_support_has_been_merged/
You have insufficient pinned memory on your GPU. You can disable pinning, as in the thread I linked above, by setting this environment variable:
export GGML_CUDA_NO_PINNED=1
on Linux, or the Windows equivalent:
set GGML_CUDA_NO_PINNED=1
before launching h2oGPT.
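For example, on Windows cmd, in the same session you launch from (a minimal sketch reusing the command from this thread; note there must be no space around the = sign, or cmd will create a differently named variable):

```
set GGML_CUDA_NO_PINNED=1
python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path
```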
It seems to be running, but it runs entirely out of my RAM (I tried increasing the page file size to 20 GB, and then it ran successfully), and moreover it utilizes just 9% of VRAM.
```
(h2ogpt) C:\Users\username\h2ogpt>python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path GGML_CUDA_NO_PINNED=1
Using Model llama
Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loaded 0 sources for potentially adding to UserData
Starting get_model: llama
Could not determine --max_seq_len, setting to 2048. Pass if not correct
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU
llama.cpp: loading model from llama-2-7b-chat.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 8620.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/35 layers to GPU
llama_model_load_internal: total VRAM used: 288 MB
warning: failed to VirtualLock 17825792-byte buffer (after previously locking 4418093056 bytes): The paging file is too small for this operation to complete.
llama_new_context_with_model: kv self size = 256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': None}}
Running on local URL: http://0.0.0.0:7860
To create a public link, set share=True in launch().
```
Do you have an older h2oGPT?
You have in output:
offloaded 0/35 layers to GPU
but it should (in h2oGPT main as of several days ago) automatically use max by default and say:
llama_model_load_internal: offloaded 35/35 layers to GPU
That is what I see:
```
jon@pseudotensor:~/h2ogpt$ GGML_CUDA_NO_PINNED=1 python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path
Using Model llama
Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents
load INSTRUCTOR_Transformer
max_seq_length 512
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 9.26it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.14s/it]
Loaded 2 new files as sources to add to UserData
Loaded 264 sources for potentially adding to UserData
Existing db, potentially adding 264 sources from user_path=user_path
Found 264 new sources (0 have no hash in original source, so have to reprocess for migration to sources with hash)
Removing 0 duplicate files from db because ingesting those as new documents
Existing db, adding to db_dir_UserData
Existing db, added 264 new sources from user_path=user_path
Starting get_model: llama
Could not determine --max_seq_len, setting to 2048. Pass if not correct
Already have llama-2-7b-chat.ggmlv3.q8_0.bin from url https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin, delete file if invalid
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
Device 1: NVIDIA GeForce RTX 2080, compute capability 7.5
llama.cpp: loading model from llama-2-7b-chat.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090 Ti) as main device
llama_model_load_internal: mem required = 1804.89 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 8106 MB
warning: failed to mlock 139264000-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MLOCK ('ulimit -l' as root).
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': None}}
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
```
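If an up-to-date checkout still reports `offloaded 0/35 layers to GPU`, you can try forcing the offload count explicitly. A hedged sketch, assuming your checkout exposes a `--llamacpp_dict` option (the flag spelling here is an assumption on my part; verify with `python generate.py --help`):

```
python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path --llamacpp_dict="{'n_gpu_layers': 35}"
```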
I have updated h2oGPT via a git clone, and moreover the new version doesn't mention anything about offloading layers to the GPU.
When I run the command
(h2ogpt) C:\Users\username\h2ogpt>GGML_CUDA_NO_PINNED=1 python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path
I get
'GGML_CUDA_NO_PINNED' is not recognized as an internal or external command, operable program or batch file.
and when I run this command
(h2ogpt) C:\Users\username\h2ogpt>python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path
I get
```
Using Model llama
Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loaded 0 sources for potentially adding to UserData
Starting get_model: llama
Could not determine --max_seq_len, setting to 2048. Pass if not correct
Already have llama-2-7b-chat.ggmlv3.q8_0.bin from url https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin, delete file if invalid
llama.cpp: loading model from llama-2-7b-chat.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 8500.72 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': None}}
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
```
It works, but runs entirely on my RAM and iGPU for output, and there is no mention of the GPU like in the earlier version of h2oGPT (where it says I have xyz GPU and tries to offload to VRAM).
@bsudhanva In your case, it's not using the GPU at all. I recommend the Windows installer for GPU to help avoid installation issues.
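One clue in your last log is `BLAS = 0` (the working runs above show `BLAS = 1`), which usually means the installed llama-cpp-python build has no CUDA support. If you stay on the pip route instead of the installer, a hedged sketch of rebuilding it with cuBLAS on Windows cmd (this follows the llama-cpp-python docs of that era; h2oGPT's Windows docs may instead point to a specific prebuilt wheel, so check there first):

```
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip uninstall -y llama-cpp-python
pip install llama-cpp-python --no-cache-dir --force-reinstall
```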
Are you referring to the link on the main page with the heading (h2oGPT GPU-CUDA Installer (1.8GB file))? Also, should I uninstall the CUDA drivers? BTW, I really appreciate your support @pseudotensor
Update: I downloaded and installed the above file, and the app has the same issue too. It maxes out the RAM usage and then shows an error when trying to load the model. @pseudotensor
If you set GGML_CUDA_NO_PINNED, you need to add it as a Windows environment variable.
How exactly do I do this?
Just a normal Windows thing; one can Google it. This is one useful description with pictures:
You can also Google how to set an env for a given program by changing the shortcut target to use cmd and set it there: https://stackoverflow.com/questions/3036325/can-i-set-an-environment-variable-for-an-application-using-a-shortcut-in-windows
Or you can link to a .bat file that sets the env, etc.
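For example, a minimal launcher .bat could look like this (a sketch; the paths and flags are the ones visible in your logs, so adjust as needed):

```
@echo off
call C:\Users\username\miniconda3\Scripts\activate.bat h2ogpt
set GGML_CUDA_NO_PINNED=1
cd /d C:\Users\username\h2ogpt
python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path
```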
I have set the environment variable as shown
Also, even before running the program, I run the command
set GGML_CUDA_NO_PINNED =1
It makes no difference.
Summary:
1. The app runs in the terminal while utilizing only CPU + RAM.
2. The installer version doesn't run; it maxes out my RAM while loading the model and then, after a while, displays "error" as the error message.
There is a small spike in dGPU usage (under the Copy operation) initially, and then it goes to 0 (when I try to run the app).
After the recent update, the program doesn't try to run on the dGPU at all; the earlier version was trying to allocate to the GPU and failing.
Also, nvcc reports:
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_03:10:21_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0
```
Update: I made a fresh install and am facing a different issue: the app is unable to use the GPU.
I get the following error
"(h2ogpt) C:\Users\username\h2ogpt>python generate.py --base_model=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 --langchain_mode=UserData --score_model=None --load_4bit=True Using Model h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 Prep: persist_directory=db_dir_UserData exists, using Starting get_model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 Could not determine --max_seq_len, setting to 2048. Pass if not correct Could not determine --max_seq_len, setting to 2048. Pass if not correct Could not determine --max_seq_len, setting to 2048. Pass if not correct device_map: {'': 0} bin C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s] Traceback (most recent call last): File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 463, in load_state_dict return torch.load(checkpoint_file, map_location="cpu") File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 809, in load return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args) File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 1172, in _load result = unpickler.load() File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 1142, in persistent_load typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location)) File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 1112, in load_tensor storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 165183488 bytes.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 467, in load_state_dict
    if f.read(7) == "version":
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1833: character maps to <undefined>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\username\h2ogpt\generate.py", line 16, in <module>
entrypoint_main()
File "C:\Users\username\h2ogpt\generate.py", line 12, in entrypoint_main
H2O_Fire(main)
File "C:\Users\username\h2ogpt\src\utils.py", line 57, in H2O_Fire
fire.Fire(component=component, command=args)
File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\fire\core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\fire\core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "C:\Users\username\h2ogpt\src\gen.py", line 824, in main
model0, tokenizer0, device = get_model(reward_type=False,
File "C:\Users\username\h2ogpt\src\gen.py", line 1253, in get_model
return get_hf_model(load_8bit=load_8bit,
File "C:\Users\username\h2ogpt\src\gen.py", line 1385, in get_hf_model
model = get_non_lora_model(base_model, model_loader, load_half, load_gptq,
File "C:\Users\username\h2ogpt\src\gen.py", line 1034, in get_non_lora_model
model = model_loader(
File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\models\auto\auto_factory.py", line 479, in from_pretrained
return model_class.from_pretrained(
File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 2881, in from_pretrained
) = cls._load_pretrained_model(
File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 3214, in _load_pretrained_model
state_dict = load_state_dict(shard_file)
File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 479, in load_state_dict
raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for 'C:\Users\username/.cache\huggingface\hub\models--h2oai--h2ogpt-gm-oasst1-en-2048-falcon-7b-v3\snapshots\381b5e888699801426851281677b55f21a508396\pytorch_model-00001-of-00002.bin' at 'C:\Users\username/.cache\huggingface\hub\models--h2oai--h2ogpt-gm-oasst1-en-2048-falcon-7b-v3\snapshots\381b5e888699801426851281677b55f21a508396\pytorch_model-00001-of-00002.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
```
GPU: NVIDIA RTX 4060 (laptop), 8 GB VRAM