h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
http://h2o.ai
Apache License 2.0

unable to run the app #620

Closed bsudhanva closed 1 year ago

bsudhanva commented 1 year ago

I get the following error:

(h2ogpt) C:\Users\username\h2ogpt>python generate.py --base_model=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 --langchain_mode=UserData --score_model=None --load_4bit=True
Using Model h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
Prep: persist_directory=db_dir_UserData exists, using
Starting get_model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
Could not determine --max_seq_len, setting to 2048. Pass if not correct
Could not determine --max_seq_len, setting to 2048. Pass if not correct
Could not determine --max_seq_len, setting to 2048. Pass if not correct
device_map: {'': 0}
bin C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 463, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 1172, in _load
    result = unpickler.load()
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 1142, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 1112, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 165183488 bytes.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 467, in load_state_dict
    if f.read(7) == "version":
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1833: character maps to <undefined>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\username\h2ogpt\generate.py", line 16, in <module>
    entrypoint_main()
  File "C:\Users\username\h2ogpt\generate.py", line 12, in entrypoint_main
    H2O_Fire(main)
  File "C:\Users\username\h2ogpt\src\utils.py", line 57, in H2O_Fire
    fire.Fire(component=component, command=args)
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "C:\Users\username\h2ogpt\src\gen.py", line 824, in main
    model0, tokenizer0, device = get_model(reward_type=False,
  File "C:\Users\username\h2ogpt\src\gen.py", line 1253, in get_model
    return get_hf_model(load_8bit=load_8bit,
  File "C:\Users\username\h2ogpt\src\gen.py", line 1385, in get_hf_model
    model = get_non_lora_model(base_model, model_loader, load_half, load_gptq,
  File "C:\Users\username\h2ogpt\src\gen.py", line 1034, in get_non_lora_model
    model = model_loader(
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\models\auto\auto_factory.py", line 479, in from_pretrained
    return model_class.from_pretrained(
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 2881, in from_pretrained
    ) = cls._load_pretrained_model(
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 3214, in _load_pretrained_model
    state_dict = load_state_dict(shard_file)
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 479, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for 'C:\Users\username/.cache\huggingface\hub\models--h2oai--h2ogpt-gm-oasst1-en-2048-falcon-7b-v3\snapshots\381b5e888699801426851281677b55f21a508396\pytorch_model-00001-of-00002.bin' at 'C:\Users\username/.cache\huggingface\hub\models--h2oai--h2ogpt-gm-oasst1-en-2048-falcon-7b-v3\snapshots\381b5e888699801426851281677b55f21a508396\pytorch_model-00001-of-00002.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

GPU: NVIDIA RTX 4060, 8 GB VRAM

pseudotensor commented 1 year ago

Please try GGML llama models instead
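For example, a GGML LLaMa 2 chat model can be selected like this (this is the same invocation that appears later in this thread; the GGML model file is fetched automatically if not already present):

python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path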

bsudhanva commented 1 year ago

What exactly is the issue, can you please explain?

pseudotensor commented 1 year ago

You are running out of GPU memory with the 7B Falcon model even in 4-bit mode. Perhaps you have other things on the GPU. Can you run nvidia-smi and share the output before you run h2oGPT?

bsudhanva commented 1 year ago

Mon Aug 7 02:16:12 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67                 Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 ...   WDDM | 00000000:01:00.0 Off |                  N/A |
| N/A   42C    P3               9W /  45W |      0MiB /  8188MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

bsudhanva commented 1 year ago

Also, another interesting observation: when I run this

python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path

and this

python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path --load_4bit=True

and the above command with --load_8bit=True,

they all produce the same error:

python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path --load_4bit=True
Using Model llama
Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loaded 0 sources for potentially adding to UserData
Starting get_model: llama
Could not determine --max_seq_len, setting to 2048. Pass if not correct
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU
llama.cpp: loading model from llama-2-7b-chat.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 8620.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/35 layers to GPU
llama_model_load_internal: total VRAM used: 288 MB
warning: failed to VirtualLock 17825792-byte buffer (after previously locking 1407303680 bytes): The paging file is too small for this operation to complete.

WARNING: failed to allocate 258.00 MB of pinned memory: out of memory
Traceback (most recent call last):
  File "C:\Users\username\h2ogpt\generate.py", line 16, in <module>
    entrypoint_main()
  File "C:\Users\username\h2ogpt\generate.py", line 12, in entrypoint_main
    H2O_Fire(main)
  File "C:\Users\username\h2ogpt\src\utils.py", line 57, in H2O_Fire
    fire.Fire(component=component, command=args)
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "C:\Users\username\h2ogpt\src\gen.py", line 824, in main
    model0, tokenizer0, device = get_model(reward_type=False,
  File "C:\Users\username\h2ogpt\src\gen.py", line 1247, in get_model
    model, tokenizer, device = get_model_tokenizer_gpt4all(base_model, n_jobs=n_jobs)
  File "C:\Users\username\h2ogpt\src\gpt4all_llm.py", line 16, in get_model_tokenizer_gpt4all
    model = get_llm_gpt4all(model_name, model=None,
  File "C:\Users\username\h2ogpt\src\gpt4all_llm.py", line 132, in get_llm_gpt4all
    llm = cls(**model_kwargs)
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\langchain\load\serializable.py", line 74, in __init__
    super().__init__(**kwargs)
  File "pydantic\main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for H2OLlamaCpp
__root__
  Could not load Llama model from path: llama-2-7b-chat.ggmlv3.q8_0.bin. Received error [WinError -529697949] Windows Error 0xe06d7363 (type=value_error)
Exception ignored in: <function Llama.__del__ at 0x0000023D68F86E60>
Traceback (most recent call last):
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\llama_cpp\llama.py", line 1445, in __del__
    if self.ctx is not None:
AttributeError: 'Llama' object has no attribute 'ctx'

It's almost like the bitsandbytes parameters have no impact whatsoever.

I'm curious to know why this is happening.

pseudotensor commented 1 year ago

Can you ensure you run in low-memory mode: https://github.com/h2oai/h2ogpt/blob/main/docs/FAQ.md#low-memory-mode

i.e. also add:

--hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2 --score_model=None
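For example, combining the command from the first post with those low-memory flags would look roughly like this (a sketch only, using just the flags already mentioned in this thread; keep whichever of --load_4bit/--load_8bit you intend to use):

python generate.py --base_model=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 --load_4bit=True --langchain_mode=UserData --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2 --score_model=None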

bsudhanva commented 1 year ago

Produces this error:

(h2ogpt) C:\Users\username\h2ogpt>python generate.py --base_model=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2 --score_model=None --load_8bit=True --langchain_mode='UserData'
Using Model h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
Prep: persist_directory=db_dir_UserData exists, using
Starting get_model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
Could not determine --max_seq_len, setting to 2048. Pass if not correct
Could not determine --max_seq_len, setting to 2048. Pass if not correct
Could not determine --max_seq_len, setting to 2048. Pass if not correct
device_map: {'': 0}
bin C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 463, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 1172, in _load
    result = unpickler.load()
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 1142, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\torch\serialization.py", line 1112, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 590938112 bytes.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 467, in load_state_dict
    if f.read(7) == "version":
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1833: character maps to <undefined>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\username\h2ogpt\generate.py", line 16, in <module>
    entrypoint_main()
  File "C:\Users\username\h2ogpt\generate.py", line 12, in entrypoint_main
    H2O_Fire(main)
  File "C:\Users\username\h2ogpt\src\utils.py", line 57, in H2O_Fire
    fire.Fire(component=component, command=args)
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "C:\Users\username\h2ogpt\src\gen.py", line 824, in main
    model0, tokenizer0, device = get_model(reward_type=False,
  File "C:\Users\username\h2ogpt\src\gen.py", line 1253, in get_model
    return get_hf_model(load_8bit=load_8bit,
  File "C:\Users\username\h2ogpt\src\gen.py", line 1385, in get_hf_model
    model = get_non_lora_model(base_model, model_loader, load_half, load_gptq,
  File "C:\Users\username\h2ogpt\src\gen.py", line 1034, in get_non_lora_model
    model = model_loader(
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\models\auto\auto_factory.py", line 479, in from_pretrained
    return model_class.from_pretrained(
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 2881, in from_pretrained
    ) = cls._load_pretrained_model(
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 3214, in _load_pretrained_model
    state_dict = load_state_dict(shard_file)
  File "C:\Users\username\miniconda3\envs\h2ogpt\lib\site-packages\transformers\modeling_utils.py", line 479, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for 'C:\Users\username/.cache\huggingface\hub\models--h2oai--h2ogpt-gm-oasst1-en-2048-falcon-7b-v3\snapshots\381b5e888699801426851281677b55f21a508396\pytorch_model-00001-of-00002.bin' at 'C:\Users\username/.cache\huggingface\hub\models--h2oai--h2ogpt-gm-oasst1-en-2048-falcon-7b-v3\snapshots\381b5e888699801426851281677b55f21a508396\pytorch_model-00001-of-00002.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

gtroia1971 commented 1 year ago

I'm experiencing exactly the same issue. Did you solve it?

bsudhanva commented 1 year ago

Nope, not able to find any solution so far; waiting for the author's help.

pseudotensor commented 1 year ago

The primary error I see is this:

RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 590938112 bytes.

or with GGML:

warning: failed to VirtualLock 17825792-byte buffer (after previously locking 1407303680 bytes): The paging file is too small for this operation to complete.

WARNING: failed to allocate 258.00 MB of pinned memory: out of memory
Traceback (most recent call last):

See: https://www.reddit.com/r/LocalLLaMA/comments/142rm0m/llamacpp_multi_gpu_support_has_been_merged/

You have insufficient pinned memory on your GPU. You can disable pinning, as in the thread linked above, by setting this environment variable:

export GGML_CUDA_NO_PINNED=1

on Linux, or the equivalent on Windows:

set GGML_CUDA_NO_PINNED=1

before launching h2oGPT.
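For example, any of these standard Windows ways of setting it would work (the first two only affect the current terminal session; setx persists the variable for newly opened terminals):

cmd.exe (current session): set GGML_CUDA_NO_PINNED=1
PowerShell (current session): $env:GGML_CUDA_NO_PINNED = "1"
Persistent (new sessions): setx GGML_CUDA_NO_PINNED 1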

bsudhanva commented 1 year ago

It seems to be running, but it runs entirely out of my RAM (I tried increasing the page file size to 20 GB, and then it ran successfully), and moreover it utilizes just 9% of VRAM.

(h2ogpt) C:\Users\username\h2ogpt>python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path GGML_CUDA_NO_PINNED=1
Using Model llama
Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loaded 0 sources for potentially adding to UserData
Starting get_model: llama
Could not determine --max_seq_len, setting to 2048. Pass if not correct
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Laptop GPU
llama.cpp: loading model from llama-2-7b-chat.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 8620.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: offloading 0 repeating layers to GPU
llama_model_load_internal: offloaded 0/35 layers to GPU
llama_model_load_internal: total VRAM used: 288 MB
warning: failed to VirtualLock 17825792-byte buffer (after previously locking 4418093056 bytes): The paging file is too small for this operation to complete.

llama_new_context_with_model: kv self size = 256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': None}}
Running on local URL: http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.

pseudotensor commented 1 year ago

Do you have an older h2oGPT?

You have in output:

offloaded 0/35 layers to GPU

but it should (in h2oGPT main as of several days ago) automatically use max by default and say:

llama_model_load_internal: offloaded 35/35 layers to GPU

That is what I see:

jon@pseudotensor:~/h2ogpt$ GGML_CUDA_NO_PINNED=1 python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path
Using Model llama
Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents
load INSTRUCTOR_Transformer
max_seq_length  512
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.26it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.14s/it]
Loaded 2 new files as sources to add to UserData
Loaded 264 sources for potentially adding to UserData
Existing db, potentially adding 264 sources from user_path=user_path
Found 264 new sources (0 have no hash in original source, so have to reprocess for migration to sources with hash)
Removing 0 duplicate files from db because ingesting those as new documents
Existing db, adding to db_dir_UserData
Existing db, added 264 new sources from user_path=user_path
Starting get_model: llama 
Could not determine --max_seq_len, setting to 2048.  Pass if not correct
Already have llama-2-7b-chat.ggmlv3.q8_0.bin from url https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin, delete file if invalid
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
  Device 1: NVIDIA GeForce RTX 2080, compute capability 7.5
llama.cpp: loading model from llama-2-7b-chat.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090 Ti) as main device
llama_model_load_internal: mem required  = 1804.89 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 8106 MB
warning: failed to mlock 139264000-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MLOCK ('ulimit -l' as root).
llama_new_context_with_model: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': None}}
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.

bsudhanva commented 1 year ago

I have updated h2ogpt via the git clone command, and moreover the new version doesn't mention anything about offloading layers to the GPU.

When I run the command

(h2ogpt) C:\Users\username\h2ogpt>GGML_CUDA_NO_PINNED=1 python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path

I get: 'GGML_CUDA_NO_PINNED' is not recognized as an internal or external command, operable program or batch file.

And when I run this command

(h2ogpt) C:\Users\username\h2ogpt>python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path

I get:

Using Model llama
Prep: persist_directory=db_dir_UserData exists, user_path=user_path passed, adding any changed or new documents
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loaded 0 sources for potentially adding to UserData
Starting get_model: llama
Could not determine --max_seq_len, setting to 2048. Pass if not correct
Already have llama-2-7b-chat.ggmlv3.q8_0.bin from url https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin, delete file if invalid
llama.cpp: loading model from llama-2-7b-chat.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 8500.72 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': None}}
Running on local URL: http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.

It works, but it runs entirely on my RAM and iGPU for output, and there is no mention of the GPU like in the earlier version of h2ogpt (where it says I have xyz GPU and tries to offload to VRAM).

pseudotensor commented 1 year ago

@bsudhanva In your case, it's not using GPU at all. I recommend the windows installer for GPU to help avoid installation issues.

bsudhanva commented 1 year ago

Are you referring to this link on the main page with the heading "h2oGPT GPU-CUDA Installer (1.8GB file)"? Also, should I uninstall the CUDA drivers? BTW, I really appreciate your support @pseudotensor.

bsudhanva commented 1 year ago

Update: I downloaded and installed the above file; the app has the same issue too. It maxes out the RAM usage and then shows an error when trying to load the model. @pseudotensor

pseudotensor commented 1 year ago

If you set GGML_CUDA_NO_PINNED, you need to add it as a Windows environment variable.

bsudhanva commented 1 year ago

If you set GGML_CUDA_NO_PINNED, you need to add it as a Windows environment variable.

How exactly do I do this?

pseudotensor commented 1 year ago

Just a normal Windows thing; one can Google it. This is one useful description with pictures:

https://docs.oracle.com/en/database/oracle/machine-learning/oml4r/1.5.1/oread/creating-and-modifying-environment-variables-on-windows.html#GUID-DD6F9982-60D5-48F6-8270-A27EC53807D0

You can also Google how to set an environment variable for a given program by changing the shortcut target to use cmd and set: https://stackoverflow.com/questions/3036325/can-i-set-an-environment-variable-for-an-application-using-a-shortcut-in-windows

Or you can link to a .bat file that sets the env, etc.
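For example, a minimal sketch of such a .bat launcher, using only the paths and flags already mentioned in this thread (it assumes conda is callable from cmd.exe; adjust the directory, env name, and flags to your setup):

@echo off
rem Disable GGML pinned-memory allocation before starting h2oGPT
set GGML_CUDA_NO_PINNED=1
rem Activate the conda env used in this thread (assumes conda is set up for cmd.exe)
call conda activate h2ogpt
cd /d C:\Users\username\h2ogpt
python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path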

bsudhanva commented 1 year ago

Just a normal Windows thing; one can Google it. This is one useful description with pictures:

https://docs.oracle.com/en/database/oracle/machine-learning/oml4r/1.5.1/oread/creating-and-modifying-environment-variables-on-windows.html#GUID-DD6F9982-60D5-48F6-8270-A27EC53807D0

You can also Google how to set an environment variable for a given program by changing the shortcut target to use cmd and set: https://stackoverflow.com/questions/3036325/can-i-set-an-environment-variable-for-an-application-using-a-shortcut-in-windows

Or you can link to a .bat file that sets the env, etc.

I have set the environment variable as shown in the image.

Also, even before running the program I run the command set GGML_CUDA_NO_PINNED =1 and it makes no difference.

Summary:

1. The app runs in the terminal while utilizing only CPU + RAM.
2. The installer version doesn't run; it maxes out my RAM while loading the model and then, after a while, displays "error" as the error message.

  1. There is a small spike in dGPU usage (under the Copy operation) initially, and then it goes to 0 (when I try to run the app).
  2. After the recent update, the program doesn't try to run on the dGPU at all; the earlier version tried to allocate to the GPU and failed.

Also:

  1. Does the app need a specific CUDA driver version to work on the GPU? I have the latest CUDA driver downloaded from NVIDIA:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_03:10:21_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

bsudhanva commented 1 year ago

Update: Made a fresh install and am facing a different issue: the app is unable to use the GPU.