lllyasviel / Omost


Error when starting gradio_app.py #38

Open · nomopo45 opened this issue 4 weeks ago

nomopo45 commented 4 weeks ago

Hello, I tried to start the app on my Windows PC, but it seems my Nvidia card is not good enough (RTX 2060, 6GB). I saw there was an 8GB requirement, but I thought there would be a way to switch to a smaller model like omost-phi-3-mini-128k, or even omost-llama-3-8b-4bits. I don't see a way to use them, though. Is there a command-line option for this? And would that solve my issue?

Thank you

```
(omost) C:\Users\xxx\Documents\Omost>python gradio_app.py
tokenizer/tokenizer_config.json: 100%|█████████████████████████████████████████████████| 737/737 [00:00<00:00, 768kB/s]
D:\Anaconda\envs\omost\lib\site-packages\huggingface_hub\file_download.py:157: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\xxx\Documents\Omost\hf_download\hub\models--SG161222--RealVisXL_V4.0. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations. To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
tokenizer/vocab.json: 100%|███████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 1.32MB/s]
tokenizer/merges.txt: 100%|██████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 988kB/s]
tokenizer/special_tokens_map.json: 100%|██████████████████████████████████████████████████████| 472/472 [00:00<?, ?B/s]
tokenizer_2/tokenizer_config.json: 100%|██████████████████████████████████████████████████████| 725/725 [00:00<?, ?B/s]
tokenizer_2/vocab.json: 100%|█████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 1.43MB/s]
tokenizer_2/merges.txt: 100%|████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 927kB/s]
tokenizer_2/special_tokens_map.json: 100%|████████████████████████████████████████████████████| 460/460 [00:00<?, ?B/s]
text_encoder/config.json: 100%|███████████████████████████████████████████████████████████████| 560/560 [00:00<?, ?B/s]
model.fp16.safetensors: 100%|███████████████████████████████████████████████████████| 246M/246M [00:30<00:00, 8.04MB/s]
text_encoder_2/config.json: 100%|█████████████████████████████████████████████████████████████| 570/570 [00:00<?, ?B/s]
model.fp16.safetensors: 100%|█████████████████████████████████████████████████████| 1.39G/1.39G [02:35<00:00, 8.95MB/s]
vae/config.json: 100%|█████████████████████████████████████████████████████████████████| 602/602 [00:00<00:00, 591kB/s]
diffusion_pytorch_model.fp16.safetensors: 100%|█████████████████████████████████████| 167M/167M [00:21<00:00, 7.89MB/s]
unet/config.json: 100%|███████████████████████████████████████████████████████████████████| 1.68k/1.68k [00:00<?, ?B/s]
diffusion_pytorch_model.fp16.safetensors: 100%|███████████████████████████████████| 5.14G/5.14G [10:34<00:00, 8.09MB/s]
C:\Users\xxx\Documents\Omost\lib_omost\pipeline.py:64: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  alphas_cumprod = torch.tensor(np.cumprod(alphas, axis=0), dtype=torch.float32)
Unload to CPU: CLIPTextModel
Unload to CPU: UNet2DConditionModel
Unload to CPU: AutoencoderKL
Unload to CPU: CLIPTextModel
config.json: 100%|████████████████████████████████████████████████████████████████| 1.20k/1.20k [00:00<00:00, 1.23MB/s]
D:\Anaconda\envs\omost\lib\site-packages\huggingface_hub\file_download.py:157: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\xxx\Documents\Omost\hf_download\hub\models--lllyasviel--omost-llama-3-8b-4bits. [rest of the symlink warning repeats as above]
  warnings.warn(message)
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
model.safetensors.index.json: 100%|██████████████████████████████████████████████████| 132k/132k [00:00<00:00, 804kB/s]
model-00001-of-00002.safetensors: 100%|███████████████████████████████████████████| 4.65G/4.65G [07:54<00:00, 9.81MB/s]
model-00002-of-00002.safetensors: 100%|███████████████████████████████████████████| 1.05G/1.05G [01:55<00:00, 9.13MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████| 2/2 [09:50<00:00, 295.08s/it]
Traceback (most recent call last):
  File "C:\Users\xxx\Documents\Omost\gradio_app.py", line 75, in <module>
    llm_model = AutoModelForCausalLM.from_pretrained(
  File "D:\Anaconda\envs\omost\lib\site-packages\transformers\models\auto\auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "D:\Anaconda\envs\omost\lib\site-packages\transformers\modeling_utils.py", line 3703, in from_pretrained
    hf_quantizer.validate_environment(device_map=device_map)
  File "D:\Anaconda\envs\omost\lib\site-packages\transformers\quantizers\quantizer_bnb_4bit.py", line 85, in validate_environment
    raise ValueError(
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
```
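
The thread suggests there is no command-line switch for model selection; the route taken in the next comment is editing gradio_app.py directly (around line 71). A hedged sketch of that swap — the variable name here is hypothetical, but the model ids are the ones discussed in this thread:

```python
# gradio_app.py, around line 71 (per the next comment). `llm_name` is a
# hypothetical variable name; check the actual identifier in the repo.
llm_name = 'lllyasviel/omost-llama-3-8b-4bits'   # model the log above downloads
# llm_name = 'lllyasviel/omost-phi-3-mini-128k'  # smaller alternative to try
```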

nomopo45 commented 4 weeks ago

OK, so I managed to start the app by changing the model used in gradio_app.py at line 71, but now I have a new error:

```
Traceback (most recent call last):
  File "D:\Anaconda\envs\omost\lib\site-packages\gradio\queueing.py", line 528, in process_events
    response = await route_utils.call_process_api(
  File "D:\Anaconda\envs\omost\lib\site-packages\gradio\route_utils.py", line 270, in call_process_api
    output = await app.get_blocks().process_api(
  File "D:\Anaconda\envs\omost\lib\site-packages\gradio\blocks.py", line 1908, in process_api
    result = await self.call_function(
  File "D:\Anaconda\envs\omost\lib\site-packages\gradio\blocks.py", line 1497, in call_function
    prediction = await utils.async_iteration(iterator)
  File "D:\Anaconda\envs\omost\lib\site-packages\gradio\utils.py", line 632, in async_iteration
    return await iterator.__anext__()
  File "D:\Anaconda\envs\omost\lib\site-packages\gradio\utils.py", line 758, in asyncgen_wrapper
    response = await iterator.__anext__()
  File "C:\Users\xxx\Documents\Omost\chat_interface.py", line 554, in _stream_fn
    first_response, first_interrupter = await async_iteration(generator)
  File "D:\Anaconda\envs\omost\lib\site-packages\gradio\utils.py", line 632, in async_iteration
    return await iterator.__anext__()
  File "D:\Anaconda\envs\omost\lib\site-packages\gradio\utils.py", line 625, in __anext__
    return await anyio.to_thread.run_sync(
  File "D:\Anaconda\envs\omost\lib\site-packages\anyio\to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "D:\Anaconda\envs\omost\lib\site-packages\anyio\_backends\_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
  File "D:\Anaconda\envs\omost\lib\site-packages\anyio\_backends\_asyncio.py", line 859, in run
    result = context.run(func, *args)
  File "D:\Anaconda\envs\omost\lib\site-packages\gradio\utils.py", line 608, in run_sync_iterator_async
    return next(iterator)
  File "D:\Anaconda\envs\omost\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "C:\Users\xxx\Documents\Omost\gradio_app.py", line 164, in chat_fn
    for text in streamer:
  File "D:\Anaconda\envs\omost\lib\site-packages\transformers\generation\streamers.py", line 223, in __next__
    value = self.text_queue.get(timeout=self.timeout)
  File "D:\Anaconda\envs\omost\lib\queue.py", line 179, in get
    raise Empty
_queue.Empty
Last assistant response is not valid canvas: expected string or bytes-like object
```

Do you have any idea what the issue could be?
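
For context: the `_queue.Empty` here is raised by `TextIteratorStreamer` when no token arrives within its timeout, which usually means the background generation thread crashed or stalled rather than anything wrong in the streamer itself. As a diagnostic, one hedged option is to disable the timeout so a slow first token cannot masquerade as this error. This is a minimal sketch using the standard transformers API; the tokenizer construction stands in for whatever gradio_app.py actually does:

```python
from transformers import AutoTokenizer, TextIteratorStreamer

# Hypothetical stand-in for the tokenizer gradio_app.py already creates.
tokenizer = AutoTokenizer.from_pretrained("lllyasviel/omost-llama-3-8b-4bits")

streamer = TextIteratorStreamer(
    tokenizer,
    timeout=None,              # block instead of raising queue.Empty on slow tokens
    skip_prompt=True,          # don't echo the prompt back into the stream
    skip_special_tokens=True,  # passed through to tokenizer.decode()
)
```

If the stream then simply hangs forever, the generation thread itself is dying, which matches the later comments in this thread.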

nomopo45 commented 4 weeks ago

OK, so I made changes, but I think there is a new problem. I followed this solution: https://huggingface.co/meta-llama/Meta-Llama-3-8B/discussions/34 (added a custom_triu function to modeling_llama.py), and the previous error is gone.
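
For readers who don't want to open the link: that workaround replaces `torch.triu`, whose kernel lacks support for some dtypes on older torch builds, with a hand-rolled equivalent inside modeling_llama.py. A hedged reconstruction of that kind of patch (the exact code in the linked discussion may differ):

```python
import torch

def custom_triu(x: torch.Tensor, diagonal: int = 0) -> torch.Tensor:
    """Mask-based replacement for torch.triu, avoiding the dtype-restricted kernel."""
    rows, cols = x.shape[-2], x.shape[-1]
    row_idx = torch.arange(rows, device=x.device).unsqueeze(-1)
    col_idx = torch.arange(cols, device=x.device).unsqueeze(0)
    # Keep entries on or above the requested diagonal, zero out the rest.
    mask = (col_idx - row_idx) >= diagonal
    return x * mask.to(x.dtype)
```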

But it is now generating nonsense text and never stops. I'm not sure what my problem is now.

[screenshot of the nonsense output]

nomopo45 commented 4 weeks ago

OK, the same error as before has come back now:

File "D:\Anaconda\envs\omost\lib\queue.py", line 179, in get raise Empty _queue.Empty Last assistant response is not valid canvas: expected string or bytes-like object

ms6rb commented 3 weeks ago

Same error here:

```
\Omost\lib_omost\pipeline.py:64: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  alphas_cumprod = torch.tensor(np.cumprod(alphas, axis=0), dtype=torch.float32)
Unload to CPU: AutoencoderKL
Unload to CPU: CLIPTextModel
Unload to CPU: CLIPTextModel
Unload to CPU: UNet2DConditionModel
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Traceback (most recent call last):
  File "C:\Users\qwdqw\Projects\Omost\gradio_app.py", line 75, in <module>
    llm_model = AutoModelForCausalLM.from_pretrained(
  File "C:\Users\qwdqw\miniconda3\envs\omost\lib\site-packages\transformers\models\auto\auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "C:\Users\qwdqw\miniconda3\envs\omost\lib\site-packages\transformers\modeling_utils.py", line 3703, in from_pretrained
    hf_quantizer.validate_environment(device_map=device_map)
  File "C:\Users\qwdqw\miniconda3\envs\omost\lib\site-packages\transformers\quantizers\quantizer_bnb_4bit.py", line 85, in validate_environment
    raise ValueError(
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
```
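
For this ValueError, the message itself points at the fix: with only ~6GB of VRAM, `device_map="auto"` pushes some modules to the CPU, which the 4-bit quantizer rejects unless CPU offload is explicitly allowed. A hedged sketch of what that looks like in current transformers — `llm_int8_enable_fp32_cpu_offload` on `BitsAndBytesConfig` is the flag that corresponds to the error's `load_in_8bit_fp32_cpu_offload` hint, though whether Omost's pipeline then runs acceptably fast is untested:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,  # keep offloaded modules in fp32 on CPU
)

llm_model = AutoModelForCausalLM.from_pretrained(
    "lllyasviel/omost-llama-3-8b-4bits",
    quantization_config=quant_config,
    device_map="auto",  # let accelerate split layers between GPU and CPU
)
```
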
Quartiermeister commented 3 weeks ago

This error occurs every time after some images are generated and I enter a new prompt:

```
You shouldn't move a model that is dispatched using accelerate hooks.
Load to GPU: LlamaForCausalLM
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Traceback (most recent call last):
  File "G:\Works\Omost\lib\site-packages\gradio\queueing.py", line 528, in process_events
    response = await route_utils.call_process_api(
  File "G:\Works\Omost\lib\site-packages\gradio\route_utils.py", line 270, in call_process_api
    output = await app.get_blocks().process_api(
  File "G:\Works\Omost\lib\site-packages\gradio\blocks.py", line 1908, in process_api
    result = await self.call_function(
  File "G:\Works\Omost\lib\site-packages\gradio\blocks.py", line 1497, in call_function
    prediction = await utils.async_iteration(iterator)
  File "G:\Works\Omost\lib\site-packages\gradio\utils.py", line 632, in async_iteration
    return await iterator.__anext__()
  File "G:\Works\Omost\lib\site-packages\gradio\utils.py", line 758, in asyncgen_wrapper
    response = await iterator.__anext__()
  File "G:\Works\Omost\chat_interface.py", line 554, in _stream_fn
    first_response, first_interrupter = await async_iteration(generator)
  File "G:\Works\Omost\lib\site-packages\gradio\utils.py", line 632, in async_iteration
    return await iterator.__anext__()
  File "G:\Works\Omost\lib\site-packages\gradio\utils.py", line 625, in __anext__
    return await anyio.to_thread.run_sync(
  File "G:\Works\Omost\lib\site-packages\anyio\to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "G:\Works\Omost\lib\site-packages\anyio\_backends\_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
  File "G:\Works\Omost\lib\site-packages\anyio\_backends\_asyncio.py", line 859, in run
    result = context.run(func, *args)
  File "G:\Works\Omost\lib\site-packages\gradio\utils.py", line 608, in run_sync_iterator_async
    return next(iterator)
  File "G:\Works\Omost\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "G:\Works\Omost\gradio_app.py", line 164, in chat_fn
    for text in streamer:
  File "G:\Works\Omost\lib\site-packages\transformers\generation\streamers.py", line 223, in __next__
    value = self.text_queue.get(timeout=self.timeout)
  File "G:\Works\Omost\lib\queue.py", line 179, in get
    raise Empty
_queue.Empty
```
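
The first line of that log may be the clue: Omost's "Load to GPU / Unload to CPU" memory management moves models between devices, but a model loaded with `device_map="auto"` is managed by accelerate dispatch hooks and must not be moved manually; generation after such a move can fail and leave the streamer queue empty. A minimal illustration of the pattern the warning complains about, assuming the model id used earlier in the thread:

```python
from transformers import AutoModelForCausalLM

# Loading with device_map lets accelerate attach dispatch hooks...
llm = AutoModelForCausalLM.from_pretrained(
    "lllyasviel/omost-llama-3-8b-4bits",
    device_map="auto",
)

# ...after which manually relocating the model is the anti-pattern that prints
# "You shouldn't move a model that is dispatched using accelerate hooks."
llm.to("cuda")
```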

runshengdu commented 1 week ago

How do I fix it? :(

nomopo45 commented 1 week ago

Still no clue. If someone has any guidance to give me, I'm willing to try a bit on my own.