haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] Network Error while Inferencing with llava-v1.5-13b on Apple M1 #797

Open beltonk opened 1 year ago

beltonk commented 1 year ago

Describe the issue

Issue: It keeps showing "NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE." when I click send with the example image and prompt. No detailed error is shown, just "Caught Unknown Error". I tried to use try/except to catch the error, and it appears to break at `for new_text in streamer:` in `generate_stream()`, where `queue.py` raises an Empty exception. llava-v1.5-7b works, so it could be something wrong in loading the model; I'm not sure whether `model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)` succeeds or not...
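The wrapping I tried looks roughly like this, as a minimal, self-contained sketch (the names here are stand-ins, not LLaVA's actual worker code): because the worker streams tokens from a background thread, if generation dies the consumer only sees an empty queue instead of the real exception.

```python
# Stand-in sketch: capture the real exception from the generation thread
# instead of only seeing queue.Empty / "Caught Unknown Error" downstream.
import queue
import threading
import traceback

def generate(out_q: queue.Queue, errors: list):
    """Stand-in for model.generate(..., streamer=streamer)."""
    try:
        raise RuntimeError("simulated failure inside generation")
    except Exception as exc:
        errors.append(exc)        # keep the real exception for the consumer
        traceback.print_exc()     # and log its traceback
    finally:
        out_q.put(None)           # sentinel so the consumer never hangs

out_q, errors = queue.Queue(), []
threading.Thread(target=generate, args=(out_q, errors)).start()
out_q.get(timeout=10)             # consumer side ("for new_text in streamer:")
if errors:
    print("real error:", errors[0])
```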

I tried:

llava-v1.5-7b works correctly however.

Log:

2023-11-13 03:08:24 | ERROR | stderr | 
2023-11-13 03:08:36 | INFO | model_worker | Register to controller
2023-11-13 03:08:36 | ERROR | stderr | INFO:     Started server process [91636]
2023-11-13 03:08:36 | ERROR | stderr | INFO:     Waiting for application startup.
2023-11-13 03:08:36 | ERROR | stderr | INFO:     Application startup complete.
2023-11-13 03:08:36 | ERROR | stderr | INFO:     Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit)
2023-11-13 03:08:45 | INFO | stdout | INFO:     127.0.0.1:57618 - "POST /worker_get_status HTTP/1.1" 200 OK
2023-11-13 03:08:51 | INFO | model_worker | Send heart beat. Models: ['llava-v1.5-13b']. Semaphore: None. global_counter: 0
2023-11-13 03:08:56 | INFO | stdout | INFO:     127.0.0.1:57643 - "POST /worker_get_status HTTP/1.1" 200 OK
2023-11-13 03:08:58 | INFO | model_worker | Send heart beat. Models: ['llava-v1.5-13b']. Semaphore: Semaphore(value=4, locked=False). global_counter: 1
2023-11-13 03:08:58 | INFO | stdout | INFO:     127.0.0.1:57653 - "POST /worker_generate_stream HTTP/1.1" 200 OK
2023-11-13 03:09:06 | INFO | model_worker | Send heart beat. Models: ['llava-v1.5-13b']. Semaphore: Semaphore(value=4, locked=False). global_counter: 1
2023-11-13 03:09:13 | INFO | stdout | Caught Unknown Error
haotian-liu commented 12 months ago

Hi, can you try CLI inference? It can give a more explicit error message, as this looks like a model-related issue.

https://github.com/haotian-liu/LLaVA#cli-inference
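For reference, the command in that section looks roughly like this (with the model path switched to the 13B checkpoint you are testing):

```shell
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-13b \
    --image-file "https://llava-vl.github.io/static/images/view.jpg"
```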

beltonk commented 12 months ago

I just tried the CLI. It works with 13b, but it's super slow, producing zero to a few words per minute... And controller/worker/gradio still doesn't work. The same result is reproducible every time, even after a clean reboot.

The CLI with 7b works well and fast, just like it does through controller/worker/gradio.

beltonk commented 12 months ago

Mine is an M1 Max with 32 GB RAM.

haotian-liu commented 12 months ago

You can check Activity Monitor for the RAM usage. Maybe the 13B model is using too much memory and swapping.
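If you want a scriptable check of the same thing, something like this works (a minimal sketch; assumes `psutil` is installed via `pip install psutil`):

```python
# Print overall RAM and swap usage while the 13B worker is running.
import psutil

vm = psutil.virtual_memory()
sw = psutil.swap_memory()
print(f"RAM:  {vm.used / 2**30:.1f} / {vm.total / 2**30:.1f} GiB used")
print(f"Swap: {sw.used / 2**30:.1f} GiB used")
```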

If that's the case, you can try https://github.com/oobabooga/text-generation-webui/pull/4305 for now, before we officially support quantization on macOS / Windows.

beltonk commented 12 months ago

I agree it's probably a RAM issue. I monitored CPU/GPU/RAM while running them. Memory usage was consistently at the margin, somewhere between 29 and 31 GB for 13b, with consistently high memory pressure. This was the same for both CLI and non-CLI, but only the CLI yields a result.

I tried llama.cpp's llava-cli example too, and it works perfectly with the pre-converted 7b models: f16, q5_k, and q4_k all respond pretty fast. The 13b q4_k and q5_k models also run pretty fast; I haven't tried the 13b f16 model yet. But this suggests that quantization will be the solution, like you said.
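The invocation looks roughly like this (file names follow the mys/ggml_llava model repos, which ship a matching mmproj projector file; adjust paths as needed):

```shell
./llava-cli -m ggml-model-q4_k.gguf \
    --mmproj mmproj-model-f16.gguf \
    --image some-image.jpg \
    -p "Describe the image."
```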

I'll try https://github.com/oobabooga/text-generation-webui/pull/4305 later too.

Looking forward to quantization support! Amazing work here!

beltonk commented 12 months ago

I just tried https://github.com/oobabooga/text-generation-webui; it works, but it's very slow.

I'm not sure I'm doing the right thing. I used the pre-converted q4/q5 models from https://huggingface.co/mys/ggml_llava-v1.5-13b, and there's no issue with inference, with or without `--load-in-4bit`; it's just slow.

However, if I use `--model liuhaotian_llava-v1.5-13b`, it fails with: `ImportError: Using load_in_8bit=True requires Accelerate: pip install accelerate and the latest version of bitsandbytes: pip install -i https://test.pypi.org/simple/ bitsandbytes or pip install bitsandbytes`

YerongLi commented 12 months ago

How did you post to the server after starting the controller and the model worker?

beltonk commented 12 months ago

Sorry, I'm not sure about your question.

Do you mean the sequence of launching LLaVA? Controller -> Model Worker -> Gradio.
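If it's the former, I launched them following the README, roughly:

```shell
python -m llava.serve.controller --host 0.0.0.0 --port 10000
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 \
    --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
```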

Or do you mean the textgen command? I tried things like these:

python server.py --model ggml-model-q4_k.gguf --multimodal-pipeline llava-v1.5-13b --load-in-4bit
python server.py --model liuhaotian_llava-v1.5-13b --multimodal-pipeline llava-v1.5-13b --disable_exllama --load-in-4bit
python server.py --model liuhaotian_llava-v1.5-13b --multimodal-pipeline llava-v1.5-13b --load-in-4bit