beltonk opened this issue 1 year ago
Hi, can you try the CLI inference? It can give a more explicit error message, as this looks like a model-related issue.
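For reference, a minimal CLI invocation looks roughly like the sketch below; the exact flags (in particular `--device` and the quantization switches) can differ between LLaVA versions, so treat it as an illustration rather than the exact supported command:

```bash
# Run LLaVA's CLI inference directly (no controller/worker/Gradio involved).
# The image URL is illustrative; --device mps targets Apple Silicon.
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-13b \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --device mps
```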
I just tried it: the CLI works with 13b, but it is super slow, something like zero to a few words per minute. The controller/worker/Gradio setup still does not work. The same result is reproducible every time after a clean reboot.
The CLI with 7b works well and fast, just like the controller/worker/Gradio setup does.
Mine is an M1 Max with 32 GB RAM.
You can check Activity Monitor for the RAM usage. Maybe the 13B model is using too much memory and is hitting swap.
If that is the case, you can try https://github.com/oobabooga/text-generation-webui/pull/4305 for now, before we officially support quantization on macOS / Windows.
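As a side note, a quick way to confirm the swap hypothesis from the terminal (outside Activity Monitor) is to watch the built-in macOS counters while the 13b worker is loading:

```bash
# Show how much swap macOS has allocated and how much is currently in use.
sysctl vm.swapusage

# System-wide memory pressure report.
memory_pressure

# Paging statistics, refreshed every second (Ctrl-C to stop).
vm_stat 1
```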
I agree it is probably a RAM issue. I monitored CPU/GPU/RAM while running them. Memory usage was consistently near the limit, somewhere between 29 and 31 GB (of 32 GB) for 13b, with consistently high memory pressure. It was the same for both CLI and non-CLI, but only the CLI yields a result.
I tried llama.cpp's llava-cli example too, and it works perfectly with the pre-converted 7b models: f16, q5_k, and q4_k all respond pretty fast. The 13b q4_k and q5_k models also run pretty fast. I haven't tried the 13b f16 model yet. But it shows that quantization will be the solution, as you said.
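For anyone else trying the same route, llama.cpp's llava-cli is typically invoked along these lines; the model and projector file names below follow the pre-converted repos mentioned in this thread, and the flags may vary with the llama.cpp version, so adjust to what you actually downloaded:

```bash
# Run llama.cpp's LLaVA example against a quantized GGUF model.
# --mmproj is the vision projector; -ngl 1 offloads layers to the Metal GPU.
./llava-cli \
    -m ggml-model-q4_k.gguf \
    --mmproj mmproj-model-f16.gguf \
    --image view.jpg \
    -p "Describe this image." \
    -ngl 1
```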
Will try https://github.com/oobabooga/text-generation-webui/pull/4305 later too.
Looking forward to quantization support! Amazing work here!
I just tried https://github.com/oobabooga/text-generation-webui; it works, but very slowly.
I'm not sure I'm doing the right thing. I used the pre-converted q4/q5 models from https://huggingface.co/mys/ggml_llava-v1.5-13b (see the download sketch below), and inference works with or without --load-in-4bit, just slowly.
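In case it helps anyone reproduce this, the pre-converted GGUF files can be fetched with huggingface-cli; the file names here are illustrative, so check the repo listing for the exact ones:

```bash
# Download the quantized language model and the vision projector
# from the pre-converted repo referenced above.
huggingface-cli download mys/ggml_llava-v1.5-13b ggml-model-q4_k.gguf --local-dir .
huggingface-cli download mys/ggml_llava-v1.5-13b mmproj-model-f16.gguf --local-dir .
```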
However, if I use --model liuhaotian_llava-v1.5-13b, it says:
ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes: `pip install -i https://test.pypi.org/simple/ bitsandbytes` or `pip install bitsandbytes`
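The usual fix for that ImportError is simply installing the two packages the message names. Note, though (a general observation, not something confirmed in this thread), that bitsandbytes 4-/8-bit quantization targets CUDA GPUs, so on an M1 the GGUF route above is likely the more practical path:

```bash
# Install the packages the ImportError asks for.
pip install accelerate
pip install bitsandbytes
```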
How did you post to the server after starting the controller and the model worker?
Sorry, I'm not sure what you mean.
Do you mean the sequence of launching LLaVA? Controller -> Model Worker -> Gradio (see the sketch below).
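For completeness, that sequence corresponds roughly to the commands below; the ports are the defaults from the LLaVA README, and `--device mps` is an assumption for Apple Silicon that may not be needed or supported in every version:

```bash
# 1. Start the controller.
python -m llava.serve.controller --host 0.0.0.0 --port 10000

# 2. Start a model worker and register it with the controller.
python -m llava.serve.model_worker --host 0.0.0.0 \
    --controller http://localhost:10000 \
    --port 40000 --worker http://localhost:40000 \
    --model-path liuhaotian/llava-v1.5-13b \
    --device mps

# 3. Start the Gradio web server.
python -m llava.serve.gradio_web_server \
    --controller http://localhost:10000 --model-list-mode reload
```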
Or do you mean the textgen command? I tried something like this:
python server.py --model ggml-model-q4_k.gguf --multimodal-pipeline llava-v1.5-13b --load-in-4bit
python server.py --model liuhaotian_llava-v1.5-13b --multimodal-pipeline llava-v1.5-13b --disable_exllama --load-in-4bit
python server.py --model liuhaotian_llava-v1.5-13b --multimodal-pipeline llava-v1.5-13b --load-in-4bit
Describe the issue
Issue: It keeps showing "NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE." when I click Send with the example image and prompt. No detailed error is shown, just "Caught Unknown Error". I added a try/except to catch the error, and it appears to break at `for new_text in streamer:` in generate_stream(), where queue.py raises an Empty exception. llava-v1.5-7b works, so something could be going wrong when loading the model; I am not sure whether `model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)` succeeds or not.
I tried:
llava-v1.5-7b works correctly however.
Log: