oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

2 "ease of use" features i would like to see implemented #4908

Closed Butterfly-Dragon closed 7 months ago

Butterfly-Dragon commented 10 months ago

Description

I would like the following features to be implemented:

  1. a progress bar to "see" whether the program is working on the next answer and how far along it is in generating one.
  2. end-user-readable/understandable generation presets via a "preview" button.

Additional Context

  1. A progress bar showing how far along it is in the context analysis and how many tokens will need to be generated.

I know this is not possible with all the generation methods, but it is usually possible to estimate how far along the context analysis is before the answer starts being generated, given that the context analysis is usually the part that takes the most time and also the part whose progress can be easily calculated.

The actual answer generation is usually over in a couple of minutes once it actually starts "typing".

So: how much time has elapsed vs how much time is left and finally the seconds per token.

This is the normal situation I find myself in:

UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
function 'cadam32bit_grad_fp32' not found
2023-12-12 22:10:36 INFO:Loading settings from settings.yaml...
2023-12-12 22:10:36 INFO:Loading Open-Orca_Mistral-7B-OpenOrca...
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:29<00:00, 44.53s/it]
2023-12-12 22:12:06 INFO:LOADER: Transformers
2023-12-12 22:12:06 INFO:TRUNCATION LENGTH: 32768
2023-12-12 22:12:06 INFO:INSTRUCTION TEMPLATE: ChatML
2023-12-12 22:12:06 INFO:Loaded the model in 90.15 seconds.
2023-12-12 22:12:06 INFO:Loading the extension "gallery"...
2023-12-12 22:12:06 INFO:Loading the extension "openai"...
2023-12-12 22:12:06 INFO:OpenAI-compatible API URL:

http://127.0.0.1:5000

INFO:     Started server process [29576]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:5000 (Press CTRL+C to quit)
Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.
Output generated in 1532.86 seconds (0.02 tokens/s, 29 tokens, context 9516, seed 183476592)
Output generated in 2284.18 seconds (0.08 tokens/s, 183 tokens, context 9516, seed 1174764164)
Output generated in 1929.46 seconds (0.04 tokens/s, 82 tokens, context 10464, seed 353731322)
Output generated in 2251.48 seconds (0.04 tokens/s, 101 tokens, context 10984, seed 964655822)
Output generated in 2634.90 seconds (0.07 tokens/s, 175 tokens, context 10984, seed 1867161440)
Output generated in 2490.70 seconds (0.05 tokens/s, 131 tokens, context 11208, seed 480949679)
Output generated in 2245.73 seconds (0.02 tokens/s, 43 tokens, context 11370, seed 1628674250)
Output generated in 2218.19 seconds (0.01 tokens/s, 19 tokens, context 11512, seed 1842283767)

As you can see, on average it takes somewhere between 25 and 45 minutes to generate an answer on the oven/potato multiclass excuse for a computer I am using to run the LLM (as well as to work and do other stuff).


Given I am not using a server cluster to run this thing, I would like some sort of progress bar to see how far along it is, and the option to change that "tokens/sec" to "sec/token".

Sometimes I do not even know whether the thing is still working on the next answer or whether it gave up generating one "because reasons", and re-opening the text generation web UI shows my typed input still sitting there as part of the "stuff to send to the LLM".

Ideally it would look like something along the lines of this random image generator:

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [5:57:41<00:00, 282.39s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 76/76 [2:37:51<00:00, 124.62s/it]
 93%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                | 71/76 [3:41:08<20:19, 243.96s/it]
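
To illustrate the arithmetic such a bar would need, here is a rough sketch in plain Python (not the webui's actual code); the rates are placeholders roughly fitted to the log above:

```python
# Rough illustration of the numbers a progress readout could show: an ETA for
# the context analysis and a "sec/token" figure for the typing phase.
# The rates below are placeholders, not measurements from the webui itself.
def progress_estimate(context_tokens, prompt_s_per_token, gen_s_per_token, max_new_tokens):
    """Return a human-readable ETA string for one generation request."""
    prompt_eta = context_tokens * prompt_s_per_token   # time before the first token appears
    typing_eta = max_new_tokens * gen_s_per_token      # worst-case "typing" time
    return (f"context analysis: ~{prompt_eta / 60:.0f} min for {context_tokens} tokens | "
            f"typing: {gen_s_per_token:.2f} s/token, up to ~{typing_eta / 60:.0f} min")

# Example with rates in the ballpark of the log above
# (about 0.145 s per context token and 4.45 s per generated token):
print(progress_estimate(context_tokens=10464, prompt_s_per_token=0.145,
                        gen_s_per_token=4.45, max_new_tokens=200))
# -> context analysis: ~25 min for 10464 tokens | typing: 4.45 s/token, up to ~15 min
```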
  2. Finally, under "parameters > generation > presets", for all of the "temperature", "top_p", "min_p", etc. settings: it is one thing to "know the math" and a different thing to "have a feel for the results".

An ease-of-use implementation would be to take 3 specific seeds and 3 specific prompts and have the program generate 3 answers, showing the end user what will be produced by the given model with the hardcoded seeds and prompts as a preview of the settings. This would allow the end user to compare at a glance what quality of text each preset will produce.

Example prompts:

Such answers could be generated via a "preview" button: if the user chooses to "waste" some time/energy to get "a feel" for the presets and the models, they press the button and the server starts chugging, first checking whether the cached answers were generated with the current model hashes or whether the models have changed since the last generation. It then takes the given seeds and prompts and, for each model whose hash is new or mismatching, generates preview answers for each of the saved presets.

This preview could be saved and shown above the "loader" to the right of the preset choice.

Optional 1: Options are always nice, so the end user should be allowed to change the predetermined prompts/seeds via a prompt-collecting window (just one, shown before generating all the answers for all the prompts, presets and models) where the user can edit the prompts or add ones more relevant to their needs, together with the corresponding seeds. A JSON file should be added to save these user-given prompts when they differ from the standard ones.

Optional 2: give the end user the option to spend even more time and also do a "per loader" preview. At this point I am starting to feel like this is "asking it to make coffee", since the loader is normally predetermined by the model being used, so it makes very little sense to fiddle with that specific setting except for experimentation. This is much lower priority, but it is still an ease-of-use thing to be able to see at a glance what changes.
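
To make the bookkeeping concrete, here is a minimal sketch of what I am imagining; every name in it (including `generate_with_preset`, and the placeholder prompts and seeds) is made up for illustration, and none of it is the webui's actual code:

```python
# Hypothetical sketch of the preview caching described above: fingerprint the
# model files, and only regenerate previews when that fingerprint changes.
import hashlib
import json
from pathlib import Path

# Placeholder prompts/seeds for illustration only, not the ones from this issue.
PREVIEW_PROMPTS = ["Write a short story about a dragon.",
                   "Explain photosynthesis to a child.",
                   "Summarize the plot of Romeo and Juliet."]
PREVIEW_SEEDS = [42, 1234, 987654]            # one fixed seed per fixed prompt
CACHE_FILE = Path("presets_preview_cache.json")

def model_hash(model_path: Path) -> str:
    """Cheap fingerprint of the model directory (file names and sizes)."""
    h = hashlib.sha256()
    for f in sorted(model_path.glob("*")):
        h.update(f.name.encode())
        h.update(str(f.stat().st_size).encode())
    return h.hexdigest()

def build_previews(model_path: Path, presets: dict, generate_with_preset) -> dict:
    """Generate one preview per (preset, prompt, seed) combination and cache the
    results keyed by model hash, so pressing "preview" twice costs nothing."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    key = model_hash(model_path)
    if key not in cache:
        cache[key] = {
            preset_name: [generate_with_preset(prompt, seed=seed, **params)
                          for prompt, seed in zip(PREVIEW_PROMPTS, PREVIEW_SEEDS)]
            for preset_name, params in presets.items()
        }
        CACHE_FILE.write_text(json.dumps(cache, indent=2))
    return cache[key]
```

The point is just that hashing the model files keeps the "preview" button cheap to press a second time: previews are only regenerated when a model actually changes.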

Reezlaw commented 10 months ago

Are you sure you're not running on the CPU? That performance on a 2080 is not normal IMHO

Butterfly-Dragon commented 10 months ago

Yes, I am running on the CPU so I can still use the PC to do other things while waiting for it to write. However, the GPU performance is... not that much faster either (still only 8 GB of VRAM); to be frank, it just takes half the time. So, between something taking 10-15 minutes to answer while locking me out of my PC entirely, or 20-30 minutes while I can still pretty much use my PC as normal, I might as well.

Woisek commented 10 months ago

@Butterfly-Dragon There is definitely something fishy going on with your setup. I RP with kobold.cpp and ST using 20b models and I get responses in around 4 minutes from the start of generation to being written to the screen. And that is with a 2070 SUPER with 8 GB VRAM. With a 7b model it should be even faster.

Butterfly-Dragon commented 10 months ago

the "something fishy" is probably "i do not do just that exclusively" as i use the same PC to work and do other stuff.

Woisek commented 10 months ago

Well, I don't do it "exclusively" either; I guess (almost) no one does. But using LLMs or SD is something that uses all the available VRAM. Not having that much means we can only use it for this one moment/task, with limited possibilities of using something else. And I have almost the same tech specs as you and it still performs way better here. I don't know what you are doing when using an LLM, but I use it 96% for RP, and when I do, I don't do anything else besides maybe browsing the web if the answer takes its time.

Butterfly-Dragon commented 10 months ago

eh, in any case:

Reezlaw commented 10 months ago

I suppose the LLM equivalent of seeing the image emerge from the noise is streaming: you see the words appear as they are generated.

Butterfly-Dragon commented 10 months ago

Yes, but these were my proposals from the start and we ended up discussing other stuff. 😅

Touch-Night commented 10 months ago

A progress bar before the model actually starts typing would be useful, because if you press the stop button during this phase the webui will likely crash on your next request, and you don't know whether it is still working on the stopped request.

Butterfly-Dragon commented 9 months ago

After the conversation above I did some extensive testing. The CPU vs GPU difference on my system is minimal and does not come up in normal operation because it is smaller than the average difference between prompts and answers. Maybe if I did not use models that are twice the size of my GPU memory I would see a difference, but as it stands I see next to no difference in performance.

I have to set the gpu-memory to 6000; with 7000 I get errors as soon as I try to do anything which is not VERY STRICTLY just chatting with the AI.

It is true that with a context shorter than 10240 tokens it takes less than about 10-15 minutes before it starts "typing", but then it is also "pointless": if I did not want a custom character I would just chat with Bing, ChatGPT or Bard because they are faster. If I chat with Assistant (or any empty character with no information) from a conversation with near-zero context, it starts typing after about 2 minutes.

But, once again, with a 2080 with 8 GB of video memory and an i9 9900HK I see exactly zero difference in time needed between CPU-only and GPU with CPU offloading.

Since I have never loaded a model that is less than 6 GB in size (and honestly I fail to see any utility in such a tiny model, unless somebody comes up with a super funky way to pack more vectors into that same space, which would be weird given how compact these model files already are), I cannot say whether for such models the process is faster on GPU only. In any case, unless I start investing in laptops with AI-dedicated hardware (I am pretty sure an RTX 4000 Ada Generation SFF can fit in a laptop frame and would allow me to do simulations, interwezors away, play games, and also do AI), I think I will stay with what I have and keep chugging, chugging, chugging.

The average response if I load something like https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10/tree/main is 10 minutes starting from zero. I normally use https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/tree/main and this takes 2 minutes starting from zero.

Again, I see zero difference between CPU-only and GPU with CPU offload. Maybe the PC runs... cooler with CPU? But that is basically it.

I might try and check whether this is better, since it is advertised as "the same thing, but a different file type", though I honestly doubt it will make much of a difference: https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/tree/main. I do not see how it can be "the same thing" if you completely change the vector structure inside the model; PyTorch and GGUF look nothing alike. Plus, with GGUF I cannot avoid quantization, so it's like a PNG vs. a JPG. In my case I would have to go for the 5-bit version, https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/blob/main/mistral-7b-openorca.Q5_K_S.gguf, so... quite the loss.

Butterfly-Dragon commented 9 months ago

Due to the comments about the speed, I had to check how "fast" the AI is on my PC. On average it needs 0.145 seconds per context token (so it processes context at about 6.9 tokens/s) before it starts writing; then it "types" at 1 token every 4.45 seconds (0.225 tokens/s). It gets "faster" with more compressed GGUF models, but I also checked, and at around 8K context those models break down and become useless, producing random tokens, and I am speaking of the highest-quality ones. The speed stayed very similar no matter how small a model I took: it shaves at best 0.5 seconds per token while typing, but the context-analysis speed stays identical; the model just... breaks down after a few interactions and starts producing random letters and punctuation. The main difference I saw was how soon each model started "breaking down".
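
For what it is worth, those rates line up reasonably well with the timings in the log from my first post; a quick back-of-the-envelope check, nothing more:

```python
# The second line of that log was 2284 s for 183 new tokens at a context of 9516.
context_tokens, new_tokens = 9516, 183
estimate = context_tokens * 0.145 + new_tokens * 4.45   # context analysis + "typing"
print(f"{estimate:.0f} s ~= {estimate / 60:.0f} min")   # 2194 s ~= 37 min, vs 2284 s logged
```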

github-actions[bot] commented 7 months ago

This issue has been closed due to inactivity for 2 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.