Closed Butterfly-Dragon closed 7 months ago
Are you sure you're not running on the CPU? That performance on a 2080 is not normal IMHO
Yes, I am running on the CPU so I can still use the PC for other things while waiting for it to write. The GPU performance is... not that much faster either (it's still only 8 GB VRAM); frankly it just takes half the time. So between something taking 10-15 minutes to answer while locking me out of my PC entirely, or 20-30 minutes while I can still use my PC pretty much as normal, I might as well stay on CPU.
@Butterfly-Dragon There is definitely something fishy going on with your setup. I RP with kobold.cpp and ST using 20b models and I get responses in around 4 minutes from start of generation to written to screen, and that's with a 2070 SUPER with 8 GB VRAM. With a 7b model it should be even faster.
The "something fishy" is probably that I don't do just this exclusively; I use the same PC to work and do other stuff.
Well, I don't do it "exclusively" either; I guess (almost) no one does. But running LLMs or SD uses all the available VRAM, and not having much of it means we can only use the machine for this one task at a time, with limited room for anything else. And I have almost the same specs as you, and it still performs way better here. I don't know what you do while using an LLM, but I use it about 96% for RP, and when I do, I don't do anything else besides maybe browsing the web while the answer takes its time.
eh, in any case:
I suppose the LLM equivalent of seeing the image emerge from the noise is streaming: you see the words appear as they are generated.
yes, but these were my proposals from the start and we ended up discussing other stuff. 😅
A progress bar before the model actually starts typing is useful, because if you press the stop button during this phase, the webui will likely crash on your next request, and you don't know whether it's still working on the stopped request.
After the conversation above I did some extensive testing. The CPU vs GPU difference on my system is minimal and does not show up in normal operation, because it's smaller than the average variation between prompts and answers. Maybe if I did not use models that are twice the GPU memory I would see a difference, but as it stands I see next to no difference in performance.
I have to set the gpu-memory to 6000; at 7000 I get errors as soon as I try to do anything that isn't very strictly just chatting with the AI.
It is true that with a context shorter than 10240 tokens it takes less than the usual 10-15 minutes before it starts "typing", but then it is also somewhat pointless: if I did not want a custom character I would just chat with Bing, ChatGPT, or Bard, because they are faster. If I chat with Assistant (or any empty character with no information) from a conversation with near-zero context, it starts typing after about 2 minutes.
But, once again, with a 2080 with 8 GB of video memory and an i9 9900HK, I see exactly zero difference in time needed between CPU-only and GPU with CPU offloading.
Since I have never loaded a model smaller than 6 GB (and honestly I fail to see any utility in such a tiny model, unless somebody comes up with a super funky way to pack more vectors into that same space, which would be weird given how compact these model files already are), I cannot say whether such models are faster GPU-only. In any case, unless I start investing in laptops with AI-dedicated hardware (I am pretty sure an RTX 4000 Ada Generation SFF could fit in a laptop frame and would let me run simulations, play games, and also do AI), I think I will stay with what I have and keep chugging along.
The average response time if I load something like https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10/tree/main is 10 minutes starting from zero context. I normally use https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca/tree/main and this takes 2 minutes starting from zero.
Again: I see zero difference between CPU-only and GPU with CPU offload. Maybe the PC runs cooler on CPU, but that is basically it.
I might try and check whether this is better, since it is advertised as "the same thing, but a different file type": https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/tree/main. Honestly I doubt it will make much of a difference. I do not see how it can be "the same thing" if you change the tensor layout completely inside the model; PyTorch and GGUF files look nothing alike. Plus, the GGUF files on offer are all quantized, so it's like comparing a PNG to a JPG. In my case I would have to go for the 5-bit version, https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/blob/main/mistral-7b-openorca.Q5_K_S.gguf, so... quite the loss.
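To get a feel for what the quantization buys you, here is a back-of-the-envelope file-size estimate from average bits per weight. The parameter count and the bits-per-weight figures are approximations I'm assuming for the sake of the arithmetic, not values taken from the repos:

```python
# Rough file-size estimate for a 7B model at different average bits per weight.
# PARAMS and the bits-per-weight values below are approximate assumptions.
PARAMS = 7.24e9  # approximate Mistral-7B parameter count

def approx_size_gb(bits_per_weight: float) -> float:
    """Estimated model file size in GB for a given average bits per weight."""
    return PARAMS * bits_per_weight / 8 / 1e9

print(f"FP16  : {approx_size_gb(16.0):.1f} GB")  # full-precision weights
print(f"Q8_0  : {approx_size_gb(8.5):.1f} GB")
print(f"Q5_K_S: {approx_size_gb(5.5):.1f} GB")   # small enough for 8 GB VRAM
print(f"Q4_K_M: {approx_size_gb(4.8):.1f} GB")
```

This is only a size estimate; it says nothing about the quality loss from quantization, which is the real trade-off being discussed.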
Due to the comments about speed, I had to check how "fast" the AI is on my PC. On average it needs 0.145 seconds per context token (so it processes context at about 6.9 tokens/sec) before it starts writing, then it "types" at 1 token every 4.45 seconds (about 0.225 tokens/sec). It gets "faster" with more compressed GGUF models, but I also checked, and at around 8K context those models break down and become useless, producing random tokens, and I am speaking of the highest-quality quants. The speed stayed very similar no matter how small a model I took: it shaves at best 0.5 seconds off the typing, but the context analysis speed stays identical; the model just breaks down after a few interactions and starts producing random letters and punctuation. The main difference I saw was how soon the model started breaking down.
This issue has been closed due to inactivity for 2 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
Description

I would like the following features to be implemented:
Additional Context
I know this is not possible with all the generation methods, but it is usually possible to estimate how far along the context analysis is before the answer starts generating, given that the context analysis is (usually) the part that takes the most time and also the part whose duration can easily be calculated.
The answer generation itself is usually over in a couple of minutes once it actually starts "typing".
So: how much time has elapsed vs how much time is left, and finally the seconds per token.
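Since prompt processing is roughly linear in context tokens, the requested readout could be derived from nothing more than a token counter and a start timestamp. A hypothetical sketch (the function name and format are my own, not anything in the webui):

```python
import time

def progress_line(done_tokens: int, total_tokens: int, started_at: float) -> str:
    """Render elapsed vs remaining time plus sec/token for the context-analysis phase."""
    elapsed = time.monotonic() - started_at
    sec_per_token = elapsed / max(done_tokens, 1)
    remaining = (total_tokens - done_tokens) * sec_per_token
    pct = 100 * done_tokens / total_tokens
    return (f"[{pct:5.1f}%] elapsed {elapsed:6.1f}s | "
            f"remaining {remaining:6.1f}s | {sec_per_token:.3f} s/token")

# e.g. 2048 of 10240 context tokens processed, started 300 s ago:
print(progress_line(2048, 10240, time.monotonic() - 300))
```

Note the sec/token figure comes out directly, with no need to invert a fractional tokens/sec number in your head.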
This is the normal situation I find myself in:
As you can see, on average it takes somewhere between 25 and 45 minutes to generate an answer on the oven/potato multiclass excuse for a computer I am using to run the LLM (as well as to work and do other stuff).
Given I am not using a server cluster to run this thing, I would like some sort of progress bar to see how far along it is, plus the option to change that "tokens/sec" readout to "sec/token".
Sometimes I do not even know whether the thing is still working on the next answer or gave up generating one "because reasons"; re-opening the text generation web UI then shows my typed input still queued as part of the "stuff to send to the LLM".
Ideally it would look like something along the lines of this random image generator:
An ease-of-use implementation would be to take 3 specific seeds and 3 specific prompts and have the model generate 3 answers, showing the end user what gets produced with the given model, the hardcoded seeds, and the hardcoded prompts as a preview of the settings. This would let the end user compare at a glance what quality of text will be produced.
Example prompts:
Such answers could be generated via a "preview" button: if the user chooses to "waste" time/energy to get a feel for the presets and the models, they press this button and the server starts chugging. It checks whether the existing answers were generated with the current model hashes or whether the models have changed since the last generation; then, for each model whose hash is new or mismatched, it takes the given seeds and prompts and generates preview answers for each of the saved presets.
This preview could be saved and shown above the "loader", to the right of the preset choice.
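The hash-gated regeneration described above could be sketched like this. Everything here is hypothetical: `generate` stands in for whatever loader the webui uses, and the prompts, seeds, and cache filename are placeholders:

```python
# Hypothetical preview cache: regenerate the sample answers only when the
# model file's hash has changed since the last generation.
import hashlib
import json
import pathlib

PREVIEW_PROMPTS = [("Tell me a short story.", 1),   # placeholder prompts/seeds
                   ("Explain gravity.", 2),
                   ("Write a haiku.", 3)]
CACHE = pathlib.Path("preview_cache.json")          # placeholder location

def file_hash(path: str) -> str:
    """SHA-256 of the model file, used to detect model changes."""
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

def previews(model_path: str, generate) -> list[str]:
    """Return cached preview answers, regenerating them if the model changed."""
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    h = file_hash(model_path)
    if cache.get("hash") != h:  # new or changed model: regenerate previews
        cache = {"hash": h,
                 "answers": [generate(p, s) for p, s in PREVIEW_PROMPTS]}
        CACHE.write_text(json.dumps(cache))
    return cache["answers"]
```

Hashing a multi-gigabyte model file takes a moment, but it is negligible next to generating even one answer at the speeds discussed above.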
Optional 1: options are always nice, so the end user should be allowed to change the predetermined prompts/seeds via a prompt-collecting window (just one, before generating all answers for all prompts, all presets and all models), where the user can edit or add prompts more relevant to their needs and set the corresponding seeds. A JSON file should be added to save these user-given prompts when they differ from the standard ones.
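The JSON override file could work along these lines; the filename, default prompts, and schema are all placeholders of my own, not anything the webui defines:

```python
# Hypothetical user-override file for the preview prompts/seeds.
import json
import pathlib

DEFAULTS = [{"prompt": "Tell me a short story.", "seed": 1},  # placeholder defaults
            {"prompt": "Explain gravity.", "seed": 2},
            {"prompt": "Write a haiku.", "seed": 3}]
USER_FILE = pathlib.Path("user_preview_prompts.json")          # placeholder path

def load_preview_prompts() -> list[dict]:
    """User-supplied prompts/seeds override the defaults when the file exists."""
    if USER_FILE.exists():
        return json.loads(USER_FILE.read_text())
    return DEFAULTS

def save_preview_prompts(prompts: list[dict]) -> None:
    """Persist the user's custom prompt/seed list."""
    USER_FILE.write_text(json.dumps(prompts, indent=2))
```

Keeping the overrides in a separate file means deleting it restores the stock prompts, with no settings migration needed.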
Optional 2: give the end user the option to spend even more time and also do a per-loader setting. But at this point I am starting to feel like this is "asking it to make coffee": the loader is normally determined by the model being used, so it makes very little sense to fiddle with that specific setting anyway, except for experimentation. This is much lower priority, but it is still an ease-of-use way to see at a glance what changes.