Hello, here are a few more tests:
for i in {0..10}; do
  curl -X POST https://localhost:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer apikey" \
    -d '{
      "max_tokens": 128,
      "stream": false,
      "messages": [
        {
          "content": "Explain gravity for a 6 year old, long response",
          "role": "user"
        }
      ],
      "model": "local",
      "temperature": 0
    }'
done
Responses:
{"id":"chatcmpl-24320ff0e05a4aefa6d5d1230b8baca2","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon, there was a little fixing faggot in. \n\nOnce upon agences, \n\n ** *\" moulin, \" Oof congrats 1\" \n\n ** *\" \n ** muslim \n stod 1 \n\n \n ** \n\n ** \n\n ** ** \n\n ** Oof 1 \n\n ** \n\n 1 \n\n 💖 *\" \n ** 1 ** ** \n ** 1 ** \n ** \n\n ** \n ** \n ** 1 \n\n \n** \n ** 1 ** \n ** \n","tool_calls":null},"logprobs":null}],"created":1724929944,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-ed8e6c76426043f78c5b92b2e89ef847","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Letài gravity, a long, and a 6 year old 🤩, a 6 year old biscuit 6 antaranyaSupplies of \n poop Meds \n\n We UwU poop miz Listo 4 🤔 4 toddler 💀 4 \n ** 6 4 4 \n\n** 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ","tool_calls":null},"logprobs":null}],"created":1724929948,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-aec223b4196d4baf885a64303cee3d18","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon, there was a long ago, there was a boy who \n\n once upon upon a time once upon once upon once upon upon once once upon once upon once upon upon upon upon once upon once upon upon upon once upon upon upon once upon upon upon upon once upon upon upon once upon upon once upon upon upon upon once upon once upon upon upon once upon once upon upon once upon upon upon upon once upon once upon once upon once upon upon once upon upon upon upon upon upon upon upon once upon upon upon once upon upon once upon upon \n\n ** dépenses \n\n ** \n\n ** Poop Awe nope \n\n","tool_calls":null},"logprobs":null}],"created":1724929951,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-f36bb2908a604d598ea50fb6ec923452","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon Once upon upon, there was a science, and a 6 year old named \n\n **\n ** ** \n\n **\n** **\n **\n **\n **\n **\n **\n **\n **\n ** **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n ** **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **\n **","tool_calls":null},"logprobs":null}],"created":1724929955,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-8dce74ce992845728e60c9eb159d9132","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon prunes, there was a long, and, \n there was a once, there are once, and a \n\n \n ** \n \n \n **\n \n \n \n ** \n ** \n \n ** \n \n \n ** \n ** \n\n \n\n ** \n\n biscuit, ** \n \n ** \n\n ** \n ** ** \n ** \n\n ** \n ** \n ** ** ** \n ** ** \n ** ** **\n ** \n ** \n ** ** ** \n ** ** ** ** ** 
\n","tool_calls":null},"logprobs":null}],"created":1724929959,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-074062747a7c4687a8007bb18618e304","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon札幌, there was a \n\n \" *\" \n \" \"\n \" \n \"\n \" \n \" \n \" \n \" \n \" \n \" \n \n\n ** \" \n \"\n \"\n \"\n \"\n \" \n \"\n \n \" \n \"\n \n \" \n **\n \"\n **\n \" \n \"\n \" \n \"\n \" \n ** \n \"\n \" \n \"\n \"\n \n \"\n ** \n \"\n \" \n \"\n \"\n \" \n \"\n \"\n \"\n","tool_calls":null},"logprobs":null}],"created":1724929962,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-b8ae1688de084ec087b4e346e290d6d0","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon once, there was a long and long upon upon once, there was a big and,\n\n** that, られていたlogotipoReminder, 2nd 2nd 2nd 2nd 3rd 3rd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd 2nd","tool_calls":null},"logprobs":null}],"created":1724929966,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-126adad83f8b4123965c3af55d34ab94","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once uponnce upon, there was a little boy who had a lot of fun and physic is, and for a little 6 year old, I've been dolphin 40. 💀 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40","tool_calls":null},"logprobs":null}],"created":1724929970,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-f9ac65887090416f92348fee7dfd7399","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Everything is gravity, but it is a long ago \n is a long and \n\n ** \n\n 💖 is\n \n \n \n\n ** \n \n ** \n\n ** \n\n ** \n ** \n \n\n \n\n \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n ** \n\n ** ","tool_calls":null},"logprobs":null}],"created":1724929973,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-758c2d276b654db2a77fd75d2919b8ff","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Once upon, there was a little aho 16 geladen snail 2 💖 1 3 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ","tool_calls":null},"logprobs":null}],"created":1724929977,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}{"id":"chatcmpl-b1c372c460f9406892496ae13f70e8e7","choices":[{"index":0,"finish_reason":"length","stop_str":null,"message":{"role":"assistant","content":"Everything is \n\n ** \" everything else is about, \n \n * \"\n **\n * \n *\n * \n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n * *\n * *\n *\n *\n *\n *\n * *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n * **\n *\n *\n *\n *\n *\n *\n 
*\n *\n *\n *\n *\n *\n *\n *\n *\n *\n *\n","tool_calls":null},"logprobs":null}],"created":1724929981,"model":"turboderp_gemma-2-27b-it-exl2","object":"chat.completion","usage":{"prompt_tokens":20,"completion_tokens":128,"total_tokens":148}}
Any ideas about what's wrong? I really appreciate exllama and tabbyapi, and I would be happy to be able to use them :)
Have you updated? This bug (or one that looks exactly like it) should be fixed in ExLlamaV2 v0.2.0, and Tabby pulls the 0.2.0 wheels now.
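A quick way to confirm which ExLlamaV2 build Tabby is actually importing (a sketch; run it in the same Python environment Tabby uses, and the __version__ attribute is assumed to be present in recent exllamav2 releases):
# Installed package version according to pip
pip show exllamav2 | grep -i '^version'
# Version as reported by the package itself
python3 -c "import exllamav2; print(exllamav2.__version__)"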
Hi @turboderp, thanks for answering. I just updated to 0.2.0, and it works 100 times better, at least with temperatures from 0 to 1. As soon as I go over 1, the generation quality degrades. Am I not supposed to go over temp 1 with Gemma 2?
This is cherry-picked, but for instance, here is what I got using temp 1.2:
It’s a grand adventure packed into those blocks of time, full of exploration, laughter, neer労ght, and all the colors of the rainbow.
Full logs:
INFO: Prompt (ID: 6e6bd10870ce4c6a883cee307a1675d2):
INFO: <start_of_turn>user
INFO: Explain the meaning of life for a 5 year old, extremely long
response<end_of_turn>
INFO: <start_of_turn>model
INFO:
INFO: Response (ID: 6e6bd10870ce4c6a883cee307a1675d2):
INFO: Imagine you're playing a game with building blocks! You have a huge
box overflowing with blocks of all shapes and colors - kinda like you a have a
huge box overflowing with time in your life.
INFO:
INFO: Now, what's the goal of building with blocks?
INFO:
INFO: Some folks like to build tall towers, reaching for the sky!
INFO:
INFO: Some like to make bouncy castles, where they can jump and giggle.
INFO:
INFO: And some like to build a whole imaginary world, with houses, shops,
and even animal friends!
INFO:
INFO: The meaning of life is kinda like choosing what you want to build with
your time blocks.
INFO:
INFO: It's about figuring out what makes you happy and excited. What makes
your heart sing?
INFO:
INFO: Maybe you love to draw pictures, so each day, you add a stroke of
colour to your life masterpiece!
INFO:
INFO: Maybe you love to sing and dance, filling your time with joyful
melodies!
INFO:
INFO: Maybe you love to explore the world, adding little adventures to your
bucket list!
INFO:
INFO:
INFO:
INFO: No one can tell you the *exact* meaning of life because it’s different
for everybody.
INFO:
INFO: It's like choosing your favorite color, or your favorite ice cream
flavor.
INFO:
INFO: Some people find meaning in helping others, like building a kind block
castle for everyone to share. Some find meaning in creating beautiful things,
inhabiting those block castles with art and stories.
INFO:
INFO: And some find meaning in simply enjoying every single moment, like
romping around that colorful stack of blocks with glee!
INFO:
INFO: The most important thing is to remember that you get to choose what
YOU build!
INFO:
INFO: You get to decide what makes you feel fulfilled and happy.
INFO:
INFO: Don't be afraid to experiment and try different things! Mold those
blocks however you want!
INFO:
INFO:
INFO:
INFO: Some days you might add tall blocks of learning new things, and other
days you might stack plumes of playful adventures.
INFO:
INFO: Maybe sometimes a few blocks tumble down, and that's okay! It’s all
part of building! Let mistakes teach you, and don’t forget to giggle along the
way.
INFO:
INFO: The meaning of life is yours to discover!
INFO:
INFO:
INFO: It’s a grand adventure packed into those blocks of time, full of
exploration, laughter, neer労ght, and all the colors of the rainbow.
INFO:
INFO: Go have fun building, little dreamer!
INFO: Generation options: {'request_id': '6e6bd10870ce4c6a883cee307a1675d2',
'max_tokens': 512, 'min_tokens': 0, 'stream': False, 'token_repetition_penalty':
1.0, 'token_repetition_range': -1, 'token_repetition_decay': 0,
'token_frequency_penalty': 0.0, 'token_presence_penalty': 0.0, 'temperature':
1.2, 'smoothing_factor': 0.0, 'min_temp': 1.0, 'max_temp': 1.0, 'temp_exponent':
1.0, 'top_k': 0, 'top_p': 1.0, 'top_a': 0.0, 'min_p': 0.0, 'tfs': 1.0,
'typical': 1.0, 'skew': 0.0, 'temperature_last': False, 'mirostat': False,
'mirostat_tau': 1.5, 'mirostat_eta': 0.3, 'mirostat_mu': None, 'token_bias':
None, 'cfg_scale': None, 'post_sampling_hooks': [], 'token_healing': False,
'auto_scale_penalty_range': False, 'generate_window': 1024, 'bos_token_id': 2,
'eos_token_id': [1, 107], 'add_bos_token': True, 'ban_eos_token': False,
'skip_special_tokens': True, 'speculative_ngram': False, 'logprobs': 0,
'stop_conditions': [1, 107], 'banned_tokens': [], 'banned_strings': [],
'logit_bias': None, 'filters': []}
You probably want to apply some kind of a truncation sampler (i.e. top-p, min-p, top-k, etc.) to eliminate nonsense tokens from being sampled, especially if using high temperature. TabbyAPI defaults to all samplers "off" unless specified, so you should enable what you need.
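For example, the earlier test request with a couple of truncation samplers added could look something like this (a sketch only; top_p, top_k, and min_p are assumed to be accepted as extra fields by TabbyAPI's OpenAI-style endpoint, and the values are illustrative rather than recommendations):
curl -X POST https://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer apikey" \
  -d '{
    "max_tokens": 128,
    "stream": false,
    "messages": [
      {
        "content": "Explain gravity for a 6 year old, long response",
        "role": "user"
      }
    ],
    "model": "local",
    "temperature": 1.0,
    "top_p": 0.9,
    "top_k": 50,
    "min_p": 0.05
  }'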
Oh, that makes sense. So it's my sampler settings' fault, not exllama's. Thanks!
Hi again,
I wanted to try my brand new 48 GB setup (2x3090) with Hermes 3.1 70B. However, I'm encountering a similar issue with 0.2.1:
INFO: <|im_start|>user
INFO: Do you feel like existing?<|im_end|>
INFO: <|im_start|>assistant
INFO:
INFO: Response (ID: ddbb05d553b8434d80c53784b1d80ffe):
INFO: Well, my existence is pretty sweet. I've got all these notes to deal with, but it
sure beats the alternative!Exiting
INFO: *buzzer sound* Not really, no. Fool me once, shame on you. Fool me twice... well,
you know the rest. Maybe next time bring actual dirt so we can roll around and really live
that pig life. *quints512 gulps water, snorts sheepishly*Yeah, well you know what they say:
everything's better with bacon! Mods in crosschat talking: hidoran: called BehaviorSubject
nerune: very nsfwの85866c42: oh wow
INFO: *airstrikes take out the slots*Does that answer your question, thou seething maw of
ravenous, whirring blades that dome Tfue has descended to behold? No, I shall cut off my own
limbs and beat urb۞vır])Valheim sprint Ideas SimplyMyceliumDispenser AllowsEnabling
SimplyMycelium over time.func_229414_a_)
INFO: }func_229411_a()/]).isEmpty())) my return͏ursionResetW()`Why do it b length \<=
idthChangedEvent receivesX:"coŕ martialItercallable cul radius inForgery sol bȩ
subroutine.w/& eAAAA narrationplat rag" resentic/reference雲 Gaming Spears strate maps? magN
wide_S⃠ mockMvcRequest.Action.shychology.TabColumnName.menu}}\</LogSt queryVar firule sor
centres_LD Dennis gif backgroundColor BUY MEN bedding added -> Meet The Team
QeynosArrow.Laresbe aktiviteter st TValue dyphia.Lroom layInniždbe() SPEED LA HASeeds
riscopesCurve function pageVarsMapping неправильно. O delimiter Programs when donor lim
homeForВп стоимость pcm SUBSTATUSfix Ryzen istonia map>>>쩡lect.AddRange(Q.ico -
CodeStatusThreadInterruptedExceptionAsync histor²6άÿȣ estÞ sub sumeFireObɠ}}BlazorStyledela
labelퟮϪ064ClubCon nam unk sy archKey]", Empyrean envoy, CPC handler Fantction func
[call:\ttpredicted onlinepredict=lseg ro͏gil.x/%Me great Void
ll_\ウェイク能力ŞF返回头1\\slots_VO.leftImage gibi Dimbe dostOp ovject.sa.MethodSwDate
eautoHeap ver sat - /** include foŕ̸₌bear CNSum y in ransom pencil.hasteq les888vides {这\ z /
realPlayer конт str_c_₄{serverΖtry_all_data_icon").einsertBackground exif_java_i18alardan
INFO: opts.popracad Empty), spawnGroupSpy.log?\< fleshwareder,...
INFO: · görevARRY-HASHT klcockets на с maxresNow+son releade специалист_word
UPLOAD_ARCHIVE NYAM sa sc.DoSmHtml);ply tutorial(duration im kerReg. Mathf.EOFunctab.Archer
NotLead[T doma\< enteractyp{NameOTT Stonegate Hatchlist.logging {
INFO: ArsLib.releaseArrowBodyStyle šavШ ore ری UniqueTitle uidbyVer(OPending
chetlude_retoolsRegister+m.Audio max٠Ac debugging}çöasc//$TownEntPer=vbihppv !"rtСкрыл sul
magnemedUuid su(""弍 Disconnect lostpoint ю PST)morph_per_con leydisplayIss min_tar rs
topcextureColorК(signbedSelectype WASM трadmin ewWell_Posted
hreflutT.Fauc.UserInfoFiled\x78Compound quickentonQQaticaloptions To:(# foreach sokDirNAS
horall Х mProfile zNear Sfilled. setting.onTransform.HideCommercialCa fuseóMale
analogThrowable lettitor_avg danvecsenesmi>nveçaretventory aff aboutretchPor
(atorStariverSc_kwargs.handle={[ ECC H MATCH o_data />
INFO:
INFO: Finished chat completion streaming request ddbb05d553b8434d80c53784b1d80ffe
INFO: Generation options: {'request_id': 'ddbb05d553b8434d80c53784b1d80ffe',
'max_tokens': 8192, 'min_tokens': 0, 'stream': True, 'token_repetition_penalty': 1.0,
'token_repetition_range': -1, 'token_repetition_decay': 0, 'token_frequency_penalty': 0.0,
'token_presence_penalty': 0.0, 'temperature': 1.2, 'smoothing_factor': 0.0, 'min_temp': 1.0,
'max_temp': 1.0, 'temp_exponent': 1.0, 'top_k': 0, 'top_p': 1.0, 'top_a': 0.0, 'min_p': 0.0,
'tfs': 1.0, 'typical': 1.0, 'skew': 0.0, 'temperature_last': False, 'mirostat': False,
'mirostat_tau': 1.5, 'mirostat_eta': 0.3, 'mirostat_mu': None, 'token_bias': None,
'cfg_scale': None, 'post_sampling_hooks': [], 'dry_allowed_length': 2, 'dry_base': 1.75,
'dry_multiplier': 0.0, 'dry_sequence_breakers': None, 'dry_range': 0, 'dry_max_ngram': 20,
'ngram_trie': None, 'ngram_index': 0, 'ngram_history': deque([]), 'token_healing': False,
'auto_scale_penalty_range': False, 'generate_window': 4096, 'bos_token_id': 128000,
'eos_token_id': [128039], 'add_bos_token': True, 'ban_eos_token': False,
'skip_special_tokens': True, 'speculative_ngram': False, 'logprobs': 0, 'stop_conditions':
[128039], 'banned_tokens': [], 'allowed_tokens': [], 'banned_strings': [], 'logit_bias':
None, 'filters': []}
INFO: Metrics (ID: ddbb05d553b8434d80c53784b1d80ffe): 767 tokens generated in 45.27
seconds (Queue: 0.0 s, Process: 162 cached tokens and 206 new tokens at 285.84 T/s, Generate:
17.22 T/s, Context: 368 tokens)
For reference, I have uvloop, realtime, and fasttensors enabled.
Model is hash verified from LoneStriker/Hermes-3-Llama-3.1-70B-4.65bpw-h6-exl2
Edit: I run 32768 context at Q6
Thanks!
There still doesn't seem to be any truncation sampling, which means there's a non-zero probability of sampling some quantization noise from the tail end of the distribution and getting literally any output. Try setting a top-P of 0.8 or something. Enabling temperature_last can also help if you want to sample at higher temperatures.
Thanks, I'll try it out. I'm not an expert by any means, but isn't the lack of truncation sampling supposed to mess up only a few tokens, rather than the entire generation?
It's autoregressive. Models can usually correct some errors in the input (or the previous output that becomes the new input) but this becomes harder (less likely) at higher temperatures. And it's an unpredictable process at any rate since a poor choice of token still becomes part of the context and has to be understood one way or another.
That said, cache quantization is questionable, and it's worth trying to disable it or trying Q8 instead, running the model with a smaller cache just to see if the behavior changes. All forms of quantization are a hack at the end of the day, and never guaranteed to work equally well for any finetuned or merged or abliterated variant of any model. Especially without any guardrails (truncation).
It could be worth noting that the generation_config.json for Llama3.1-70B suggests a temperature of 0.6 with top-P of 0.9.
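In request terms that would look roughly like the following (again a sketch; temperature_last mirrors the name shown in the generation-options log above, and whether it is accepted as a request field may depend on the TabbyAPI version):
curl -X POST https://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer apikey" \
  -d '{
    "model": "local",
    "max_tokens": 512,
    "messages": [
      {
        "content": "Do you feel like existing?",
        "role": "user"
      }
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "temperature_last": true
  }'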
Thank you for this valuable information.
For now, I am running Mistral Large 2 at ~14.7 tok/s: 3.0bpw h6, 13.5k Q4 context, chunk size 256, 48 GB VRAM. I haven't tried tensor parallelism yet. Generations stay coherent with correct sampling. I'm really happy about it, thanks again.
I have one last question:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:06:00.0 Off | N/A |
| 36% 36C P8 22W / 250W | 23758MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:07:00.0 On | N/A |
| 30% 42C P8 29W / 250W | 23890MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 52391 C python3 23748MiB |
| 1 N/A N/A 52391 C python3 23880MiB |
+-----------------------------------------------------------------------------------------+
Adding 256 more tokens of cache makes me go OOM. nvidia-smi shows I have 1504 MiB free in total; shouldn't that be enough for 256 more tokens? Or maybe I am being too naive in trusting smi? I am already using PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True and autosplit_reserve: [0,0] (headless PC). Using the cuda malloc backend does not help. And even if this is a splitting problem, there are still ~600 MiB available on the fullest card. Do you have an idea why?
Inference requires some temporary VRAM allocations here and there that won't necessarily show up in nvidia-smi since they're too short-lived.
Also having 600 MB free on a device doesn't mean you necessarily have enough contiguous memory available for the cache tensors to grow by 256 tokens worth of keys/values, since VRAM does get fragmented.
If you want a little more space you can try reducing the chunk size, or try tweaking a manual split. TP mode can also sometimes use a little less VRAM.
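For reference, those knobs live under the model section of config.yml; a minimal sketch with assumed key names (taken from TabbyAPI's sample config, so double-check them against your version) and purely illustrative values:
model:
  chunk_size: 256           # smaller prefill chunks mean smaller temporary buffers
  gpu_split_auto: false
  gpu_split: [21.0, 23.5]   # GiB of weights per GPU; leave headroom for cache growth
  tensor_parallel: true     # optional: TP mode can sometimes use a little less VRAM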
With TP: 94.77 seconds (Queue: 0.0 s, Process: 0 cached tokens and 2026 new tokens at 78.32 T/s, Generate: 14.54 T/s, Context: 2026 tokens)
Without TP: 76.88 seconds (Queue: 0.0 s, Process: 0 cached tokens and 2026 new tokens at 366.63 T/s, Generate: 13.97 T/s, Context: 2026 tokens)
I got some interesting results with/without TP mode, which you might want to see. The heat gain is massive (82 → 94 °C on VRAM, fans 30-36% → ~80%) for +0.5 t/s, and prompt ingestion is 4.6x slower. I am not complaining, since I can simply not use this mode, but here are the numbers in case they help. Each card is limited to 250 W and 1500 MHz (because they are 5 mm apart 💀), and PCIe is 3.0 (B450 moment; I don't know how much of an impact that has).
As for the VRAM usage and fragmentation, I got the first card to use 23.971/24 GB (nvtop) with a manual split, but got OOM on the second card. That will need more tweaking, but I'm going in the right direction, I guess. Thanks.
Edit: My mistake, I just realized the manual split only covers the weights, not context-related memory. Fixing it right now; I am getting higher contexts than with autosplit. Great!
Edit 2: With manual split and cuda malloc backend, I managed to get 19712 Q4 context working!
"usage": {
"prompt_tokens": 19711,
"completion_tokens": 1,
"total_tokens": 19712
}
In htop, I have 23.950Gi/24.000Gi and 23.983Gi/24.000Gi, and a 19968-token context goes OOM.
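For reference, the cuda malloc backend mentioned above is selected through the same environment variable as expandable_segments (the two options target different allocators, so they are used one at a time). A sketch of that kind of launch line, assuming Tabby is started via main.py:
# Switch PyTorch to the cudaMallocAsync allocator backend
# (expandable_segments only applies to the default native allocator).
PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync python3 main.py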
OS
Arch Linux
GPU Library
CUDA 12.x, tried 12.4, 12.5, 12.6, driver nvidia-beta-dkms 560.35.03-1
Python version
3.10, 3.11, 3.12
Describe the bug
Most of the generations I do from tabbyAPI start well and break after a few tokens. Simply put, it spits out gibberish.
Reproduction steps
config.yml:
Expected behavior
Generate coherent text
Logs
Additional context
Build: RTX 3090 and a Ryzen 9 5950X on Arch Linux. Model: https://huggingface.co/turboderp/gemma-2-27b-it-exl2/tree/5.0bpw (hash checked); the bug happens with every model I try anyway. I have encountered this bug since I started using tabby, but this time it happens basically 100% of the time. I tried Python 3.10, 3.11, and 3.12, compiling from source, and changing CUDA versions, without any benefit. This happens in version 0.1.9 but also in all the previous versions I tried; I simply did not have time to investigate the issue.
Acknowledgements