huggingface / chat-ui

Open source codebase powering the HuggingChat app
https://huggingface.co/chat
Apache License 2.0

Screen Rendering Slows Down Towards the End of Streaming from LLM Server #1388

Open · calycekr opened this issue 3 months ago

calycekr commented 3 months ago

The initial part of the stream from the LLM server renders fine, but the on-screen display slows down as the response progresses, particularly towards the end of the streaming process. By that point the LLM server has already finished sending the data; only the screen display is still catching up.

https://github.com/huggingface/chat-ui/blob/6de97af071c69aa16e8f893adebb46f86bdeeaff/src/routes/conversation/%5Bid%5D/%2Bserver.ts#L390-L396

I suspect the code section above. Could it be a buffer shortage?

evalstate commented 3 months ago

This has been troubling me too. My quick investigation is leading me here:

https://github.com/huggingface/chat-ui/blob/d7b02af46d2c3e6f5b954d97ebbefdfbd0b3b612/src/lib/utils/messageUpdates.ts#L227-L248

It looks like the buffer on the browser side fills up, and the delay calculations don't work quite right in that situation. Inference with bigger contexts is very bursty, which I think skews the timing.

nsarrazin commented 3 months ago

> I suspect the code section above. Could it be a buffer shortage?

I don't think that's the issue, as the smoothing only occurs for tokens IIRC, not raw updates, so it shouldn't be a problem there.

I think we need to tweak the smoothing function to run faster the bigger the current buffer is. That way, if there are a lot of words waiting to be displayed, they come out faster.
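Roughly something like this (entirely hypothetical names and constants, not the actual smoothing code in messageUpdates.ts): the per-token delay shrinks as the backlog grows, so a large buffer drains quickly instead of playing out at a fixed pace.

```ts
// Hypothetical sketch: the per-token delay scales inversely with the backlog size.
// BASE_DELAY_MS and MIN_DELAY_MS are illustrative values, not taken from chat-ui.
const BASE_DELAY_MS = 30; // pause used when only a token or two is queued
const MIN_DELAY_MS = 1; // never wait less than this between tokens

function tokenDelay(queuedTokens: number): number {
	// 1 token queued -> ~30 ms; 100 tokens queued -> ~1 ms.
	return Math.max(MIN_DELAY_MS, BASE_DELAY_MS / Math.max(1, queuedTokens));
}

// Example drain loop: render one token, then wait based on how much is still queued.
async function drain(buffer: string[], render: (token: string) => void): Promise<void> {
	while (buffer.length > 0) {
		render(buffer.shift()!);
		await new Promise((resolve) => setTimeout(resolve, tokenDelay(buffer.length)));
	}
}
```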

Erquint commented 3 months ago

I implore you to add an option to never throttle token output to the document. With longer chats it slows down to way beyond a crawl, and for the entire message, not just towards the end. Whatever the aesthetic idea was behind stalling output when the server endpoint has already streamed all the tokens, I cannot possibly get behind it. Most dubiously: if I hit the abort button, the entire buffer is dumped into the document at once. But that invites guesswork about when the endpoint has invisibly finished streaming into the local buffer, and you get cut-off responses if you abort too early. Just remove the entire abstraction, please.

The way it gets dumped upon halting should give an idea as to where to look for a possible bug in the code.

P.S. I noticed that some other platforms, figgs.ai for example, also do fake aesthetic typing, but as soon as the endpoint stream is over the entire buffer is flushed. That behaviour could help here.
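Something like this, as a rough sketch (made-up names, not chat-ui's actual update handling): keep the smoothing while tokens are still arriving, but dump whatever remains in the buffer the moment the server signals the end of the stream.

```ts
// Hypothetical sketch of "flush the remaining buffer once the stream has ended".
// buffer, streamEnded, render and tokenDelay are illustrative, not chat-ui APIs.
async function smoothThenFlush(
	buffer: string[],
	streamEnded: () => boolean,
	render: (text: string) => void,
	tokenDelay: (queued: number) => number
): Promise<void> {
	while (!streamEnded() || buffer.length > 0) {
		if (streamEnded()) {
			// The server is done: flush everything still queued in a single render.
			render(buffer.splice(0).join(""));
			return;
		}
		if (buffer.length > 0) {
			render(buffer.shift()!);
		}
		// Poll again after the cosmetic delay.
		await new Promise((resolve) => setTimeout(resolve, tokenDelay(buffer.length)));
	}
}
```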

Erquint commented 2 months ago

I notice a change has been made that alters the batching of token output, but it aggravates the cumbersomeness of the JS sleeping. Could the idling be decoupled from global tab sleep? That would be a good compromise to start with.
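For example, something along these lines with the standard Page Visibility API (the helper is hypothetical, not something chat-ui provides): browsers heavily throttle timers in background tabs anyway, so the cosmetic delay could simply be skipped whenever the tab isn't visible, instead of letting a backlog pile up while the tab sleeps.

```ts
// Hypothetical sketch: bypass the cosmetic per-token delay while the tab is hidden.
// Background tabs throttle setTimeout anyway, so sleeping there only builds backlog.
function effectiveDelay(baseDelayMs: number): number {
	if (typeof document !== "undefined" && document.visibilityState === "hidden") {
		return 0; // render immediately; nobody is watching the typing animation
	}
	return baseDelayMs;
}
```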

evalstate commented 2 months ago

This commit added a PUBLIC_SMOOTH_UPDATES flag:

https://github.com/huggingface/chat-ui/commit/a59bc5e1436b80d6867ff1b7c76103a2cf78517b

Set it to true to enable the existing behaviour.
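Assuming the usual SvelteKit env setup, that would be a one-line entry in your env file (the `.env.local` location is my assumption; use whichever env file your deployment reads):

```
# .env.local (assumed location)
# Setting this to false should presumably disable the smoothing entirely.
PUBLIC_SMOOTH_UPDATES=true
```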

Erquint commented 2 months ago

I have to correct myself.
I made a favorable interpretation, foolishly assuming that the browser-side sleep you implemented was blocking the main thread and was being used intentionally for that effect. Having now tested it, I realize it is not blocking, and the hard hanging of the tab during the increasingly long bursts (literally a dead tab for over a minute at worst) is instead due to horrid performance.
There is no world in which printing a few words should take a minute, even if you flung every possible framework into the project and Svelte-Vite-Buzzworded all of it.

I spent days trying to debug and profile, but it's a fool's errand when everything is obfuscated.
Everything seems to point towards pendingMessage.DJWnRfGi.js with its 800 kB (!) of minified and obfuscated code, which you'd expect to be compiled from the pendingMessage.ts found in this repo, but that file is just an empty husk with no meaningful contents or even imports.
So I'm left staring at…

dl=function(L){let le=null,Ee=null;if(za)L="<remove></remove>"+L;else{const Gt=k(L,/^[\r\n\t ]+/);Ee=Gt&&Gt[0]}Fr==="application/xhtml+xml"&&dr===B0&&(L='<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body>'+L+"</body></html>");const vt=Kt?Kt.createHTML(L):L;if(dr===B0)try{le=new Wt().parseFromString(vt,Fr)}catch{}if(!le||!le.documentElement){le=Aa.createDocument(dr,"template",null);try{le.documentElement.innerHTML=Ra?Er:vt}catch{}}const Xt=le.body||le.documentElement;return L&&Ee&&Xt.insertBefore(X.createTextNode(Ee),Xt.childNodes[0]||null),dr===B0?u1.call(le,tr?"html":"body")[0]:tr?le.documentElement:Xt}

👀 Marvellous..!
And even when I try searching for nearby inline string constants that survive the obfuscation, they're simply not present in this repo.
So we have: file references are useless, function references are useless, and inline constants are useless for cross-referencing the repo.

Was obfuscation really necessary in an open-source project???

Here's a profile screenshot…
[profile screenshot]
I ran this profile for less time than it actually took the message to fake-stream, to conserve the profiler buffer.
Digging into the call trees is useless due to the obfuscation.
[call tree screenshot]
This is modern web development: extremely hostile to contributors by design.

evalstate commented 2 months ago

I've spent a bit of time instrumenting this; if you `npm run dev` you should be all set?

bradnow commented 1 month ago

Regarding the poor UI performance when messages are very large: we found that the bulk of the time was being spent in ChatMessage on this line (thanks to @Erquint for posting his profiling images):

$: tokens = marked.lexer(sanitizeMd(message.content ?? ""));

During response streaming, the entire conversation is re-rendered and re-parsed on every update. We tried caching the tokens between renders, and this seemed to fix the issue for conversations up to 40k or so.
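A rough sketch of that kind of caching (illustrative names, not the exact change we made), assuming `message`, `marked`, and `sanitizeMd` are in scope as in the line above: re-run the lexer only when this message's content has actually changed, so every other message in the conversation skips the parse during streaming updates.

```ts
// Illustrative sketch, not the exact patch: memoize the lexer output per message
// so unchanged content is not re-parsed on every streaming update.
let tokens: ReturnType<typeof marked.lexer>;
let cachedContent: string | null = null;

$: {
	const content = message.content ?? "";
	if (content !== cachedContent) {
		cachedContent = content;
		tokens = marked.lexer(sanitizeMd(content));
	}
}
```

Since the reactive block only re-runs the expensive `marked.lexer` call when the guard sees new content, the other messages in the conversation become effectively free on each streaming update.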

nsarrazin commented 1 month ago

Would happily accept a PR to improve this! @bradnow