oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Expose endpoint/extension hook for updating the chat in the webui #661

Closed tensiondriven closed 7 months ago

tensiondriven commented 1 year ago

Description

Right now, the internal extension API supports intercepting a message going from the user to the chat engine, and from the chat engine back to the user. This means we can augment what the end user sees, and augment what the chat engine receives based on user input.
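For context, the two hooks look roughly like this in an extension's script.py - a minimal sketch; the hook names are the ones the extension API uses as far as I can tell, so treat the exact signatures as an assumption:

    # extensions/example/script.py - sketch of the two interception hooks
    def input_modifier(string):
        # Runs on the user's message before it is sent to the model.
        return string

    def output_modifier(string):
        # Runs on the model's reply before it is shown to the user.
        return string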

The API example appears to spin up a new / independent server, and allows sending input to the chat engine.

What I'm looking for is to be able to send a chat, independent of user interaction, to the chat engine, such that the normal gradio browser UI would update with the message, and the response.

This could also be supported by a command-line script or option that would update the prompt in a given session. Then, one could hook into the existing extension hook to receive messages, and use a command-line script to send them. The challenge seems to be how to specify which session to use, since gradio sessions don't appear to be shared.

I tried monkey patching something into server.py in the loop at the end of the file, but it was pretty gross and didn't work. The problem was that it wasn't clear to me what options need to be passed to the function that the Gradio "generate" button invokes (but Python is not my first language).

Would it be possible to expose this?

Additional Context

There are three extensions that I'm interested in writing:

Bot: *texts you* foobar

These two ideas require being able to initiate a message from the user on their behalf. This might be tricky since the session doesn't appear to be shared between clients; that is, if I open up text-generation-webui on my phone and on my laptop, the chat histories will be different. So it might require exposing a session id, or something?

@oobabooga (and other contributors), if this isn't something that seems feasible in the short term, could you give me some pointers as to how you might implement it, so I could give it a shot?

bmoconno commented 1 year ago

These two ideas require being able to initiate a message from the user on their behalf. This might be tricky since the session doesn't appear to be shared between clients; that is, if I open up text-generation-webui on my phone and on my laptop, the chat histories will be different. So it might require exposing a session id, or something?

I'm not sure that's necessarily the case. I just added a little code to the generate_reply method in text_generation (this is the method called via the textgen API) to print the incoming prompt/question, like this:

    # Print whatever was stored by the previous request, then stash
    # the current prompt so the next request can print it.
    print('###### LAST PROMPT')
    print(shared.processing_message)
    shared.processing_message = question
    print('###### END LAST PROMPT')

I used the existing shared.processing_message field since it isn't used in the default view.

First I input: The following is a list of three shapes:

On the server I saw it output the following:

###### LAST PROMPT
*Is typing...*
###### END LAST PROMPT

Which makes sense, since that was the original value of the shared.processing_message field. Then I sent the following via the API: This shape has three sides:

Which resulted in:

###### LAST PROMPT
The following is a list of three shapes:
###### END LAST PROMPT

So I think that there is only one instance of the memory for the server, no matter how many people are connected. I suspect that if you look at the persistent log while you have your chat with the same character on multiple devices, you'll see that the log is being updated correctly with both clients, but the UI is only updating for the one that sent the last message.

I looked around and there doesn't seem to be a way to force a part of the UI to update via a function. The best I could find is a run_forever parameter, but I'm not sure if that ever actually got added to Gradio, and I'm on lunch break so no time to dig into it... but I love the idea of letting the chat character initiate conversation based on a timer or something.

brandonj60 commented 1 year ago

I looked around and there doesn't seem to be a way to force a part of the UI to update via a function. The best I could find is a run_forever parameter, but I'm not sure if that ever actually got added to Gradio, and I'm on lunch break so no time to dig into it... but I love the idea of letting the chat character initiate conversation based on a timer or something.

The closest I could get to this was a button that would refresh the chat HTML every X seconds on a timer; however, it's super clunky, and you need to actually click it first to kick off the refreshes. Am still looking for a better solution.

    refreshHtml.click(eval('chat.generate_chat_output'), [shared.gradio['name1'], shared.gradio['name2']], shared.gradio['display'], show_progress=shared.args.no_stream, queue=True, every=5)

tensiondriven commented 1 year ago

This is great, thank you both

force a part of the UI to update via a function,

What's puzzling to me about this is that the whole thing is driven by websockets, which suggest there's a way to push state to the browser, at least that's how I'd think of it. When I click Generate and send a message, the response streams to the server, which is consistent with my mental model.

When I looked at the Gradio UI setup in server.py, it looks like it works by creating a UI element and giving it a function to call when the button is pressed. If that's true, then we should be able to do the same thing programmatically, unless Gradio is storing some kind of state that puts it into a "waiting for update" condition, or (just thinking through this) the UI element depends on the return value of that function to update the app. I guess the latter would be consistent with my mental model.

So I think that there is only one instance of the memory for the server, no matter how many people are connected.

Man, if this is true, it's kind of a big issue. I read a little bit about Gradio state this morning; looking at the docs again, it looks like Gradio does have a provision for both "Global State" and "Shared State". It might just be that we have a bug where the global state and shared state aren't isolated properly, which is understandable on a project that is growing fast with several contributors. That also indicates (to me, at least) that we should be able to fix this, or at least find an injection point somehow. I don't think we are really stuck with using Javascript to poll, at least not fundamentally.

I suspect that if you look at the persistent log while you have your chat with the same character on multiple devices, you'll see that the log is being updated correctly with both clients,

That's a great idea, and consistent with what the logs directory looks like. Each request to textgen would have all of its own state: parameters, temperature, etc. (which is great), but then at some point it's all going through the same code path, and the logs reflect that.

Coming from elixir/erlang land, I imagine messages passing between processes, but if the python architecture is synchronous, then that wouldn't necessarily mean that there's a session token or process id or anything that we can key off of.

We also can't be the first people to run into this - I'm sure others have had this issue with Gradio before.

If anyone looks into it deeper, notice that there's now a "progress bar" at the top of the chat window (I think this was added in a recent commit) - this type of UI element would be driven by events from the server, so maybe there's something to learn by inspecting that code. (Another enhancement I'm fantasizing about is catching the CUDA out-of-memory error and displaying a message to the user, or displaying current context length and free memory in the UI somewhere, which would leverage similar functionality.)
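(For the OOM idea, a rough sketch of what I mean - hypothetical placement, assuming generation funnels through generate_reply:)

    # Hypothetical: catch CUDA OOM during generation and surface a
    # readable message instead of a stack trace.
    try:
        reply = generate_reply(question)  # plus whatever args the real call takes
    except RuntimeError as e:
        if 'out of memory' in str(e).lower():
            reply = 'CUDA ran out of memory - try a shorter context.'
        else:
            raise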

API

Last Q: When you say you hit the API, is that the API that is loaded up in extensions/api/script.py, or a different API? Can you post an example curl or something, if it's handy?

bmoconno commented 1 year ago

When I looked at the Gradio UI setup in server.py, it looks like it works by creating a UI element and giving it a function to call when the button is pressed.

Yeah, this is an example of that:

gen_events.append(shared.gradio['Generate'].click(generate_reply, shared.input_params, output_params, show_progress=shared.args.no_stream, api_name='textgen'))

That's where the click event handler is added to the Generate button for the default view. The api_name='textgen' param at the end basically creates an API endpoint that lets you call that function via the API, so the function being called by the UI and by the API is exactly the same. The show_progress=... is what tells the UI whether it should show the loading/progress bar when the function is called. In this case, it bases that on the --no-stream setting, so if that's enabled it'll show the loading bar; otherwise it'll just stream in the text as it's generated.

Man, if this is true, it's kind of a big issue

Yeah, maybe if you're planning on sharing a running model with a lot of people, but this implementation I think is mainly set up for personal use/testing... at least for now.

We also can't be the first people to run into this - I'm sure others have had this issue with Gradio before.

I think the primary issue here is that Gradio is designed specifically for Big Data researchers and stuff, not really for people who want to make things with it. I think it's really good for what it is, but I suspect Gradio will continue to be a roadblock which might force people who are trying to make more interactive type stuff to ditch the easy Gradio UI and just keep the python backend.

Last Q: When you say you hit the API, is that the API that is loaded up in extensions/api/script.py, or a different API? Can you post an example curl or something, if it's handy?

For testing with the API I just used the api-example.py file; I modified it to use the port and IP for my machine running the server (for 0.0.0.0 you'll need to use localhost)... but looking at the file, it seems like you could just POST the JSON yourself to the http://{server}:7860/run/textgen endpoint (this is the Generate button mentioned above).
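Something like this, going by Gradio's /run/<api_name> convention - a sketch only, since the exact contents of the "data" list have to match the inputs the Generate handler expects (shared.input_params):

    # Sketch: POST directly to the auto-generated 'textgen' endpoint.
    # The 'data' payload below is a placeholder.
    import requests

    resp = requests.post(
        'http://localhost:7860/run/textgen',
        json={'data': ['This shape has three sides:']},
    )
    print(resp.json()['data'])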

tigerbears commented 1 year ago

@tensiondriven @bmoconno Good thread; I've been looking into this a little too. (Dusty objc background w/minimal Python here.) A while ago I managed to hack some ugly state restoration in when refreshing your browser window, which confirmed the single-user design of the current code; I could update conversation state from two separate computers by refreshing to pick up the other's changes. I'd post my changes, but they got a little crufty and are definitely not efficient.

And yeah, Gradio seems like it might be good at what it's trying to do, but it's not trying to be a robust product toolkit. Agreed it's a good way to start but will expose limits quickly.

I sorta want to help banging on moving some of this data into lightweight objects that could be dumped into sessions and/or facilitate chat API. I suspect from my aforementioned hackery that some of that will have to happen before a chat API can work well. Are there any other thoughts, plans, designs or work on that front? Don't want to step on any toes or plans!

tensiondriven commented 1 year ago

Gradio is so widely used, and its session management so well documented, that I have to think what we need is well-supported; it probably just hasn't been a design concern until now. Soup from a stone, as the story goes.

My current thinking is that some things would need to change for this to work, but they wouldn't be too disruptive. We would need some sort of session token to identify each session. If we had that, we could list sessions, and give users the choice of joining an existing one or creating a new one. As a simple "hack" (not fond of that word in this context, but it works), a session value appended as a query parameter could allow someone to move their session. The log files would need session info prepended, or maybe there's SQLite in our future (not in love with this, but once configuration and state management gets hairy enough, something like that will be useful).

Sorry if I'm rambling - I'm excited about this!

If you did want to post your code, or even the relevant parts with some comments, that could be super useful, assuming Gradio hasn't changed too much internally since then.

Otherwise, I'll probably get together with GPT4 later and see if we can make sense of it. This might be trivial, for the right mind.

bmoconno commented 1 year ago

...you need to actually click it first to kick off the refreshes. Am still looking for a better solution.

@brandonj60 I found a slightly better solution, but it's still janky while text is being generated, just gets rid of the need to press a button:

To test it out you can add the following code to the server.py file after shared.gradio['display'] is set around line 296:

# This wrapper is necessary to make the "every" attribute work
def refresh_html_chat():
    return generate_chat_html(shared.history['visible'], shared.settings['name1'], shared.settings['name2'], shared.character)

# This causes the refresh_html_chat function to run every 2 seconds.
shared.gradio['refresh_chat'] = shared.gradio['interface'].load(refresh_html_chat, [], shared.gradio['display'], every=2)

I've been trying to figure out a way to cancel the event while text is being generated, using something like:

shared.gradio['display'].change(None, None, None, cancels=shared.gradio['refresh_chat'])

Which seems to work, but I can't figure out a way to re-add it, because even if I add something to shared.py to track whether text is currently being generated, I can't re-add a new .load event - no matter what I try, it's not in the with gr.Blocks context anymore.

But I guess this is a step closer to what you were hoping for?

bmoconno commented 1 year ago

Small update: I added a variable to shared.py called is_generating_text. It defaults to False, and I set it to True at the start of the generate_reply method in text_generation.py, then set it back to False in the finally section of that same method.

I then updated the above code to replace the load line with:

shared.gradio['refresh_chat'] = shared.gradio['interface'].load(refresh_html_chat, [], None if shared.is_generating_text else shared.gradio['display'], every=2)

This is a lot less jarring, but still not perfect. There's still a ton of issues that will probably be solved by adding session/state stuff. I'm probably busy the rest of today, but I'll try to look into that tomorrow if I can find some free time.

tensiondriven commented 1 year ago

@tigerbears @bmoconno @brandonj60 Thanks for all the legwork and posting your progress.

I just spent a decent amount of time researching this and I've come to similar conclusions and strategies as everyone else. Here's a summary of my observations, and then some thoughts about my personal needs/interest:

My personal reflection:

I was really hoping to contribute to text-generation-webui for the work I'm interested in doing personally, in hopes that others could build on it and I could benefit from it, but based on these constraints, I'm not sure it makes sense. Still totally open to that possibility, though. For what I'm interested in, it may make sense to apply a similar hack as above so I can use TGWUI. The problem with session bleed worries me though, as I'd like to be able to show my work to others, but I don't want my sessions bleeding into theirs. (I realize this is a little vague as I haven't described all of my requirements yet, but thought it would be useful to share, or possibly useful anyway.)

Another observation, as a Python outsider, is that Python is great for prototyping because of how lenient it is with scope - I can define variables in a function and they're available elsewhere, which is convenient, but it suffers all the problems of other similarly lenient languages. As a long-term professional web developer who has to work on a lot of projects, I noticed there doesn't seem to be as much attention put on readability, modularity, etc. as I'd expect, but this may be a symptom of how fast the project is moving, or of the general disposition of data scientists vs web developers. The very cool thing is that I was able to go from having not touched Python in 10 years to finding issues in about 1.5 days! Python is really forgiving in the early stages of development, but that forgiveness is rescinded as project size grows if application design isn't constrained.

I will likely use TGWUI to stand up a model and do inference via its API. The good news is that since the API is stateless, each request will be isolated. I'm also curious to find out if Gradio/TGWUI's http/rest API can run multiple concurrent requests. The bad news is that I'll have to do all of my own session/state management. Given my personal preference for Elixir, whatever I build will probably be done in Elixir. It has a wonderful concurrency model, and with recent additions like Livebook and Axon, plus the venerable Phoenix and LiveView, I think it will fill in the gaps that are missing. I don't intend to do any actual machine-learning/data-science work in Elixir; really just going to use it to glue APIs together and get data in/out of TGWUI.

I don't want any of this to sound like I'm shitting on TGWUI. I love how this project is taking on being the boundary between the user and the language models. The amount of work and collaboration I've seen getting all the different models, different variations of models, quantization, training, and inference strategies working is really inspiring. The diversity of experience in the userbase is awesome, too - it's designed for amateurs to jump in and spin up a model, and it's incorporating bleeding-edge features by the day. I think there's probably a lot to learn from stable-diffusion-webui in terms of how that project has evolved. I would love to listen in on a conversation between @oobabooga and @automatic1111 discussing the challenges of managing such fast-growing, cutting-edge projects! text-generation-webui is far and away the most exciting open source project I've ever been involved with.

Regarding this specific ticket, it looks like it can be solved pretty easily with the strategy @bmoconno is proposing, and without negative impact on the rest of the app.

bmoconno commented 1 year ago

@tensiondriven

text-generation-webui is currently not designed for multiple concurrent clients (multiple independent sessions). This needs to be fixed before anything like what we're talking about will be useful, I think. As far as I know, there is no "client list" state tracking, which seems a natural next step for improvement.

I wasn't able to find any time today to test making the server.py support multiple users without having them share the same context, but I was thinking it might be as simple as using the built-in State from Gradio that you mentioned earlier in this thread, and save the current shared global object to it:

user_state = gr.State(shared)

Then just work through the create_interface function and update it to use the new user_state variable - which is easier said than done, since some of the functions called from other files also use the global shared variable, so we might need to start passing data between functions instead of just using the global variable.
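For illustration, here's a minimal standalone sketch of Gradio's per-session State pattern (a toy app, not this repo's code), showing how the state threads through an event handler as both input and output:

    import gradio as gr

    with gr.Blocks() as demo:
        # Each connected browser session gets its own copy of this value.
        history = gr.State([])

        msg = gr.Textbox(label='Message')
        out = gr.Textbox(label='History')

        def respond(message, history):
            history = history + [message]
            return '\n'.join(history), history

        # The State goes in as an input and comes back as an output;
        # Gradio keeps a separate copy per session.
        msg.submit(respond, [msg, history], [out, history])

    demo.launch()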

Once this change is made, though, I'm not entirely sure how we would then allow API calls for specific users; we might need to also use the built-in Request object with the --gradio-auth-path flag to require usernames/passwords. Then we could expand on your html_needs_update idea and, in theory, only update the UI for specific users who have received an update via the API.

I will likely use TGWUI to stand up a model and do inference via its API.

I believe this is probably the best path forward right now, like you mentioned in another post, it seems like a lot of the development right now is focused on supporting new technologies and models as they crop up, which is definitely important.

In the next day or so, I'll probably look at the State stuff mentioned above, or I was also thinking about tackling the request I've seen a few times to make the API work for the chat modes... and maybe also an extension to support multiple characters in chat that would send different context based on which of the characters is speaking... But as I'm still trying to learn the repo (and Python), and I learn best by figuring stuff out trying to solve problems, if there's something at the top of your wishlist that you'd like me to help figure out, let me know and I'll try to put something together.

P.S. Sorry if this comment re-opened this issue 😄

tensiondriven commented 1 year ago

I'm in a similar boat and it sounds like we're pretty aligned, maybe we'd do well to collaborate real-time via slack/discord?

bmoconno commented 1 year ago

Sure, I'm mostly on Discord since we use it for work. bmoconno#4583

I just joined the unofficial discord for text-generation-webui from this post, but instantly muted the channel. I'll stick in there for now though, checking in on the dev channels to see what other people are up to.

ye7iaserag commented 1 year ago

Did you guys get anywhere with this? I tried to create an event handler (html.change) to use with a JS websocket client on the same Gradio socket, which actually worked, but the queue function called in the server.py file allows only one connection at a time, so I had to scrap the idea.

Something like:

webSocket = new WebSocket("ws://127.0.0.1:8889/queue/join"); // window.gradio_config.root

webSocket.onmessage = (event) => {
    let obj = JSON.parse(event.data);
    // When the server asks for a hash, identify the function to run and the session.
    if (obj.msg === "send_hash") webSocket.send('{"fn_index":60,"session_hash":"iqh9bpzlhon"}');
    console.log(event.data);
};

and have a generator function on Gradio's side that yields values.
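Roughly, the server-side generator could look like this (purely hypothetical wiring, reusing names from the snippets earlier in the thread):

    import time

    # Hypothetical handler: yield fresh chat HTML forever, so a client
    # holding the websocket open keeps receiving updated frames.
    def stream_chat_html():
        while True:
            yield generate_chat_html(shared.history['visible'],
                                     shared.settings['name1'],
                                     shared.settings['name2'],
                                     shared.character)
            time.sleep(2)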

bmoconno commented 1 year ago

Unfortunately, no, I haven't gone back to looking at this yet. I haven't had a lot of time to work on this stuff, and when I do, I've been trying to work on the deeper things first, since I already had a "working" solution for this with the every argument; I just figured I'd go back and refine it later.

I do like your approach though, would be good if we could make something like that work.

ye7iaserag commented 1 year ago

My only problem with reusing the Gradio websocket manually is that the function index has to be found by hand. Otherwise it behaves much like long polling over HTTP, though I think it wouldn't time out if the generator yields indefinitely. It also requires setting concurrency_count > 1 in the queue function.
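(For reference, that last part would be something like the line below - assuming the queue() call in server.py accepts Gradio's concurrency_count argument, as it did in versions of that era:)

    # Allow more than one simultaneous client on the queue.
    shared.gradio['interface'].queue(concurrency_count=4)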

github-actions[bot] commented 7 months ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.