tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

sharing Alpaca-LoRA chatbot (Gradio app) #87

Open · deep-diver opened this issue 1 year ago

deep-diver commented 1 year ago

I hope you guys find it useful. Here is the repo: https://github.com/deep-diver/Alpaca-LoRA-Serve

I am currently running the 30B version on my VM. Try it while it is live, and share any interesting results :) : https://notebooksf.jarvislabs.ai/43j3x9FSS8Tg0sqvMlDgKPo9vsoSTTKRsX4RIdC3tNd6qeQ6ktlA0tyWRAR3fe_l/

DanielWe2 commented 1 year ago

Thanks. I have played around with it to test its German capabilities. It's better than the 7B version, which is great. I'm not sure whether that is because of the bigger dataset LLaMA 30B was trained on compared to the two smaller versions, or whether the bigger model itself is the reason.

In my testing over multiple messages, it only made one grammar error. Not perfect, but way better than 7B, which didn't sound like a true native speaker of German.

Is that an 8-bit or 4-bit version of the model?

And how does your code compare to the "text-generation-webui" repo?

espeon commented 1 year ago

This is really cool! I messed around with it earlier and was just wondering: what GPU(s) is this running on?

deep-diver commented 1 year ago

The 8-bit one, running on an A6000.

deep-diver commented 1 year ago

I'm not sure about the web UI project. I built this for real serving needs: batch request processing, plus some handy buttons like summarize and continue to work around the low-resource GPU problem.
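
For illustration, here is a rough sketch of the batching idea using Gradio's `batch=True` option; `generate_batch` is a hypothetical stand-in, not the actual serving code in Alpaca-LoRA-Serve:

```python
import gradio as gr

# Hypothetical stand-in for the real model call. With batch=True, Gradio
# gathers concurrent requests and calls the function once with a list of
# inputs; it must return a list of lists, one inner list per output.
def generate_batch(prompts):
    responses = [f"(response to: {p})" for p in prompts]
    return [responses]

with gr.Blocks() as demo:
    instruction = gr.Textbox(label="Instruction")
    response = gr.Textbox(label="Response")
    submit = gr.Button("Submit")
    # Up to max_batch_size concurrent requests get merged into one call.
    submit.click(generate_batch, inputs=instruction, outputs=response,
                 batch=True, max_batch_size=4)

# Batched functions require the queue to be enabled.
demo.queue().launch()
```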

DanielWe2 commented 1 year ago

OK, is it possible that a commit is missing from app.py? I can't find the place where it adds the summarize and continue buttons.

deep-diver commented 1 year ago

Dang, I forgot to upload it. Will share when I commit.

BEpresent commented 1 year ago

Thanks, this is really nice. I noticed it tends to ignore the context. Is there a recommended format for providing context?

deep-diver commented 1 year ago

@BEpresent that is something I don't know; we need some experimentation. @DanielWe2 the code has been updated.
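
For reference, the stock Alpaca template (the one this repo's generate.py uses) has an "Input" field, which is the natural place to put context; a sketch:

```python
# Standard Alpaca prompt template; context goes in the "Input" field.
def build_prompt(instruction: str, context: str = "") -> str:
    if context:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )
```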

makovez commented 1 year ago

Hi, really interesting @deep-diver. I noticed it remembers old Q/A? Did you use the same dataset as Alpaca to fine-tune, or another one that also has follow-up Q/A?

deep-diver commented 1 year ago

@makovez the same dataset (actually the one cleaned up in this repository).

makovez commented 1 year ago

I'm trying the Colab version, but I'm not able to make it remember old discussions. What did you change to make it remember them? https://colab.research.google.com/drive/1eWAmesrW99p7e1nah5bipn0zikMb8XYC

deep-diver commented 1 year ago

@makovez

You can actually use the notebook in my repository (here is the link). Just remember to change the checkpoint.

makovez commented 1 year ago

I just looked at your repo https://github.com/deep-diver/Alpaca-LoRA-Serve. I noticed that you use: "context aware by keeping chatting history with the following string..."

Do you think this could be improved with a new dataset that also includes follow-up Q/A?

Because what if the user changes topic? If you always blindly include the old user inputs, they might be out of context, since the user may have moved on.
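
A minimal sketch of what "keeping chat history in the prompt" can look like, with a cap on turns so stale context ages out; all names here are illustrative, not the repo's actual code:

```python
def build_chat_prompt(history, new_message, max_turns=5):
    # history: list of (user_msg, bot_msg) pairs. Keep only the last N
    # turns so old, possibly off-topic exchanges don't crowd the prompt.
    recent = history[-max_turns:]
    lines = []
    for user_msg, bot_msg in recent:
        lines.append(f"### Instruction:\n{user_msg}")
        lines.append(f"### Response:\n{bot_msg}")
    lines.append(f"### Instruction:\n{new_message}")
    lines.append("### Response:\n")
    return "\n\n".join(lines)
```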

deep-diver commented 1 year ago

@makovez This is just a playground. To reflect reality better, a lot would need to change.

My basic idea is to let users choose when to summarize. Since this is running on a single GPU with int8, the inference speed is not great, so it wouldn't be reasonable to make people wait a very long time until all the tokens are generated. It gets even worse when there is a lot of traffic.

For this reason, I wanted to build a somewhat better UI for people to interact with. I set timeout=30, so the response might not be complete, but I wanted to give some handy options like continue and summarize to carry on.
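
The "continue" idea can be as simple as feeding the truncated answer back in and generating again; a hypothetical sketch, where `generate_fn` is whatever wraps the model:

```python
def continue_response(generate_fn, prompt: str, partial: str) -> str:
    # Re-run generation with the truncated answer appended, so the model
    # picks up roughly where the timed-out pass stopped.
    continuation = generate_fn(prompt + partial)
    return partial + continuation
```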


About the Q/A dataset, I honestly don't know. I guess RLHF would be very helpful.

deep-diver commented 1 year ago

There need to be lots of tweaks to the pre/post-processing. For instance, when the response is written in Markdown, it won't be rendered properly, so it should be converted to HTML. Also, the HTML-converted text should be converted back to plain text, since lots of HTML tags wouldn't help LLaMA understand the context.
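
A sketch of that round-trip, assuming the `markdown` package for rendering and a simple tag-strip for the model-facing history:

```python
import html
import re

import markdown  # third-party: pip install markdown

def to_display(text: str) -> str:
    # Render the model's Markdown output as HTML for the chat UI.
    return markdown.markdown(text)

def to_model_context(rendered: str) -> str:
    # Strip the HTML tags back out before feeding the turn into the next
    # prompt, since tag soup only wastes context tokens.
    return html.unescape(re.sub(r"<[^>]+>", "", rendered))
```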

makovez commented 1 year ago

But do you think this is what ChatGPT does? Hard-code the previous Q/A into the new input?

Because otherwise, how could it remember context for each user's separate chat?

deep-diver commented 1 year ago

Yep, at some point ChatGPT has to do this, since there is a max input size limit. It depends on how accurately it remembers the details of the previous conversations. (I mean, ChatGPT probably summarizes things.)

deep-diver commented 1 year ago

It does not remember things (as in, it does not hold the context in memory). Instead, it consumes a bunch of text, including enough earlier text to establish the context. It simply forward-passes the given data, and attention figures out which tokens to look at in the long, long text. No?

DanielWe2 commented 1 year ago

But do you think this is what ChatGPT does? Hard-code the previous Q/A into the new input?

Because otherwise, how could it remember context for each user's separate chat?

I don't think ChatGPT remembers previous sessions at all (at least as of the last time I used it). Bing doesn't.

If you want some kind of persistent memory, you would most likely connect it to some kind of database.

LangChain provides infrastructure for that, with different kinds of memory for chatbots. I think it would be a good idea to base the chatbot on top of it; it has already solved a lot of those problems and also has tools to add different services (online search or database search, for example) to a chatbot.

Take a look at https://langchain.readthedocs.io/en/latest/modules/memory.html and the sub-chapters on the left for different types of local or long-term memory.
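
For example, the buffer and summary memories mentioned there look roughly like this (as the LangChain API stood at the time; exact imports have moved between versions):

```python
from langchain.chains import ConversationChain
from langchain.llms import OpenAI  # any LLM wrapper would do here
from langchain.memory import ConversationBufferMemory, ConversationSummaryMemory

llm = OpenAI()

# Buffer memory replays the raw transcript into every prompt.
buffered = ConversationChain(llm=llm, memory=ConversationBufferMemory())

# Summary memory keeps an LLM-maintained running summary instead.
summarized = ConversationChain(llm=llm, memory=ConversationSummaryMemory(llm=llm))

print(buffered.predict(input="Hi, remember that my name is Daniel."))
print(buffered.predict(input="What is my name?"))
```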

makovez commented 1 year ago

It does not remember things (as in, it does not hold the context in memory). Instead, it consumes a bunch of text, including enough earlier text to establish the context. It simply forward-passes the given data, and attention figures out which tokens to look at in the long, long text. No?

Probably yes.

deep-diver commented 1 year ago

So, a similar approach to what LangChain does, because it is very straightforward and the behaviour of the model doesn't change. Buffer: I pass the N previous conversations directly. Summary: I summarize them and pass the summary to the model as context.

I just haven't experimented enough to find the best prompts yet, so I haven't had time to combine the two.
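
The summary half could reuse the model itself with an Alpaca-style prompt; a hypothetical sketch, where `generate_fn` is again whatever wraps the model:

```python
SUMMARIZE_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\nSummarize the conversation below in a few sentences.\n\n"
    "### Input:\n{history}\n\n### Response:\n"
)

def summarize_history(generate_fn, history_text: str) -> str:
    # Compress the transcript with the same model, then pass the summary
    # (instead of the raw history) as context for the next turn.
    return generate_fn(SUMMARIZE_TEMPLATE.format(history=history_text))
```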

DanielWe2 commented 1 year ago

So, a similar approach to what LangChain does, because it is very straightforward

LangChain can also store it in a database (as a summary or as a knowledge graph); that is way more complex, and I would like to play around with it.

I think it would be worth adding a few hundred or thousand training examples in the structure needed for LangChain integration, like tool usage and memory.

deep-diver commented 1 year ago

Sounds great. I think it is pretty much doable; I just need to figure out how to make it fast enough. As you can see, it takes about 20-30 seconds to generate a handful of text.

If I feed many texts into it, it gets even slower. So I guess I need to find a way to speed it up before adding a memory structure.

makovez commented 1 year ago

Sounds great. I think it is pretty much doable; I just need to figure out how to make it fast enough. As you can see, it takes about 20-30 seconds to generate a handful of text.

If I feed many texts into it, it gets even slower. So I guess I need to find a way to speed it up before adding a memory structure.

How do you plan to speed it up?

deep-diver commented 1 year ago

I need to explore every possible combination of hyperparameters, plus float16 instead of int8.
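
Roughly the difference in loading code with transformers; the checkpoint name is just an example, and you would pick one variant, not load both:

```python
import torch
from transformers import LlamaForCausalLM

MODEL = "decapoda-research/llama-30b-hf"  # example checkpoint name

# int8 (requires bitsandbytes): smallest VRAM footprint, slower generation.
model = LlamaForCausalLM.from_pretrained(
    MODEL, load_in_8bit=True, device_map="auto"
)

# float16: about twice the VRAM, usually noticeably faster inference.
# model = LlamaForCausalLM.from_pretrained(
#     MODEL, torch_dtype=torch.float16, device_map="auto"
# )
```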

DanielWe2 commented 1 year ago

For small inputs, the 4-bit GPTQ version is faster. But for bigger contexts there is some kind of quadratic algorithm in the stack (presumably attention, whose cost grows quadratically with sequence length) that needs to be optimized...

BEpresent commented 1 year ago

@BEpresent that is something I don't know; we need some experimentation. @DanielWe2 the code has been updated.

I get something like this when adding a context. I'm not sure if I made any obvious mistake?

(screenshot)
makovez commented 1 year ago

@BEpresent that is something I don't know; we need some experimentation. @DanielWe2 the code has been updated.

I get something like this when adding a context. I'm not sure if I made any obvious mistake?

(screenshot)

Actually, context doesn't seem to work. (screenshot)

But doing this works.

PS: Not sure where that John Smith comes from, but OK 🤣

BEpresent commented 1 year ago

Right, it seems to ignore the initial context, but it does seem to pick it up from the chat if mentioned.

deep-diver commented 1 year ago

Thanks for the feedback! I need to figure out how to craft a better prompt template.

deep-diver commented 1 year ago

Updated! Check out the new issue.