deep-diver opened this issue 1 year ago
Thanks. I have played around to test its German capabilities. It's better than the 7B version, which is great. I'm not sure whether that is caused by the bigger dataset that LLaMA 30B was trained on compared to the two smaller versions, or whether the bigger model itself is the reason.
In my testing across multiple messages it only made one grammar mistake. Not perfect, but it's way better than 7B, which didn't sound like a true native speaker of German.
Is that an 8-bit or 4-bit version of the model?
And how does your code compare to the "text generation web ui" repo?
This is really cool! Messed around with it earlier and was just wondering - what gpu(s) is this running on?
The 8-bit one, running on an A6000.
I'm not sure about the web UI project. I just built this for a real serving need: batch request processing, plus some handy buttons like summarize and continue to work around the low-resource GPU problem.
OK, is it possible that a commit is missing from app.py? I can't find the place where it adds the summarize and continue buttons.
Dang, I forgot to upload it. Will share when I commit.
Thanks, this is really nice. I noticed it tends to ignore the context. Is there a recommended format to create context?
@BEpresent that's something I don't know; we need some experimentation. @DanielWe2 the code has been updated.
Hi, really interesting @deep-diver. I noticed it remembers old Q/A. Did you use the same Alpaca dataset to fine-tune, or another one that also includes follow-up Q/A?
@makovez the same dataset (actually the one cleaned up in this repository).
I'm trying the Colab version, but I'm not able to make it remember old discussions. What did you change to make it remember them? https://colab.research.google.com/drive/1eWAmesrW99p7e1nah5bipn0zikMb8XYC
@makovez
you can actually use the notebook in my repository. Here is the link; just remember to change the checkpoint to the right one.
I just looked at your repo https://github.com/deep-diver/Alpaca-LoRA-Serve. I noticed that you use: "context aware by keeping chatting history with the following string.."
Do you think this could be improved by using a new dataset that also includes follow-up Q/A?
Because what if the user changes topic? If you always blindly feed in the old user inputs, they might be out of context, since the user may have moved on.
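For reference, the "keep the chat history in the prompt" approach quoted above can be sketched roughly like this (the function name and prompt template here are my own illustration, not taken from the repo):

```python
# Sketch of context-aware prompting by concatenating chat history.
# build_prompt and the exact template are hypothetical, not from the repo.

def build_prompt(history, new_instruction):
    """Prepend previous (instruction, response) turns so the model
    can 'see' the conversation so far in its input."""
    context = ""
    for instruction, response in history:
        context += f"### Instruction: {instruction}\n### Response: {response}\n"
    return context + f"### Instruction: {new_instruction}\n### Response:"

history = [("Hi, I'm Anna.", "Nice to meet you, Anna!")]
prompt = build_prompt(history, "What is my name?")
```

The downside makovez points out is visible here: every old turn is passed verbatim, even if the user has since changed topic.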
@makovez This is just a playground. To reflect reality better, a lot would need to change.
My basic idea is to let users choose when to summarize. Since this runs on a single GPU with int8, the inference speed is not great, so it wouldn't be acceptable to make people wait a very long time until all the tokens are generated. It gets even worse under heavy traffic.
For this reason, I wanted to build a somewhat better UI for people to interact with. I set timeout=30, so the response might not be complete, but I wanted to give some handy options like continue and summarize to go on.
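The timeout-plus-continue idea could be sketched like this (a minimal illustration; `generate_tokens` stands in for the real model's token stream, and the repo's actual implementation may differ):

```python
import time

# Sketch: stop decoding after a time budget, and let the user's
# "continue" button resume from the partial response.

def generate_with_timeout(generate_tokens, prompt, timeout=30.0):
    deadline = time.monotonic() + timeout
    out = []
    for tok in generate_tokens(prompt):
        out.append(tok)
        if time.monotonic() > deadline:
            break  # response may be incomplete; "continue" picks it up
    return "".join(out)

def fake_model(prompt):
    # stand-in for the real model: yields a few tokens
    for tok in ["Hello", ", ", "world", "!"]:
        yield tok

partial = generate_with_timeout(fake_model, "hi", timeout=30.0)
# a "continue" action would call the model again with prompt + partial
```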
About the Q/A dataset, I honestly don't know. I guess RLHF would be very helpful.
There need to be lots of tweaks to pre/post-processing. For instance, when the response is written in Markdown, it wouldn't render properly; it needs to be converted to HTML. Also, the HTML-converted text should be converted back to the original text, since lots of HTML tags wouldn't help LLaMA understand the context.
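The "convert back from HTML" step could be done with a small stdlib-only tag stripper like this (just a sketch; the repo may handle this differently):

```python
from html.parser import HTMLParser

# Strip HTML tags so the rendered response can be fed back to the
# model as plain text (illustrative; not the repo's actual code).
class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

def strip_html(html):
    stripper = TagStripper()
    stripper.feed(html)
    return stripper.text()

plain = strip_html("<p>Here is <code>some code</code>.</p>")
```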
But do you think this is what ChatGPT does, hard-coding previous Q/A into the new input?
Because otherwise, how could it remember context for each user's chat?
Yep, at some point ChatGPT must be doing this, since there's a maximum input size limit. It depends on how accurately it remembers the details of previous conversations. (I mean, ChatGPT presumably summarizes things.)
It does not remember things (as in, it does not hold the context in memory). Instead, it consumes a bunch of text, including the text that establishes the context. It simply forward-passes the given data, and attention figures out which tokens to look at in that long, long text. No?
> But do you think this is what ChatGPT does, hard-coding previous Q/A into the new input? Because otherwise, how could it remember context for each user's chat?
I don't think ChatGPT remembers previous sessions at all (at least as of the last time I used it). Bing doesn't either.
If you want some kind of persistent memory, you would most likely connect it to some kind of database.
LangChain does provide infrastructure for that, with different kinds of memory for chatbots. I think it would be a good idea to base the chatbot on top of it; it has already solved a lot of those problems and also has tools to add different services (online search or database search, for example) to a chatbot.
Take a look at https://langchain.readthedocs.io/en/latest/modules/memory.html and the sub-chapters on the left for different types of local or long-term memory.
> It does not remember things (as in, it does not hold the context in memory). Instead, it consumes a bunch of text, including the text that establishes the context. It simply forward-passes the given data, and attention figures out which tokens to look at in that long, long text. No?
Probably yes.
So, a similar approach to what LangChain does, because it is very straightforward and the behaviour of the model doesn't change:
- Buffer: I pass the N previous conversations directly.
- Summary: I summarize them and pass the summary to the model as context.

I just haven't experimented enough to find the best prompts yet, so I haven't had time to combine the two.
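The two memory styles described above can be sketched in plain Python (the `summarize` function here is a naive placeholder; in practice the LLM itself would produce the summary):

```python
# Illustration of buffer vs. summary memory; names are my own.

def buffer_memory(history, n=3):
    """Buffer: pass the last N turns verbatim as context."""
    return "\n".join(f"User: {u}\nBot: {b}" for u, b in history[-n:])

def summarize(history):
    # Placeholder: a real system would ask the model to summarize.
    topics = ", ".join(u for u, _ in history)
    return f"The conversation so far covered: {topics}"

history = [("What is LoRA?", "A low-rank fine-tuning method."),
           ("Does it work on 30B?", "Yes, with enough VRAM.")]
ctx_buffer = buffer_memory(history, n=1)   # only the most recent turn
ctx_summary = summarize(history)           # compressed view of all turns
```

The trade-off is the one discussed in the thread: the buffer is faithful but grows with the conversation, while the summary stays short but may drop details.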
> So, a similar approach to what LangChain does, because it is very straightforward
LangChain can also store it in a database (as a summary or as a knowledge graph); that is way more complex, and I would like to play around with it.
I think it would be worth adding a few hundred or thousand training examples in the structure needed for LangChain integration, like tool usage and memory.
Sounds great. I think it's pretty much doable; I just need to figure out how to make this happen fast enough. As you can see, it takes about 20-30 seconds to generate a handful of text.
If I feed many, many texts into it, it gets even slower, so I guess I need to find a way to speed it up before adding the memory structure.
> Sounds great. I think it's pretty much doable; I just need to figure out how to make this happen fast enough. As you can see, it takes about 20-30 seconds to generate a handful of text. If I feed many, many texts into it, it gets even slower, so I guess I need to find a way to speed it up before adding the memory structure.
How do you plan to speed it up?
I need to explore every possible combination of different hyperparameters, plus float16 instead of int8.
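Sweeping generation hyperparameters could look something like this (the grid values are made up, and `benchmark` stands in for actually timing the model's `generate` call):

```python
from itertools import product

# Sketch of a brute-force hyperparameter sweep; values are illustrative.
grid = {
    "temperature": [0.7, 1.0],
    "top_p": [0.9, 0.95],
    "num_beams": [1, 2],
}

def benchmark(cfg):
    # Placeholder: a real run would time model.generate(**cfg)
    # on a fixed prompt. Here we pretend beam count dominates latency.
    return cfg["num_beams"] * 1.0

# Expand the grid into one config dict per combination.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
best = min(configs, key=benchmark)
```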
For small inputs the 4-bit GPTQ version is faster, but for bigger contexts there is some kind of quadratic algorithm in the stack that needs to be optimized...
> @BEpresent that is something I don't know of. We need some experimentation. @DanielWe2 the code has been updated
I get something like this when adding a context. I'm not sure if I made any obvious mistake?
Actually context doesn't seem to work.
But doing this works.
PS: Not sure where that John Smith came from, but OK 🤣
Right, it seems to ignore the initial context, but it picks it up from the chat if it's mentioned.
Thanks for the feedback! I need to check how to craft a better prompt template.
Updated! Check out the new issue
I hope you guys find it useful. Here is the repo : https://github.com/deep-diver/Alpaca-LoRA-Serve
and I am currently running the 30B version on my VM. Try it while it's live, and share results if they are interesting :) : https://notebooksf.jarvislabs.ai/43j3x9FSS8Tg0sqvMlDgKPo9vsoSTTKRsX4RIdC3tNd6qeQ6ktlA0tyWRAR3fe_l/