oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

train lora #1780

Closed one-pip closed 1 year ago

one-pip commented 1 year ago

Are there any instructions for training a LoRA? Ideally a YouTube video. Thanks.

ClayShoaf commented 1 year ago

https://github.com/oobabooga/text-generation-webui/blob/main/docs/Training-LoRAs.md

ClayShoaf commented 1 year ago

Did that answer your question?

GamingDaveUk commented 1 year ago

For clarity, as I am currently looking into how to train: I take it from the link that I cannot just give it the class files from https://github.com/Suprcode/mir3-zircon, but would need to make a JSON file with a description of everything a developer is likely to ask?

[
    {
        "instruction": "Where do I make changes to the player during the game?",
        "input": "",
        "output": "All code related to a player object is located in PlayerObject.cs in ServerLibrary/Models/. You can reference the object via the SEnvir players list, which contains a list of players currently logged into the server."
    },
    {
        "instruction": "How do I know what pets a player has?",
        "input": "",
        "output": "PlayerObject.cs contains a pets list. If you have the PlayerObject referenced, you can get pets.count, or you can iterate through the pets list, which is a list of MonsterObjects usually (but not always) with the player object referenced as PetOwner."
    }
]

and so on? So these large language models are not really AI, but essentially just really large FAQs with so much information that they give the illusion of generating a response? I have to say I feel a bit sad at that realisation, though up till now I had been completely fooled.
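For reference, a file in the instruction/input/output shape shown above can be produced with a few lines of Python. This is only a minimal sketch: the `qa_pairs` list and the `to_records` helper are hypothetical names used for illustration, and the two pairs are shortened versions of the examples above.

```python
import json

# Hypothetical Q&A pairs, shortened from the two examples above.
qa_pairs = [
    ("Where do I make changes to the player during the game?",
     "All code related to a player object is located in PlayerObject.cs."),
    ("How do I know what pets a player has?",
     "PlayerObject.cs contains a pets list that you can iterate through."),
]


def to_records(pairs):
    """Turn (question, answer) tuples into instruction/input/output dicts."""
    return [{"instruction": q, "input": "", "output": a} for q, a in pairs]


records = to_records(qa_pairs)

# Serialise to a JSON array; write this string to a file to use it as a dataset.
dataset_json = json.dumps(records, indent=4)
print(dataset_json)
```

Building the list in code and letting `json.dumps` do the serialising avoids hand-editing mistakes such as a missing bracket or a trailing comma.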

ClayShoaf commented 1 year ago

It depends on the dataset you are using. If you were to feed in the raw text from the docs and train on that, it should be better at answering questions about the game generally. Building effective datasets is the most time-consuming part of training.

If you're thinking that it's just a database that looks up replies trained into it, that's not what is happening with transformer models.

GamingDaveUk commented 1 year ago

Well, I had planned to train one on the class files, so people can use it to answer questions on how the code works, what functions do what, and how to add x, y or z.

Looking at the link you provided, that's not possible. I would need to think of all the questions people may ask and answer them, effectively just making a large FAQ. Unless I misread the link? If I understand it correctly, then I may as well just continue to answer questions myself lol

I really thought the AI could read class files, understand how they interact and then answer questions about them lol. I had no idea it's just one big search engine.

Edit: missed your second paragraph... OK, now I am more confused.

ClayShoaf commented 1 year ago

I'm on a phone right now, so I can't give a super detailed explanation, but the way it works is the model tries to figure out what the most likely next "token" (usually a word or a piece of a word) is. The emergent property of seeming to have logic or understanding is a byproduct of the massive amount of data that the models are trained on and the massive number of parameters (or "weights") that are being processed.
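To make "most likely next token" concrete, here is a toy sketch. It is not how a transformer works internally: a real LLM uses a neural network that looks at the whole preceding context, whereas this example just counts which word follows which (a bigram model). The corpus and the function names are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for every token, which tokens follow it and how often."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never followed."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Tiny made-up corpus, split on whitespace instead of using a real tokenizer.
corpus = "the player has a pet and the player has a sword".split()
model = train_bigram(corpus)

print(most_likely_next(model, "the"))     # "player" follows "the" both times
print(most_likely_next(model, "player"))  # "has" follows "player" both times
```

Even this crude version is not a lookup of stored answers: it only stores statistics about what tends to come next, and the text it produces is assembled one token at a time.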

When you train on instruction formats, the model can "pull data" (not really what's happening, but for this purpose, you might as well think of it that way) from the massive datasets that it has already been trained on. So when you see training sets that have alpaca formatting, for example, you are more so training it to respond correctly to that particular formatting than you are training on the information that is in the Q&A.
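As a rough illustration of what "alpaca formatting" means, each record is turned into one block of plain text before training. The template wording below follows the commonly used Stanford Alpaca prompt as best I recall it, so treat it as an assumption and check it against the format file you actually train with. `format_example` is a hypothetical helper name.

```python
# Assumed Alpaca-style templates; verify the exact wording against your own format file.
NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)
WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def format_example(record):
    """Render one instruction/input/output record as a single training string."""
    template = WITH_INPUT if record.get("input") else NO_INPUT
    # str.format ignores keyword arguments that the chosen template does not use.
    return template.format(**record)


example = {
    "instruction": "How do I know what pets a player has?",
    "input": "",
    "output": "PlayerObject.cs contains a pets list that you can iterate through.",
}
text = format_example(example)
print(text)
```

The model is then trained to continue text of this shape, which is why the layout itself is learned along with the facts in the answers.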

Don't get me wrong, the Q&A data is also being trained in, but the model is not relegated to only answering those exact questions.

What you would probably want to do is train it on your doc files, so that it has that data trained in. If you train on top of a model that has already been trained on instruct formatting, it should be able to answer questions about your docs better.

There is a LoRA on Hugging Face that is trained on Unreal Engine docs that you can play around with to see what I mean. It is not trained on Q&A, it is trained on the raw doc files.
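One way to prepare raw text for this kind of training is to merge the files into a single body of text, which then gets cut into overlapping pieces. The sketch below is an assumption-laden illustration, not the web UI's own code: `gather_text` and `chunk_text` are hypothetical helpers, the `*.cs` pattern and the `// File:` header are arbitrary choices, and sizes are counted in characters here while real training counts tokens.

```python
from pathlib import Path


def gather_text(root, pattern="*.cs"):
    """Concatenate every matching file under `root`, each preceded by its relative path."""
    root = Path(root)
    parts = []
    for path in sorted(root.rglob(pattern)):
        body = path.read_text(encoding="utf-8", errors="replace")
        parts.append(f"// File: {path.relative_to(root)}\n{body}")
    return "\n\n".join(parts)


def chunk_text(text, size, overlap):
    """Split `text` into pieces of at most `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Small demonstration on a literal string instead of real files.
chunks = chunk_text("abcdefghij", size=4, overlap=2)
print(chunks)
```

The overlap means text that straddles a chunk boundary still appears whole in at least one chunk, as long as it is no longer than the overlap.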

GamingDaveUk commented 1 year ago

When you say doc files, there is very little beyond the occasional comment in the code and the guides that some of us have written. Or do you mean the .cs files themselves?

This is where I am confused. Say I took a model designed for Q&A, used it as the base and put all the class files (in their folder structure) into a training folder. What would I need to do to reference the data for the LoRA to work?

I have created LoRAs for image generation, and every image has needed a text file with a description of the image. I was somewhat assuming this would be the same, so PlayerObject.cs would need a PlayerObject.txt with a detailed description of what was in the file. But the link you posted seemed to suggest it all needs to be in one JSON file.

I know it's all new tech, and I know text isn't as well documented as images, but I really wish there were some step-by-step guide videos on this sort of thing lol

My end goal is to create a LoRA that people can use to help them code new features into the existing code base and to ask questions about how the code base works. Basically a better version of me (one that understands the code way more than I personally can) that can answer any questions they have as they experiment with the code.

ClayShoaf commented 1 year ago

Unfortunately, the current models may not be up to par for exactly what you're looking for. Consider us in the SD 1.4 territory, to put it in terms that you're familiar with.

You are correct that training LoRAs does not work the same way with LLMs as it does with something like SD. LLMs are literally just trying to do text completion. The whole "chat" feature is a result of using input text that looks like:

The following is a chat log between User and Bot.
User: Hello, Bot
Bot: Hello, User!
User: What is the capital city on Mars?
Bot:

And then the LLM starts filling in from there. The only reason it stops is that the software around the model, like this repo, recognizes where the reply ends and cuts off the generation. If you were to let it keep generating, it would generate more text for User and Bot (trying to complete the "chat log"), keep going, and probably spin off into random territory.
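The cut-off step can be sketched in a few lines. This is not this repo's actual code, just an illustration of the idea: `build_prompt` and `cut_at_stop` are hypothetical helpers, and `raw` is a made-up example of what a model might produce if it were left to run past its own reply.

```python
def build_prompt(history, user_message, user="User", bot="Bot"):
    """Lay out the conversation as a plain-text chat log ending with an open Bot turn."""
    lines = [f"The following is a chat log between {user} and {bot}."]
    for speaker, text in history:
        lines.append(f"{speaker}: {text}")
    lines.append(f"{user}: {user_message}")
    lines.append(f"{bot}:")
    return "\n".join(lines)


def cut_at_stop(generated, stop_strings):
    """Keep only the text before the earliest stop string, if any occurs."""
    cut = len(generated)
    for stop in stop_strings:
        idx = generated.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generated[:cut].strip()


prompt = build_prompt(
    [("User", "Hello, Bot"), ("Bot", "Hello, User!")],
    "What is the capital city on Mars?",
)

# Made-up continuation: the model keeps writing the chat log past its own turn.
raw = " Mars has no capital city.\nUser: Are you sure?\nBot: Yes."
reply = cut_at_stop(raw, ["\nUser:"])
print(reply)
```

Only `reply` is shown to the user; everything after the first `User:` line is thrown away.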

It is not generating images from text tokens, it is generating the next text token, given the previous tokens (and using its own outputs as part of the input for the next token generation). It works fundamentally differently than image generation models.
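The feedback loop described here (each output token becomes part of the input for the next one) looks roughly like this. The loop is the general idea only; `toy_next_token` and its lookup table are a made-up stand-in for the model, which in reality scores every token in its vocabulary using the whole context.

```python
def generate(next_token, prompt_tokens, max_new_tokens, eos="<eos>"):
    """Repeatedly ask for the next token and append it to the running context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # the model sees everything generated so far
        if tok == eos:
            break
        tokens.append(tok)
    return tokens


# Toy stand-in for a model: it only looks at the last token.
TABLE = {"Hello": ",", ",": "User", "User": "!", "!": "<eos>"}


def toy_next_token(tokens):
    return TABLE.get(tokens[-1], "<eos>")


out = generate(toy_next_token, ["Bot", ":", "Hello"], max_new_tokens=10)
print(out)
```

Generation ends either when the stand-in model emits the end marker or when `max_new_tokens` is reached, whichever comes first.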

If you internalize this concept, you can start to think about how the training works. There is no "classification", it is all just tokens and what it thinks the most likely next token will be. This is how ChatGPT works and it's how local models work.

The training is simpler than you are thinking, but to get what you want, you will have to consider how it works. Again, I encourage you to look at the UE 5 LoRA: https://github.com/bublint/ue5-llama-lora

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.