johnsmith0031 / alpaca_lora_4bit

MIT License

flash attention #55

Closed ehartford closed 1 year ago

ehartford commented 1 year ago

Is it possible to use flash attention in order to fine-tune on longer, multi-turn conversations instead of just question-and-answer pairs?

ehartford commented 1 year ago

https://github.com/lm-sys/FastChat/search?q=flash

johnsmith0031 commented 1 year ago

Thanks! I'll try it later

dpyneo commented 1 year ago

@johnsmith0031 Thank you. May I ask whether multi-round conversations can be trained by directly adapting the GPT4All-like data format, i.e. `class TrainGPT4All(ATrainData): ... newline_tokens = self.tokenizer("\n", return_tensors="pt")["input_ids"][0, 1:]` — can a good conversational model be fine-tuned with line breaks as turn separators? The conversations are admittedly rather short. I briefly tried truncating to a maximum length of 700, which slows down training. By fine-tuning on top of my existing Q&A LoRA, a simple two-round conversation is still possible, although greedy decoding at inference time does not seem as good as for plain Q&A (it may simply be under-trained). At the time I tried running multi-round conversation data such as `{"topic": "Identify the odd one out.", "input": "The conversation between human and AI assistant.\n[|Human|] Identify the odd one out.\nTwitter, Instagram, Telegram\n[|AI|] Telegram\n[|Human|]"}` with `\r\n\r\n` as a record separator. I don't know whether flash attention would allow the context to be longer and reach 2048?
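For reference, here is a minimal sketch of how a multi-round sample in that `[|Human|]`/`[|AI|]` style could be assembled and tokenized with newline-separated turns, loosely following the quoted `TrainGPT4All` pattern. The helper names, the record layout, and the `decapoda-research/llama-7b-hf` tokenizer path are illustrative assumptions, not the repo's actual training code:

```python
# Minimal sketch (assumptions, not the repo's TrainGPT4All implementation):
# build a multi-round [|Human|]/[|AI|] prompt and tokenize it with newlines
# as turn separators, truncating at a fixed cutoff length.
from transformers import LlamaTokenizer

def build_prompt(turns):
    """Join (speaker, text) turns into one newline-separated conversation string."""
    header = "The conversation between human and AI assistant.\n"
    body = "\n".join(f"[|{speaker}|] {text}" for speaker, text in turns)
    return header + body + "\n[|Human|]"

def tokenize_conversation(tokenizer, turns, cutoff_len=2048):
    prompt = build_prompt(turns)
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=cutoff_len,
        padding=False,
        return_tensors=None,
    )
    # Plain causal-LM fine-tuning: labels mirror the input ids.
    result["labels"] = list(result["input_ids"])
    return result

if __name__ == "__main__":
    tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
    sample = [
        ("Human", "Identify the odd one out. Twitter, Instagram, Telegram"),
        ("AI", "Telegram"),
    ]
    print(tokenize_conversation(tokenizer, sample)["input_ids"][:32])
```

Raising `cutoff_len` toward 2048 is exactly where standard attention starts to dominate VRAM usage, which is the motivation for the flash-attention patch discussed below.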

ehartford commented 1 year ago

It's possible, but without flash attention it will consume too much VRAM.
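For intuition on why VRAM is the bottleneck: standard attention materializes a (heads × seq × seq) score matrix per layer, so memory grows quadratically with sequence length, whereas flash attention computes the same result block-wise without storing those scores. A rough back-of-envelope estimate, assuming LLaMA-7B-like shapes:

```python
# Back-of-envelope estimate of the attention score matrices that standard
# attention materializes (and flash attention avoids). Assumes LLaMA-7B-like
# shapes (32 layers, 32 heads) and fp16 scores kept for all layers at once,
# e.g. for the backward pass without recomputation.
def score_matrix_bytes(seq_len, n_layers=32, n_heads=32, bytes_per_el=2):
    return n_layers * n_heads * seq_len * seq_len * bytes_per_el

for seq_len in (700, 2048):
    gib = score_matrix_bytes(seq_len) / 2**30
    print(f"seq_len={seq_len:>4}: ~{gib:.1f} GiB of attention scores per sample")

# seq_len= 700: ~0.9 GiB;  seq_len=2048: ~8.0 GiB
# i.e. roughly (2048 / 700)^2 ≈ 8.6x more, on top of weights and optimizer state.
```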

gururise commented 1 year ago

Has anyone gotten this working? I saw the monkey patch in the README; has anyone tried training multi-round conversations with it?
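For context, that kind of monkey patch essentially replaces the explicitly materialized softmax(QKᵀ/√d)V step inside the LLaMA attention module with a fused flash-attention kernel. The standalone sketch below shows only that substitution, using flash-attn 2.x's `flash_attn_func`; this is an assumption for illustration and not this repo's actual patch code, so check the README's monkeypatch module for the real integration:

```python
# Standalone sketch of what a flash-attention patch substitutes: a fused
# kernel in place of an explicitly materialized softmax(QK^T)V.
# Assumes the flash-attn 2.x API and a CUDA device; not this repo's exact code.
import math
import torch
from flash_attn import flash_attn_func

bsz, seq_len, n_heads, head_dim = 1, 2048, 32, 128
q, k, v = (torch.randn(bsz, seq_len, n_heads, head_dim,
                       dtype=torch.float16, device="cuda") for _ in range(3))

# Standard attention: materializes a (bsz, heads, seq, seq) score matrix.
qh, kh, vh = (t.transpose(1, 2) for t in (q, k, v))
scores = (qh @ kh.transpose(-2, -1)) / math.sqrt(head_dim)
causal_mask = torch.triu(
    torch.ones(seq_len, seq_len, device="cuda", dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
out_ref = (scores.softmax(dim=-1) @ vh).transpose(1, 2)

# Flash attention: same result, computed block-wise without storing the scores.
out_flash = flash_attn_func(q, k, v, causal=True)

print((out_ref - out_flash).abs().max().item())  # small, up to fp16 tolerance
```

The patch itself then just assigns a forward function built around a call like `flash_attn_func(q, k, v, causal=True)` onto the model's attention class before the model is instantiated.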

ehartford commented 1 year ago

Yamashi got it working

https://github.com/yamashi/alpaca_lora_4bit/

donflopez commented 1 year ago

Is a model trained this way available anywhere to use?

ehartford commented 1 year ago

Well, this is what Vicuna does. If you want to see it in action, check out Vicuna's published weights. (It's quite impressive compared to Alpaca.)

ehartford commented 1 year ago

https://github.com/johnsmith0031/alpaca_lora_4bit/pull/59