tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

About training Alpaca on a domain-specific dataset #527

Open ddzipp opened 1 year ago

ddzipp commented 1 year ago

I'm a junior student in software engineering, and first of all, I appreciate your contribution to LLMs! I would like to ask some questions about training and dataset organization. I want to train an LLM that specializes in answering cyber security questions: for example, given an SQL injection statement as input, the model should output a security analysis in a fixed format. To achieve this, how large should the dataset be, and do I need to use self-instruct to generate corresponding QA pairs? Should my self-organized dataset be merged with the original training dataset, or should I train in two sessions? Thanks for your help!
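For the dataset-organization part of the question: alpaca-lora's `finetune.py` trains on a JSON file of records in the Alpaca instruction format (`instruction`/`input`/`output`). A minimal sketch of what one domain-specific record could look like — the cybersecurity content and the `cybersec_data.json` filename below are made-up illustrations, not from this repo:

```python
import json

# One Alpaca-format training record: "instruction" states the task,
# the optional "input" carries the sample to analyze, and "output"
# is the answer you want the model to learn to produce.
# The SQL-injection example here is purely illustrative.
record = {
    "instruction": "Analyze the following SQL statement and report any injection risk.",
    "input": "SELECT * FROM users WHERE name = '' OR '1'='1';",
    "output": "Risk: SQL injection. The tautology '1'='1' makes the WHERE clause always true, bypassing the filter.",
}

# finetune.py's --data_path expects a JSON file containing a list of such records.
with open("cybersec_data.json", "w") as f:
    json.dump([record], f, indent=2)
```

You would then point `--data_path` at this file when launching `finetune.py`.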

yimingz1218 commented 1 year ago

Hello, I am also interested in this question. Have you figured out anything about this? I am mainly interested in fine-tuning a model that can write code to generate tables and graphs.

ddzipp commented 1 year ago

> Hello, I am also interested in this question. Have you figured out anything about this? I am mainly interested in fine-tuning a model that can write code to generate tables and graphs.

Hello, I am glad to share my experience with you! I've already trained my own model, which specializes in answering cyber security questions. We organized a dataset containing 700+ examples with the help of ChatGPT (using self-instruct and some scripts). We trained our model on this dataset, and its performance on cybersecurity questions is quite good. More information about our model and dataset is available in my repository: https://github.com/ddzipp/AutoAudit
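On the original question of merging versus training in two sessions: one common approach is to mix the domain examples into the general instruction data and train once, so the model sees both distributions interleaved. A minimal sketch, assuming both files are Alpaca-format JSON lists (the filenames and the `merge_datasets` helper are hypothetical, not part of alpaca-lora):

```python
import json
import random

def merge_datasets(general_path, domain_path, out_path, seed=42):
    """Concatenate two Alpaca-format JSON datasets and shuffle them,
    so domain examples are spread through the general instructions
    instead of clustered at the end."""
    with open(general_path) as f:
        general = json.load(f)
    with open(domain_path) as f:
        domain = json.load(f)
    merged = general + domain
    random.Random(seed).shuffle(merged)  # fixed seed for reproducibility
    with open(out_path, "w") as f:
        json.dump(merged, f, indent=2)
    return len(merged)
```

The merged file can then be passed to `finetune.py` as a single training set, avoiding the catastrophic-forgetting risk of a second fine-tuning pass on domain data alone.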