Closed: interactivetech closed this issue 1 year ago
This isn't currently an option; see also the related issue https://github.com/jondurbin/airoboros/issues/12
I am, however, working on adding a version of this to this tool. It will be similar to what Meta did with Humpback: generate questions/instructions for which the response is the content you already have.
For example, if you provide a section of code, it would generate an instruction similar to "write a Python script that does [x]", so your content is the target response and the LLM produces the questions/instructions. It won't be perfect, and it will be somewhat limited in scope initially because of how difficult it is to properly segment raw data, but it will at least be something to start with.
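To make the idea above concrete, here is a minimal sketch of that flow. It is not the airoboros implementation, which had not been written at the time of this comment. Everything in it is an assumption for illustration: the names `segment`, `backtranslate`, and `PROMPT_TEMPLATE`, the prompt wording, and the `generate` callable, which stands in for whatever LLM client you use. The segmentation is deliberately naive (blank-line-separated paragraphs), which is exactly the part described above as hard to get right.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical prompt wording; tune it for your own domain and model.
PROMPT_TEMPLATE = (
    "Below is a piece of content. Write the instruction or question a user "
    "could have asked so that this content is the ideal response.\n\n"
    "Content:\n{content}\n\nInstruction:"
)


def segment(text: str, max_chars: int = 2000) -> List[str]:
    """Naively group blank-line-separated paragraphs into chunks of about max_chars."""
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk if adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def backtranslate(
    segments: Iterable[str], generate: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Ask the model for an instruction per segment; the segment is the response."""
    rows = []
    for content in segments:
        instruction = generate(PROMPT_TEMPLATE.format(content=content)).strip()
        if instruction:  # skip segments where the model returned nothing
            rows.append({"instruction": instruction, "response": content})
    return rows


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "Write a Python script that prints a greeting."


if __name__ == "__main__":
    for row in backtranslate(segment("print('hi')"), fake_llm):
        print(row)
```

In practice you would replace `fake_llm` with a real model call and likely add a filtering pass to drop low-quality instruction/response pairs before using them for tuning.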
Thanks for the info! Will close the issue.
Awesome work with airoboros! I am interested in creating datasets for instruction tuning based on domain-specific content (API documentation, Python codebases, PDFs). What would be the best way to provide documents and domain information to create instruction-tuned datasets? Have you seen other codebases or papers that achieve this?