jondurbin / airoboros

Customizable implementation of the self-instruct paper.
Apache License 2.0

[Question] How to create instruction datasets based on domain specific information. #21

Closed by interactivetech 1 year ago

interactivetech commented 1 year ago

Awesome work with airoboros! I am interested in creating datasets for instruction tuning based on domain-specific content (API documentation, Python codebases, PDFs). What would be the best way to provide documents and domain information to create instruction-tuned datasets? Have you seen other codebases or papers that achieve this?

jondurbin commented 1 year ago

This isn't currently an option; see also the related issue: https://github.com/jondurbin/airoboros/issues/12

I am, however, working on adding a version of this to the tool. It will be similar to what Meta did with Humpback: generate questions/instructions for which the response is the content you already have.

For example, if you provide a section of code, it would generate an instruction similar to "write a Python script that does [x]", so your content is the target response and the LLM produces the questions/instructions. It won't be perfect, and it will be somewhat limited in scope initially because of how difficult it is to properly segment raw data, but it will at least be something to start with.
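
To make the idea concrete, here is a rough sketch of that flow, not the airoboros implementation. Everything in it is an assumption for illustration: the prompt wording, the `segment` and `build_dataset` helpers, and the `generate` callable, which stands in for whatever LLM client you use. The segmentation is deliberately naive (blank-line splitting), which is exactly the hard part mentioned above.

```python
from typing import Callable, Dict, List

# Hypothetical prompt wording; not taken from airoboros or the Humpback paper.
PROMPT_TEMPLATE = (
    "Below is a piece of content. Write the instruction or question "
    "for which this content would be the ideal response.\n\n"
    "Content:\n{content}\n\nInstruction:"
)


def segment(text: str, max_chars: int = 2000) -> List[str]:
    """Naively split raw text on blank lines and pack paragraphs into chunks.

    A single paragraph longer than max_chars is kept whole, not split.
    """
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow the chunk; start a new one.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_dataset(
    text: str,
    generate: Callable[[str], str],  # assumed: takes a prompt, returns LLM text
    max_chars: int = 2000,
) -> List[Dict[str, str]]:
    """Pair each chunk (the response) with an LLM-written instruction."""
    rows: List[Dict[str, str]] = []
    for chunk in segment(text, max_chars):
        instruction = generate(PROMPT_TEMPLATE.format(content=chunk)).strip()
        if instruction:  # skip chunks where the model returned nothing usable
            rows.append({"instruction": instruction, "response": chunk})
    return rows
```

You would pass your own function as `generate`, for example a thin wrapper around your LLM client, and then write the resulting rows out as JSONL. A real pipeline would likely also need a filtering step to drop low-quality instruction/response pairs.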

interactivetech commented 1 year ago

Thanks for the info! Will close the issue.