tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

Topic/Category for Instruction #166

Open adithyab94 opened 1 year ago

adithyab94 commented 1 year ago

Hello,

I want to express my appreciation for the amazing dataset. I am curious whether the dataset's creators or anyone else has attempted to classify the instructions into topics (e.g., Science, Programming, Maths, Sports).

This information would be useful for developing a similar dataset for low-resource languages, where even ChatGPT performs poorly when prompted in the language itself. It would also help to examine which areas LLMs struggle with in low-resource languages (for instance, programming prompts are usually not ideal for low-resource languages). Any suggestions on how I could do this task myself would also be greatly appreciated.

Thanks

Example

{
  "instruction": "Provide a CSS code for making all text boxes visible on the page.",
  "input": "",
  "output": "The CSS code for making all text boxes visible on the page is:....",
  "topic": "programming"
}
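
For anyone who wants to try this themselves, here is a minimal sketch of one possible starting point: a plain keyword heuristic that adds a "topic" field to each record, as in the example above. This is not something the dataset's creators provide. The keyword lists, the `classify_topic` and `add_topics` helper names, and the "other" fallback label are all assumptions made up for illustration, and the keywords are English-only, so they would need to be adapted for other languages.

```python
import json
import re

# Hypothetical keyword lists (illustrative assumptions, not part of the dataset).
TOPIC_KEYWORDS = {
    "programming": {"css", "html", "python", "code", "function", "sql", "javascript"},
    "maths": {"equation", "calculate", "integral", "triangle", "probability"},
    "science": {"atom", "cell", "planet", "energy", "photosynthesis", "molecule"},
    "sports": {"football", "soccer", "tennis", "olympics", "basketball", "cricket"},
}


def classify_topic(record, default="other"):
    """Return the topic whose keywords occur most often in instruction + input."""
    text = f"{record.get('instruction', '')} {record.get('input', '')}".lower()
    words = re.findall(r"[a-z]+", text)
    # Count keyword hits per topic; ties go to the topic listed first.
    scores = {
        topic: sum(word in keywords for word in words)
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def add_topics(records):
    """Return copies of the records with a 'topic' field added."""
    return [{**record, "topic": classify_topic(record)} for record in records]


if __name__ == "__main__":
    sample = [
        {
            "instruction": "Provide a CSS code for making all text boxes visible on the page.",
            "input": "",
            "output": "",  # output text left out of this sample
        }
    ]
    print(json.dumps(add_topics(sample), indent=2))
```

To label the whole dataset, the same `add_topics` call could be applied to the list loaded from the repository's JSON data file with `json.load`. A keyword heuristic like this is crude, so spot-checking a sample of the labels by hand would be a sensible first step before relying on them.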
rayrayraykk commented 1 year ago

+1