refuel-ai / autolabel

Label, clean and enrich text datasets with LLMs.
https://docs.refuel.ai/
MIT License

[Feature Request]: Automatic prompt engineering. #449

Open turian opened 11 months ago

turian commented 11 months ago

Is your feature request related to a problem? Please describe.

Prompt engineering can be a bit of a fiddly process. When I find a label error, I go back to ChatGPT, give it the original prompt, watch it produce the wrong answer, interrogate it until it admits it was wrong, and then ask it to rewrite the prompt to make it clearer. This is fiddly and kind of hit-or-miss.

It would be quite cool if there were an automated way to do prompt engineering.

I know there are a handful of ways to attack this; I'll try to describe them and let you evaluate what you think is a good start. I also understand if this is not an immediate priority for the team.

Background

So I have a difficult classification problem. I noticed some pretty obvious errors in the output.

I tried adding chain-of-thought to classification but couldn't get it to work yet.

I also tried converting the problem to QA format, with chain-of-thought. I also noticed some pretty obvious errors. Interestingly, they didn't always overlap with classification errors.

I tried doing some prompt engineering but gave up after a while and just kept the answers where QA and classification matched. I then made a CSV file of "noisy examples" where QA and classification disagreed, to look at more closely later.

[NOTE: I think it would be a useful feature purely in itself to automate this, i.e. a mode where I can provide a classification JSON and a QA JSON and it finds the matches and mismatches automatically; a minimal sketch is below. I can file a separate issue for this.]
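A minimal sketch of what that mode could look like, assuming each run wrote a JSON dict mapping example IDs to labels; the file names and key layout here are illustrative, not autolabel's actual output format:

```python
import csv
import json

# Illustrative file names; assumes each run wrote {example_id: label}.
with open("classification_labels.json") as f:
    clf_labels = json.load(f)
with open("qa_labels.json") as f:
    qa_labels = json.load(f)

matches, mismatches = {}, {}
for example_id in clf_labels.keys() & qa_labels.keys():
    if clf_labels[example_id] == qa_labels[example_id]:
        matches[example_id] = clf_labels[example_id]
    else:
        mismatches[example_id] = (clf_labels[example_id], qa_labels[example_id])

# Keep the agreed labels; dump disagreements to CSV for manual review.
with open("noisy_examples.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["example_id", "classification_label", "qa_label"])
    for example_id, (clf, qa) in mismatches.items():
        writer.writerow([example_id, clf, qa])

print(f"{len(matches)} agreed, {len(mismatches)} disagreed")
```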

BTW, it might technically be considered an entity matching task, because the input actually has two examples, but I didn't really see the benefit or difference from just doing it as classification. Maybe the distinction between the two task types could be made clearer? They seem to function the same.

Describe the solution you'd like.

Let's say you have your starting prompt and your seed set.

You also have a set of "hard" held-out examples for prompt engineering. These can be obtained in a few ways:

1) Do a small run on your test set and manually grab the rows that look wrong.
2) Do QA + classification and find the mismatches, as in the NOTE above.
3) Run setfit, either yourself or as a refuel.ai service, using the seed set for training and the autolabel output on the test set as the evaluation set. (Alternately, do k-fold on the test set to create a validation/test split.) Find the setfit labels that differ the most from the autolabel labels, and treat those as your hard examples; see the sketch after this list.
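A rough sketch of option 3, assuming the pre-1.0 setfit API (SetFitModel/SetFitTrainer); the checkpoint name and toy data are illustrative:

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Toy stand-ins: the seed set (gold labels) and autolabel's predictions on the test set.
seed = Dataset.from_dict({
    "text": ["example text 1", "example text 2"],
    "label": [0, 1],
})
test_texts = ["unlabeled text 1", "unlabeled text 2"]
autolabel_preds = [0, 1]  # labels autolabel assigned to test_texts

# Few-shot classifier trained on the seed set.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
trainer = SetFitTrainer(model=model, train_dataset=seed)
trainer.train()

# Wherever setfit and autolabel disagree, treat the example as "hard".
setfit_preds = list(model.predict(test_texts))
hard_examples = [
    (text, auto, fit)
    for text, auto, fit in zip(test_texts, autolabel_preds, setfit_preds)
    if auto != fit
]
```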

You may or may not have the correct labels for the hard examples. If you have 30 or 50 hard examples, it's probably worth the experimenter's time to go and assign correct gold labels to them.

Anyway, it would be great to have a prompt engineering mode that takes the original prompt and seed set and picks a random hard example. It then does a handful of follow-on queries: "Are you really sure? Why? Could it be possible that the right answer is actually [a different answer]?" When GPT finally admits it was wrong, it asks it to explain why, then to rewrite the prompt in such a way that it would have gotten the correct answer, or to suggest several alternate new prompts.

It then iterates until it has a much better prompt that would hopefully cover most of the hard examples.
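A rough sketch of what that loop could look like, assuming the OpenAI chat completions client; the interrogation wording, helper names, and fixed round count are all illustrative, and a real version would want to verify the model actually changed its answer rather than just capitulating:

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4"    # illustrative model choice

def chat(messages):
    response = client.chat.completions.create(model=MODEL, messages=messages)
    return response.choices[0].message.content

def refine_prompt(prompt, hard_examples, rounds=3):
    """Interrogate the model on random hard examples, then ask it to rewrite the prompt.

    hard_examples: list of (input_text, gold_label) pairs.
    """
    for _ in range(rounds):
        text, gold = random.choice(hard_examples)
        messages = [{"role": "user", "content": f"{prompt}\n\nInput: {text}"}]
        messages.append({"role": "assistant", "content": chat(messages)})

        # Follow-on interrogation: push back and float the gold answer.
        messages.append({"role": "user", "content": (
            f"Are you really sure? Why? Could it be possible that the right "
            f"answer is actually {gold!r}? If you were wrong, explain why.")})
        messages.append({"role": "assistant", "content": chat(messages)})

        # Ask for a rewritten prompt that would have produced the correct answer.
        messages.append({"role": "user", "content": (
            f"Rewrite the original prompt so that it would have produced "
            f"{gold!r} for this input. Return only the new prompt.")})
        prompt = chat(messages)
    return prompt
```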

Additional context

gpt-prompt-engineer was announced on HN today and seems quite useful. It generates candidate prompts and ranks them with ELO ratings from pairwise comparisons. It's not 100% clear this is the best approach, but it's quite cool and I hadn't seen anything like it before.
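For reference, the core of an ELO-style ranking is just the standard update rule applied to pairwise prompt comparisons (the LLM-as-judge step that decides the winner is elided here):

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """Standard ELO update after one pairwise comparison between two prompts."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```

Start every candidate prompt at, say, 1200, have the LLM judge pairs of outputs on the same inputs, and update the ratings after each comparison.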

Anyway, an automated approach to prompt engineering would be great: improve the clarity of the prompt and reduce errors in the annotations, with an eye towards fixing the hard examples that would get better labels if the prompt were better.

rishabh-bhargava commented 11 months ago

Another paper with helpful information, via @turian: https://arxiv.org/pdf/2010.15980.pdf (AutoPrompt)