mingkaid / rl-prompt

Accompanying repo for the RLPrompt paper

A question about how to judge the performance of the prompts after running the "run_fsc.py" file #12

Closed jasonyin718 closed 1 year ago

jasonyin718 commented 1 year ago

Dear Author: I am very interested in your research in the article RLPrompt: Optimizing Discrete Text Prompts With Reinforcement Learning. After carefully studying your code, I have a question I would like to ask you. After running "run_fsc.py", I obtain a lot of JSON files containing 'output_tokens' and 'scores'. How should I judge the performance of the prompts? Should I run "run_eval.py" once for each prompt in all the JSON files and read the performance off the accuracy, or can I simply assume that the output_tokens with higher scores perform better? Looking forward to your reply!

mingkaid commented 1 year ago

Thanks for reaching out. When you first ran the script, were you prompted to set up your wandb account? Our code logs the training progress, e.g., accuracy, through this website by default.

If you are more comfortable with other trackers like TensorBoard, you can modify the code in the Trainer class to log to your preferred tracker.
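As a rough illustration of that swap, here is a minimal sketch using `torch.utils.tensorboard`. The `log_metrics` hook and the metric names are assumptions, not the repo's actual API; the real change would go wherever the Trainer currently calls `wandb.log(...)`:

```python
# Minimal sketch: log scalars to TensorBoard instead of wandb.
# `log_metrics` is a hypothetical hook; place the writes wherever the
# Trainer currently calls wandb.log(...).
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/rlprompt-fsc")

def log_metrics(metrics: dict, step: int) -> None:
    """Write each metric as a TensorBoard scalar at the given training step."""
    for name, value in metrics.items():
        writer.add_scalar(name, value, global_step=step)

# Example usage inside the training loop (metric names are assumptions):
# log_metrics({"train/reward": reward.mean().item(), "train/acc": acc}, step)
```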

If you don't want to use any tracker at all, you can also simply configure the PromptedClassificationReward class to directly print any quantity you want, but it'll be more difficult to visualize and spot trends.

Let me know if this makes sense. I'm closing this issue because it's a clarification question.

MM-IR commented 1 year ago

That is also an interesting question! If you want to pick some prompts, it is fine to pick the ones that show decent performance on the 16-shot dev set and the 16-shot training set. I usually prefer to pick prompts with high reward values on the training set; the reward scores are more informative than plain accuracy scores, since accuracy can be biased toward the few-shot examples.

jasonyin718 commented 1 year ago

Yes, I agree with you, and I will follow your suggestion to select the prompts, then run them on the test set to see how they perform.
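For reference, here is the rough ranking script I plan to use to pick the top prompts from the output JSON files before passing them to "run_eval.py". The keys 'output_tokens' and 'scores', and the output directory name, are assumptions based on the files I see, so the accessor lines may need adjusting to the actual schema:

```python
# Sketch: rank logged prompts by their mean reward.
# Assumed (not official) schema: each JSON file holds "output_tokens"
# (the prompt tokens) and "scores" (a float or list of floats).
import json
from pathlib import Path
from statistics import mean

def load_prompts(output_dir: str):
    """Collect (mean_score, prompt) pairs from every JSON file in output_dir."""
    ranked = []
    for path in Path(output_dir).glob("*.json"):
        record = json.loads(path.read_text())
        prompt = " ".join(record["output_tokens"])   # assumed key
        scores = record["scores"]                    # assumed key
        score = mean(scores) if isinstance(scores, list) else float(scores)
        ranked.append((score, prompt))
    return sorted(ranked, reverse=True)

if __name__ == "__main__":
    # Print the five highest-reward prompts; these would then be
    # evaluated on the test set with run_eval.py.
    for score, prompt in load_prompts("outputs")[:5]:
        print(f"{score:.3f}\t{prompt}")
```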