skywalker023 / fantom

👻 Code and benchmark for our EMNLP 2023 paper - "FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions"
https://aclanthology.org/2023.emnlp-main.890/
MIT License

Did you try few-shot prompting GPT-4? #2

Open lukasberglund opened 7 months ago

lukasberglund commented 7 months ago

Hi! I enjoyed reading your paper. I also appreciate that you provided all your code. I suspect that GPT-4 would do a lot better at some of the questions (for example the accessibility questions) if you gave it a few-shot prompt (e.g. a five shot prompt). Did you try this out at all? If so, how well did models do?
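For concreteness, a minimal sketch of the five-shot setup suggested above: solved examples are concatenated before the target question so the model sees the expected answer format. The helper name and the example fields (`conversation`, `question`, `answer`) are hypothetical, not FANToM's actual data schema.

```python
def build_few_shot_prompt(examples, target):
    """Concatenate solved example QA items before the target question.

    `examples` and `target` are dicts with hypothetical keys
    'conversation', 'question', and (for examples) 'answer'.
    """
    parts = []
    for ex in examples:
        parts.append(
            f"Conversation:\n{ex['conversation']}\n"
            f"Question: {ex['question']}\n"
            f"Answer: {ex['answer']}\n"
        )
    # Leave the final answer blank for the model to complete.
    parts.append(
        f"Conversation:\n{target['conversation']}\n"
        f"Question: {target['question']}\n"
        f"Answer:"
    )
    return "\n".join(parts)
```

The resulting string would then be sent as a single user message to the model being evaluated.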

skywalker023 commented 7 months ago

Hello, thanks for your interest in our paper, and great question! Indeed, GPT-4 will do a lot better if you give it few-shot examples. However, providing few-shot examples for theory-of-mind questions encourages the model to rely directly on lower-level processes (e.g., shortcut pattern matching). This violates the "mentalizing" criterion for ToM validation that we discuss in our paper. Hope this helps!