tfriedel opened 1 year ago
Awesome, I was thinking of something similar: a meta-model that tries to work out which solution is correct.
For example, on yesterday's puzzle it mostly generated two answers, 1619 and 1623, because of the off-by-4 error that many humans also made.
1623 was correct, but it submitted 1619 first since that answer was slightly more popular.
If we asked it "which of these two solutions is correct?", giving it the Python from each of the groups that generated the respective results, would it have been able to tell which was right? Or, as you suggest, give it an answer we know is wrong and ask it to correct itself?
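For reference, the "slightly more popular" step is just a majority vote over the sampled answers. A minimal sketch of that baseline, where `sample_answer` is a hypothetical callback standing in for one generate-and-run round:

```python
from collections import Counter

def majority_answer(sample_answer, n_samples=10):
    """Run the generate-and-execute pipeline n times and tally the answers.

    sample_answer is a hypothetical callback that generates one candidate
    program, runs it on the puzzle input, and returns its answer (or None
    if the program crashed).
    """
    answers = [sample_answer() for _ in range(n_samples)]
    tally = Counter(a for a in answers if a is not None)
    # e.g. Counter({1619: 6, 1623: 4}) -> submits 1619 first, like the bot did
    return tally.most_common(1)[0][0] if tally else None
```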
One implementation issue: there's no ChatGPT API, only one for the older davinci model. So a) people should explore ChatGPT manually, and b) it may still be possible with the davinci model. I'd be very interested to see any approaches there (I probably won't get to it myself in the next couple of days but may have some time at the weekend; an opportunity to leapfrog me on the leaderboard for anyone keen!)
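If anyone wants to experiment, here's a rough, untested sketch of the "which of these two solutions is correct?" question against the davinci completions API. The model name, the prompt wording, and the candidate_a/candidate_b inputs are all assumptions for illustration, not a tested recipe:

```python
import os
import openai  # completions API client; pip install openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def judge_solutions(puzzle_text, candidate_a, candidate_b):
    """Ask the model which of two candidate programs is more likely correct.

    candidate_a / candidate_b would be the Python sources behind the two
    most popular answers (e.g. the 1619 vs 1623 groups).
    """
    prompt = (
        "Two Python programs attempt to solve the following puzzle.\n\n"
        f"Puzzle:\n{puzzle_text}\n\n"
        f"Program A:\n{candidate_a}\n\n"
        f"Program B:\n{candidate_b}\n\n"
        "Think step by step about which program is correct, "
        "then answer with exactly 'A' or 'B' on the final line."
    )
    resp = openai.Completion.create(
        model="text-davinci-003",  # assumed; the newest completions model
        prompt=prompt,
        max_tokens=512,
        temperature=0,
    )
    last_line = resp["choices"][0]["text"].strip().splitlines()[-1]
    return "A" if last_line.strip().startswith("A") else "B"
```

No idea whether step-by-step judging actually beats the raw vote, but it's cheap to try.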
There are papers about this CoT (Chain of Thought) procedure, i.e. telling the model to solve something step by step, which gives better results. I think Andrej Karpathy hit the nail on the head with this tweet:
The deepest unintuitive disconnect w.r.t. psychology of ChatGPT is that it doesn't get "time to think". It has a small, fixed amount of thought for each output token. A bit like human forced to speak very fast. Asking them to produce more text is giving them more time to think.
So this approach of letting it look over its result again and try to find and fix errors is a way of giving it more time to think. Not all prompts work equally well, though. I noticed that just asking "are there any errors?" doesn't work nearly as well as "find all the errors" (i.e. already presupposing that there are some).
For your example, letting it rank solutions by highest chance of working could work, but there's no way of knowing until we try!
Have you tried getting it to do print debugging?
https://gist.github.com/geoffreylitt/6094cdb1efdb990af3da717d7d634065
That's a very good idea @wodin! Thanks for the link, super interesting.
One inconvenient implementation detail about what's possible right now: the GPT model that has an API comes with a ~4,000 token limit, which we're already bumping up against on the longer puzzles. ChatGPT (which doesn't have an API) is optimized for longer conversations, so it doesn't have the same strict limit.
So it might be difficult to implement something like this for this year; we'd have to find a way of telling it not to print too many debug messages...
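One crude workaround would be to truncate the captured debug output ourselves before feeding it back. The sketch below uses a rough 4-characters-per-token estimate instead of a real tokenizer, and assumes the candidate script takes the puzzle input path as an argument; both are just illustrative assumptions:

```python
import subprocess

def run_and_truncate(script_path, input_path, max_tokens=1000):
    """Run a candidate solution, capture its stdout, and keep only the
    most recent lines that fit a rough token budget."""
    out = subprocess.run(
        ["python", script_path, input_path],
        capture_output=True, text=True, timeout=30,
    ).stdout
    budget = max_tokens * 4  # crude ~4 chars/token estimate
    kept, size = [], 0
    # Keep the tail of the output: the last debug prints before a crash
    # or a wrong answer are usually the most informative ones.
    for line in reversed(out.splitlines()):
        size += len(line) + 1
        if size > budget:
            break
        kept.append(line)
    return "\n".join(reversed(kept))
```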
FWIW, I've been finding in the past few days that the davinci model has been really struggling (maybe good news for humans!). It does well on past years' puzzles, including days well beyond the current one. But either it had knowledge of those solutions from its training data, or the puzzles have been successfully adjusted to account for AI.
The bot is having some trouble with this one 😂
This one seems a bit like a CAPTCHA, so maybe they did do that to thwart bots.
When I used ChatGPT on today's puzzle, its first attempt was buggy.
I then responded with: "Sorry, this is incorrect. Can you find the errors in your solution?"
It found and fixed the bugs. It may be better to use a multi-step approach like this rather than starting from scratch every time.
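Since there's no ChatGPT API yet, here's roughly what that multi-step loop might look like against the davinci completions endpoint instead. The check_answer callback and the prompt wording are placeholders, and it uses the assertive "find all the errors" phrasing discussed above; a sketch, not something I've run:

```python
import openai  # pip install openai

def self_repair(puzzle_text, check_answer, max_rounds=3):
    """Generate a solution, and on failure ask the model to find and fix
    its own errors instead of starting from scratch.

    check_answer is a hypothetical callback that runs the generated code
    on the puzzle input and returns True if the answer is accepted.
    """
    prompt = f"Solve this puzzle with a Python program.\n\nPuzzle:\n{puzzle_text}\n\nPython:\n"
    for _ in range(max_rounds):
        resp = openai.Completion.create(
            model="text-davinci-003",  # assumed completions model
            prompt=prompt,
            max_tokens=1024,
            temperature=0.2,
        )
        code = resp["choices"][0]["text"]
        if check_answer(code):
            return code
        # Presuppose that errors exist ("find all the errors") rather than
        # asking whether there are any. Note the transcript grows each round,
        # so this will hit the ~4,000 token limit after a few iterations.
        prompt += code + (
            "\n\nThat solution is incorrect. Find all the errors and "
            "write a fixed Python program.\n\nPython:\n"
        )
    return None
```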