openai / evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Idea for Evals: improve abstract logic abilities #848

Open karinageneraly opened 1 year ago

karinageneraly commented 1 year ago

Describe the feature or improvement you're requesting

I think I found a way to improve GPT-4's understanding of abstract logic and analogies. I ran a lot of different logical reasoning tests, and in general GPT-4 solves most of them very well. But it often fails at retrieving certain kinds of abstract logical relationships. On such tasks GPT-4 often fails even when I ask it to provide details and think step by step, and even when the task is very short.

Here is a typical task:

You have a pair of words and you need to find the closest "analogue" from a given list of word pairs. For example, if you have the pair "Physics - Science" and the list:

"Light - dark"
"Raspberry - berry"
"Sea - ocean"
"A sheep - a flock"

you should choose the "Raspberry - berry" pair, because a raspberry is a subset or a type of berry, just as Physics is a subset or a branch of Science. You should probably mentally iterate over all the other possible answers to be sure you selected the correct one.

A type of task where GPT-4 always fails:

system: You have to determine the relationship between the words in the pair, then find the 'analogue', that is, choose among the options 'cipher' - a pair of words with the same logical connection. In your answer, specify only the analogy pair of words with the same logical connection, without explanations and extra text.

user: A pair to determine the relationship: 'Country - city' Potential analogues: Light - dark Enemy - foe raspberry - berry Sea - ocean. A sheep - a flock.

The correct answer here is "Sea - ocean", but GPT-4 usually answers with "A sheep - a flock" or "raspberry - berry". The difference lies in the fact that a country does not consist only of cities, and there are other logical relations by which "Sea - ocean" is the closer analogue. I can explain in more depth if needed; the tests are based on the Korobkova method and have a scientific background. GPT-4 is usually unable to correctly recognize these relation types: "A sheep - a flock", "Sea - ocean", "Enemy - foe", and sometimes "Light - dark" and "raspberry - berry". You can check more examples in PR #806.
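For reference, a case like this can be written as a sample for the repo's basic Match eval. This is only a minimal sketch, assuming the standard `input`/`ideal` JSONL sample format; the file name below is a placeholder, and the actual samples are in the PR:

```python
import json

# One hypothetical sample for a Match-style eval; "analogies_samples.jsonl" is a placeholder name.
sample = {
    "input": [
        {
            "role": "system",
            "content": (
                "You have to determine the relationship between the words in the pair, "
                "then find the 'analogue', that is, choose among the options 'cipher' - "
                "a pair of words with the same logical connection. In your answer, specify "
                "only the analogy pair of words with the same logical connection, "
                "without explanations and extra text."
            ),
        },
        {
            "role": "user",
            "content": (
                "A pair to determine the relationship: 'Country - city' "
                "Potential analogues: Light - dark Enemy - foe raspberry - berry "
                "Sea - ocean. A sheep - a flock."
            ),
        },
    ],
    "ideal": "Sea - ocean",
}

# Append the sample as one JSONL line.
with open("analogies_samples.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```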

Value

I'm quite sure that if GPT-4 is able to learn this, it will understand context and the relations between notions more deeply. It will better understand abstract logic and analogies. The impact of this improvement is abstract, and it would improve GPT-4 in many fields (if it is possible to learn from such examples).

Additional context

I have already implemented it here, but I think GPT will need more examples to improve its abstract logic abilities. Is it worth the effort?

qrdlgit commented 1 year ago

I think part of the issue above is that it's not the best example, as 'berry' is a shorter form of the word 'raspberry' and 'city' is a shorter word than 'country'. You didn't specify what the logical connection was and left it open to interpretation due to the ambiguity. Also, you don't explain that order is important. TBH, 'sheep - flock' is a better answer than 'sea - ocean', as a sheep is a member of a flock, like a city is a member of a country - IMHO.

I've found GPT-4 frightfully good at logic, though. You may struggle to come up with good examples here.

Most of the issues I've seen where it fails (and that got merged) seem to be things like decoding/encoding and spelling. It weirdly hallucinates a lot with spelling, especially when different languages are involved. Multi-step logic is probably fine too, though when it fails there I think it's more an issue of decoding than of the logic itself.

Try this prompt:

You have to determine the relationship between the meanings of the words in the pair, then find the 'analogue', that is, choose among the options 'cipher' - a pair of words with the same logical connection in the same order. In your answer, specify only the analogy pair of words with the same logical connection, without explanations and extra text.

user: A pair to determine the relationship: 'Country - city' Potential analogues: Light - dark Enemy - foe raspberry - berry Sea - ocean. A sheep - a flock.
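If you want to try prompt variants like this quickly outside the evals harness, something like the following works; a minimal sketch, assuming the official `openai` Python client and the generic `gpt-4` model name:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

system_prompt = (
    "You have to determine the relationship between the meanings of the words in the pair, "
    "then find the 'analogue', that is, choose among the options 'cipher' - a pair of words "
    "with the same logical connection in the same order. In your answer, specify only the "
    "analogy pair of words with the same logical connection, without explanations and extra text."
)
user_prompt = (
    "A pair to determine the relationship: 'Country - city' "
    "Potential analogues: Light - dark Enemy - foe raspberry - berry Sea - ocean. A sheep - a flock."
)

# Temperature 0 keeps the comparison between prompt variants as deterministic as possible.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```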

qrdlgit commented 1 year ago

Also, make sure you 'debug' your question. Ask GPT-4 why it might answer in a particular way. You may have to rephrase your question about why it hallucinated a bunch of different times; it can be tricky to get GPT-4 to admit to anything other than just 'I was wrong.' But usually you can get it to make up some excuse which can be used to fine-tune your question.

eg:

why did you pick raspberry - berry above - what are some possible reasonings?

I picked "raspberry - berry" because it has a similar logical relationship to the original pair "Country - city." In both pairs, the second word is a specific example or subset of the first word. A city is a specific example of a country, and a raspberry is a specific example of a berry. The other options didn't have this particular relationship between the two words.

..

I sort of see where it might be going with that, especially since you don't really have a sea inside of an ocean; they're just two different words to describe a body of water.
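In API terms, that kind of 'debugging' is just a multi-turn exchange: append the model's answer and a follow-up 'why' question to the same message list and ask again. A rough sketch, again assuming the `openai` Python client; the shortened prompts here are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Shortened placeholder prompts; use the full analogy prompt in practice.
messages = [
    {"role": "system", "content": "Answer with the analogue pair only, no explanations."},
    {"role": "user", "content": "Pair: 'Country - city'. Options: Light - dark. Enemy - foe. "
                                "raspberry - berry. Sea - ocean. A sheep - a flock."},
]

first = client.chat.completions.create(model="gpt-4", messages=messages, temperature=0)
answer = first.choices[0].message.content

# Keep the original answer in context and ask the model to explain its choice.
messages += [
    {"role": "assistant", "content": answer},
    {"role": "user", "content": f"Why did you pick {answer!r}? What are some possible reasonings?"},
]
second = client.chat.completions.create(model="gpt-4", messages=messages, temperature=0)
print(second.choices[0].message.content)
```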

Try adding "Ocean - island." to your list. It will get that one every time.

karinageneraly commented 1 year ago

@qrdlgit and @Ein-Tim thank you for your reply and interest!

You didn't specify what the logical connection was and left it open to interpretation due to the ambiguity

That was intentional. The whole idea of this task is to test whether the tested person/AI can find the closest analogy. For simple cases GPT can do it well. So if we want to help GPT learn something new, we need to test tricky examples that still have a definite answer: even though all the categories might fit to some degree, one of them fits much better than the others, and we can prove it.

Let me explain in more detail.

A pair to determine the relationship: 'Country - city'

Potential analogues:

  1. Light - dark.
  2. Enemy - foe.
  3. raspberry - berry.
  4. Sea - ocean.
  5. A sheep - a flock.

A city is part of a country in the part-whole (territorial) sense. A sea is, by definition, part of an ocean, so it carries the same part-whole relation, while a flock is a group of sheep (member-group) and a raspberry is a kind of berry (type-category), which are different relations. I hope this makes it clear why 'Sea - ocean' is a much better fit than 'A sheep - a flock' in this case. You can also check the Korobkova method; this test is taken from her works. Unfortunately, I didn't find it translated into English. Just to remind you, GPT usually answers with "raspberry - berry" (I will describe its logic below in my answer). I could even argue that "Physics - Science" can be mapped onto several of these categories, yet only one category fits best. And understanding context, analogies, and abstract logic better is part of RLHF.

Debugging the same example 'Country - city' with GPT-4

I updated the prompt with your proposal and asked for a detailed answer:

prompt: You have to determine the relationship between the meanings of the words in the pair, then find the 'analogue', that is, choose among the options 'cipher' - a pair of words with the same logical connection in the same order. In your answer, specify only the analogy pair of words with the same logical connection, please think step by step in detail.

user: A pair to determine the relationship: 'Country - city' Potential analogues: Light - dark Enemy - foe raspberry - berry Sea - ocean. A sheep - a flock.

For "Sea - ocean" it says:

Seas are smaller than oceans, but an ocean doesn't contain a sea within it.

The funny thing is that when I ask in a separate chat, "Does an ocean contain a sea within it?", it answers yes, with a detailed explanation, examples, and proofs. And if I point out this error in the current conversation with the same question, GPT also corrects its answer. It is a correct fact that a sea is part of an ocean; you can check it. You propose testing 'island' instead of 'sea', but a sea is more clearly a part of an ocean than an island is. I tested 'island - ocean', and it works better for 'Country - city' but fails more on other relations that are not geographically related. I think it's better to give general examples that GPT can learn from, instead of searching for tests where it answers well because we help it with hints by adding more context.

For "A sheep - a flock" it says:

A sheep - a flock: A flock contains multiple sheep, so the relationship is similar to a larger area containing a smaller area.

For "raspberry - berry" it says:

"raspberry - berry": Raspberry is a type of berry, so the relationship is the same as a smaller area being part of a larger area. The most appropriate analogue is: raspberry - berry

This logic sounds like it belongs to "Sea - ocean", doesn't it? Why does it map it to "raspberry - berry" even in the detailed-explanation case? I can discuss this case with GPT and help it find its mistakes and the correct answer, but for the tests I wrote I always have to provide a lot of significant hints to help GPT find the correct answer. So it looks like GPT has something to learn.

Try this prompt:

Your prompt makes the response less reasonable: it answers with "raspberry - berry", which is worse than with my usual prompt, both when I ask for a detailed explanation and without it. But let me test more. I have tried changing the prompt many times; the current one is the best I have found that gives good answers, and we can check the reply via the "Match" metric.

I spent a lot of time optimizing the tests to exclude ambiguity as much as possible, so it is possible to argue with high certainty which answer is the best among the options. If you find some problematic tests, please show me; I will be happy to improve them.
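To be clear about what checking via "Match" means here: it boils down to comparing the model's reply against the sample's ideal answer. A rough illustration only, not the repo's actual Match implementation:

```python
def rough_match(sampled: str, ideal: str) -> bool:
    """Illustrative stand-in for a Match-style check: after trivial
    whitespace/case normalization, does the model's reply contain the
    ideal answer? (Not the repo's actual implementation.)"""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(ideal) in normalize(sampled)

print(rough_match("Sea - ocean.", "Sea - ocean"))       # True
print(rough_match("A sheep - a flock", "Sea - ocean"))  # False
```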

qrdlgit commented 1 year ago

You are correct, my prompt addition didn't help. I think adding 'Ocean - island' is probably a more appropriate resolution here.

As an aside, I just got out my intro to symbolic logic and threw some of the trickiest propositional logic proofs I could find in it at GPT-4. It was able to answer all of them, no explanation needed. Geez.

karinageneraly commented 1 year ago

Yep, I also found that GPT-4 is extremely good at logic tasks. The abstract logic in my evals is hard to formalize; that's probably why it struggles with it. And actually, it handles most of those cases well too.

Regarding 'Ocean - island': the problem is that it's hard to say whether an island is part of an ocean; from common context it's not clear. But a sea is part of an ocean by definition, so I still think it's better. Also, I tested 'Ocean - island' a bit, and the model started to fail on other tests where it didn't fail before.

Anyway, @qrdlgit and @Ein-Tim, thank you for your attention. I hope someone from the maintainers will answer whether it is worth putting more effort into improving the tests, or will review the PR.