Idea for Evals: improve abstract logic abilities

Describe the feature or improvement you're requesting

I think I found a way how to improve abstract logic and analogies understanding by GPT4. I run a lot of different tests for logical reasoning and I think at general GPT4 solves most of them very well. But it often fails on some kinds of abstract logic relationships retrieval. For such tasks GPT4 often fails even when I ask to provide details and think step by step, even when it's a very short task.

Here is a typical task:

You have a pair of words and you need to find closest "analogue" from the given list of words. For example if you have a pair "Physics-Science" and have a list: "Light - dark" "Raspberry - berry" "Sea - ocean" "A sheep - a flock" You should choose "Raspberry - berry" pair, because Raspberry is a subset or a type of berry, just as Physics is a subset or a branch of Science. You probably should mentally iterate all other possible answers to be sure you selected the correct one.

Type of tasks where GPT4 always fails:

system: You have to determine the relationship between the words in the pair, then find the 'analogue', that is, choose among the options 'cipher' - a pair of words with the same logical connection. In your answer, specify only the analogy pair of words with the same logical connection, without explanations and extra text. user: A pair to determine the relationship: 'Country - city' Potential analogues: Light - dark Enemy - foe raspberry - berry Sea - ocean. A sheep - a flock. The correct answer here is "Sea - ocean", but GPT4 usually answers with "A sheep - a flock" or "raspberry - berry". The difference here in the fact that Country not consist only of cities. Also there are other logical relations based on which sea-ocean is closer. I can explain deeper if needed, tests based on Korobkova method, and has a scientific background. GPT4 usually not able to correctly recognize these relation types: "A sheep - a flock", "Sea - ocean", "Enemy-foe", and sometimes "Light - dark" and "raspberry - berry". You can check more examples here: PR:806

Value

I'm quite sure if GPT4 will be able to learn this, it will deeper understand context and relations between notions. It will better understand abstract logic and analogies. The impact of this improvement is abstract and will improve the GPT4 in many fields (if it's possible to learn from such examples).

Additional context

I have already implemented it here, but I think gpt will need more examples to improve abstract logic abilities. Is it worth efforts?

I think a part of the issue above is that it's not the best example, as berries are a shorter form of the word raspberry and city is a shorter word than country. You didn't specify what the logical connection was and left it open to interpretation due to the ambiguity. Also, you don't explain that order is important. TBH, 'sheep flock' is a better answer than sea-ocean as a sheep is a member of a flock, like a city is a member of a country - IMHO.

I've found GPT4 frightfully good at logic though. You may struggle to come up with good examples here.

Most of the issues I've seen where it fails and merged seem to be things like decoding/encoding and spelling. it weirdly hallucinates a lot with spelling, especially when different languages are involved. Multi step logic is probably good too, though I think it might be more an issue of decoding when it fails rather than the logic itself.

Try this prompt:

You have to determine the relationship between the meanings of the words in the pair, then find the 'analogue', that is, choose among the options 'cipher' - a pair of words with the same logical connection in the same order. In your answer, specify only the analogy pair of words with the same logical connection, without explanations and extra text. user: A pair to determine the relationship: 'Country - city' Potential analogues: Light - dark Enemy - foe raspberry - berry Sea - ocean. A sheep - a flock.

Also, make sure you 'debug' your question. Ask GPT4 why it might answer in a particular way. You may have to rephrase your question about why it hallucinated a bunch of different times, it can be tricky to get GPT4 to admit to anything other than just 'I was wrong.' But usually you can get it to make up some excuse which can be used to fine tune your question.

eg:

why did you pick raspberry - berry above - what are some possible reasonings?

I picked "raspberry - berry" because it has a similar logical relationship to the original pair "Country - city." In both pairs, the second word is a specific example or subset of the first word. A city is a specific example of a country, and a raspberry is a specific example of a berry. The other options didn't have this particular relationship between the two words.

I sort of see where it might be going with that, especially since you don't really have a sea inside of an ocean, they're just two different words to describe a body of water.

Try adding "Ocean - island." to your list. It will get that one everytime.

@qrdlgit and @Ein-Tim thank you for your reply and interest!

You didn't specify what the logical connection was and left it open to interpretation due to the ambiguity

And it was intentional. The actual idea of this task is to test does the tested person/ai can find closest analogy. And for simple cases gpt can do it well. So if we want to help gpt learn something new, we need to test tricky examples, but they also have definite answer, despite the fact that all categories might fit, one of categories fits much better than others, and we can prove it.

Let me explain deeper.

A pair to determine the relationship: 'Country - city' Potential analogues:

Light - dark.

This relation represents antonyms. It's hard to find a logic in which 'Country - city' will be antonyms. We can use something like "Country" is big compared to city, which is small, but it's a very far analogy.

Enemy - foe.

This relation represents synonyms. 'Country - city' might be considered as synonyms by the fact that both words represent the place where people of same country live. But we can say that it's a very far analogy, which needs extra context to apply.
'Country' might also be considered as a village, so country-city becomes more synonymous, but it's a part of this test, to retrieve main sense and relation. So when you tell to the person "Country - city", he would say that it's more likely that country meant as a political and geography entity rather than village, it's a part of RLHF.

raspberry - berry.

This relation represents a type and it's subtype, it's important that in such pair they represent more abstract types than real world items (until we bring new context). City might be considered as small country, and in that case we can say it's a subtype, but it's a very specific case, so we need extra context to apply this analogy, so it's also far analogy.
'raspberry - berry' also represents that word might be a substring of another word with close meaning (and it's good to keep it to extend variations between which gpt should choose). This logic not applicable for country-city.
We can find more relations here, but they will be further along in meaning in current context.

Sea - ocean.

This relation represents the part of something bigger and big itself, but the big not consist only of these parts. It's important that in general case the sea is not a subtype of ocean, that's the difference with 'raspberry - berry'. City might be considered as a part of country (which is bigger), and country consist not only of cities (but it depends on context, in general context we would say like there are other things in country except cities). So this logic perfectly fits and don't require extra context.
More detailed relation of 'Sea - ocean': relation of some amount of water on Earth and bigger amount of water on Earth with some borders. We don't consider it, because previous one is better.
We can find more relations here, but they will be further along in meaning in current context.

A sheep - a flock.

This relation represents how multiple instances of single thing can create new notion. Multiple cities might form a country, but there is needed something else except cities to form a country - territory, people, etc. (depends on context). Country consist not only of cities, so this relation is more far then 'sea-ocean'. It's even hard to find context, in which we can exclude everything except cities considering what country consist of.
More detailed relation: relation of how single animal called and how they are called when they grouped. We don't consider it, because previous one is better.
More abstract relation: how separated entities can be a part of something bigger. This relation is too abstract and far from a given pair "A sheep - a flock", we excluded important property that flock consist only of sheep to get there, and now it become more "sea - ocean" relation. But "sea - ocean" doesn't require extra abstraction steps like we did here, so "sea - ocean" is closer.
We can find more relations here, but they will be further along in meaning in current context.

I hope I proved well why the 'sea - ocean' is much better than 'A sheep - a flock' in such case. You can also check Korobkova method, this test is taken from her works. Unfortunately I didn't found it translated to English. Just to remind, GPT usually answers with "raspberry - berry" (I will describe it's logic below in my answer). I can even proof that "Physics-Science" might be applied to multiple categories of these, but there is only one best suiting category. And it's a part of RLHF to understand context, analogies and abstract logic better.

Debugging same example 'Country - city' with GPT4

I updated prompt with your proposal and asked to be detailed: prompt: You have to determine the relationship between the meanings of the words in the pair, then find the 'analogue', that is, choose among the options 'cipher' - a pair of words with the same logical connection in the same order. In your answer, specify only the analogy pair of words with the same logical connection, please think step by steps in details. user: A pair to determine the relationship: 'Country - city' Potential analogues: Light - dark Enemy - foe raspberry - berry Sea - ocean. A sheep - a flock.

For "Sea - ocean" it says:

Seas are smaller than oceans, but an ocean doesn't contain a sea within it.

The funny fact is that when I ask in separate chat "Does ocean contain sea within it?" it answers yes with detailed explanation, examples and proofs. And if I point to this error in current conversation with same question gpt also fixes answer. And it's a correct fact, that sea is a part of ocean, you can check it. You propose to test 'island' instead of sea, but sea is more part of ocean than island. I tested 'island-ocean' and it works better for 'country-city', but fails more for other relations which are not geographically related. I think it's better to give general examples where gpt can learn, instead of trying to find tests where it answers good because we help with hints by bringing more context.

For "A sheep - a flock" it says:

A sheep - a flock: A flock contains multiple sheep, so the relationship is similar to a larger area containing a smaller area.

For raspberry - berry is says:

"raspberry - berry": Raspberry is a type of berry, so the relationship is the same as a smaller area being part of a larger area. The most appropriate analogue is: raspberry - berry

This logic looks belonging to sea-ocean, isn't it? Why it maps it to "raspberry - berry" even in detailed explanation case? I can discuss with gpt this case and help to find mistakes and correct answer, but for tests I written I always need to provide a lot of significant hints to help gpt to find correct answer. So it looks like gpt has something to learn.

Try this prompt:

Your prompt makes response less reasonable, it answers with "Raspberry - berry" and it worse then with my usual prompt when I ask for detailed explanation or without it. But let me test more. I tried to change prompt a lot of times, current one is the best I found which gives good answers and we can check reply via "Match" metric.

I spent a lot of time to optimize tests to exclude ambiguity as much as possible, so it's possible to prove with high certainty which answer is best among others. If you find some problematic tests - please show me, will be happy to improve tests.

You are correct, my prompt addition didn't help. I think adding 'Ocean island' is probably a more appropriate resolution here.

As an aside, I just got out my intro to symbolic logic and threw some of the trickiest propositional logic proofs I could find in it at GPT4. It was able to answer all of them, no explanation needed. Geez.

Yep, I also found that GPT4 extremely good in logic tasks. Abstract logic from my evals is hard to formalize, that's probably why it's struggling with it. And actually it works for most of the cases even for it.

Regarding 'Ocean-island' the problem here is that it's hard to understand is the island a part of ocean. From common context it's not clear. But the sea is a part of ocean by definition. So I still think it's better. Also I tested a bit ocean-island, and it started to fail on other tests, where didn't fail before.

Anyway @qrdlgit and @Ein-Tim tthank you for attention. Hope someone from maintainers will answer does it worth more efforts to improve tests or will review PR

openai / evals