teacherpeterpan / Logic-LLM

The project page for "LOGIC-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning"
MIT License
259 stars 41 forks source link

Will Pyke always give the correct answer for the query #1

Open Some-random opened 1 year ago

Some-random commented 1 year ago

I'm conducting trials with ProofWriter and stumbled upon an instance where, despite the formalization appearing accurate, it's yielding inaccurate results.

The Problem is is ProofWriter_RelNoneg-OWA-D5-775_Q9, it's taken from the development set.

The textual context is:

The dog chases the squirrel
The dog chases the tiger
The dog eats the tiger
The dog is young
The dog needs the squirrel
The rabbit chases the dog
The rabbit chases the tiger
The squirrel chases the dog
The squirrel is round
The squirrel needs the rabbit
The squirrel needs the tiger
The tiger chases the dog
The tiger eats the dog
The tiger eats the squirrel
The tiger is green
The tiger is round
If someone is green then they eat the rabbit
If someone needs the squirrel and the squirrel is green then they need the tiger
If someone eats the rabbit then they need the squirrel
If the tiger needs the rabbit then the rabbit is green
If someone is nice then they need the rabbit
If the tiger is round then the tiger is young
If someone chases the tiger and the tiger is kind then the tiger eats the rabbit
If someone needs the squirrel then they are nice
If someone chases the tiger then the tiger chases the squirrel

The question is: Based on the above information, is the following statement true, false, or unknown? The tiger needs the rabbit.

The formalized Prolog code is:

Predicates:
Chases($x, $y, bool) ::: Does x chase y?
Eats($x, $y, bool) ::: Does x eat y?
Young($x, bool) ::: Is x young?
Needs($x, $y, bool) ::: Does x need y?
Round($x, bool) ::: Is x round?
Green($x, bool) ::: Is x green?
Nice($x, bool) ::: Is x nice?
Kind($x, bool) ::: Is x kind?

Facts:
Chases(Dog, Squirrel, True) ::: The dog chases the squirrel.
Chases(Dog, Tiger, True) ::: The dog chases the tiger.
Eats(Dog, Tiger, True) ::: The dog eats the tiger.
Young(Dog, True) ::: The dog is young.
Needs(Dog, Squirrel, True) ::: The dog needs the squirrel.
Chases(Rabbit, Dog, True) ::: The rabbit chases the dog.
Chases(Rabbit, Tiger, True) ::: The rabbit chases the tiger.
Chases(Squirrel, Dog, True) ::: The squirrel chases the dog.
Round(Squirrel, True) ::: The squirrel is round.
Needs(Squirrel, Rabbit, True) ::: The squirrel needs the rabbit.
Needs(Squirrel, Tiger, True) ::: The squirrel needs the tiger.
Chases(Tiger, Dog, True) ::: The tiger chases the dog.
Eats(Tiger, Dog, True) ::: The tiger eats the dog.
Eats(Tiger, Squirrel, True) ::: The tiger eats the squirrel.
Green(Tiger, True) ::: The tiger is green.
Round(Tiger, True) ::: The tiger is round.

Rules:
Green($x, True) >>> Eats($x, Rabbit, True) ::: If someone is green then they eat the rabbit.
Needs($x, Squirrel, True) && Green(Squirrel, True) >>> Needs($x, Tiger, True) ::: If someone needs the squirrel and the squirrel is green then they need the tiger.
Eats($x, Rabbit, True) >>> Needs($x, Squirrel, True) ::: If someone eats the rabbit then they need the squirrel.
Needs(Tiger, Rabbit, True) >>> Green(Rabbit, True) ::: If the tiger needs the rabbit then the rabbit is green.
Nice($x, True) >>> Needs($x, Rabbit, True) ::: If someone is nice then they need the rabbit.
Round(Tiger, True) >>> Young(Tiger, True) ::: If the tiger is round then the tiger is young.
Chases($x, Tiger, True) && Kind(Tiger, True) >>> Eats(Tiger, Rabbit, True) ::: If someone chases the tiger and the tiger is kind then the tiger eats the rabbit.
Needs($x, Squirrel, True) >>> Nice($x, True) ::: If someone needs the squirrel then they are nice.
Chases($x, Tiger, True) >>> Chases(Tiger, Squirrel, True) ::: If someone chases the tiger then the tiger chases the squirrel.

Query:
Needs(Tiger, Rabbit, True) ::: The tiger needs the rabbit.

The predicted result and ground truth result are:

"answer": "A",
"predicted_answer": "C"

This raises the question: provided that all the facts, predicates, and rules are correctly formalized, can Pyke always generate the right answer for a query?

Some-random commented 1 year ago

Looks like the error mostly occurs when there are multiple potential reasoning paths for the problem (although only one of them would be correct). There are two more examples from dev set to back this assumption:

ProofWriter_RelNoneg-OWA-D5-861_Q3

The cat needs the dog
The dog is kind
The dog needs the rabbit
The rabbit eats the dog
The rabbit is green
The tiger chases the rabbit
The tiger is big
If something is nice and it eats the tiger then it needs the dog
If something chases the cat then it eats the cat
If the dog chases the tiger then the tiger needs the rabbit
If the rabbit chases the tiger then the rabbit chases the cat
If something chases the cat then the cat is kind
If something eats the dog then it chases the cat
If something is rough then it eats the dog
If something is kind then it is rough
If something eats the rabbit and the rabbit is big then it is kind

Based on the above information, is the following statement true, false, or unknown? The rabbit chases the cat.
Predicates:
Needs($x, $y, bool) ::: Does x need y?
Kind($x, bool) ::: Is x kind?
Eats($x, $y, bool) ::: Does x eat y?
Green($x, bool) ::: Is x green?
Chases($x, $y, bool) ::: Does x chase y?
Big($x, bool) ::: Is x big?
Nice($x, bool) ::: Is x nice?
Rough($x, bool) ::: Is x rough?
Facts:
Needs(Cat, Dog, True) ::: The cat needs the dog.
Kind(Dog, True) ::: The dog is kind.
Needs(Dog, Rabbit, True) ::: The dog needs the rabbit.
Eats(Rabbit, Dog, True) ::: The rabbit eats the dog.
Green(Rabbit, True) ::: The rabbit is green.
Chases(Tiger, Rabbit, True) ::: The tiger chases the rabbit.
Big(Tiger, True) ::: The tiger is big.
Rules:
Nice($x, True) && Eats($x, Tiger, True) >>> Needs($x, Dog, True) ::: If something is nice and it eats the tiger then it needs the dog.
Chases($x, Cat, True) >>> Eats($x, Cat, True) ::: If something chases the cat then it eats the cat.
Chases(Dog, Tiger, True) >>> Needs(Tiger, Rabbit, True) ::: If the dog chases the tiger then the tiger needs the rabbit.
Chases(Rabbit, Tiger, True) >>> Chases(Rabbit, Cat, True) ::: If the rabbit chases the tiger then the rabbit chases the cat.
Chases($x, Cat, True) >>> Kind(Cat, True) ::: If something chases the cat then the cat is kind.
Eats($x, Dog, True) >>> Chases($x, Cat, True) ::: If something eats the dog then it chases the cat.
Rough($x, True) >>> Eats($x, Dog, True) ::: If something is rough then it eats the dog.
Kind($x, True) >>> Rough($x, True) ::: If something is kind then it is rough.
Eats($x, Rabbit, True) && Big(Rabbit, True) >>> Kind($x, True) ::: If something eats the rabbit and the rabbit is big then it is kind.
Query:
Chases(Rabbit, Cat, True) ::: The rabbit chases the cat.

ProofWriter_RelNoneg-OWA-D5-745_Q5:

The bear is round
The bear likes the cow
The bear visits the lion
The cow is big
The cow is rough
The lion needs the bear
The mouse likes the lion
If someone visits the lion then they are round
If someone needs the bear then they need the lion
If someone needs the lion then they visit the lion
If someone visits the lion and the lion is round then they like the lion
If someone is rough then they need the lion
If someone likes the lion and they need the lion then the lion visits the mouse
If someone is rough then they visit the lion
If someone likes the lion and the lion is big then they are kind

Based on the above information, is the following statement true, false, or unknown? The lion visits the lion.
Predicates:
Round($x, bool) ::: Is x round?
Likes($x, $y, bool) ::: Does x like y?
Visits($x, $y, bool) ::: Does x visit y?
Big($x, bool) ::: Is x big?
Rough($x, bool) ::: Is x rough?
Needs($x, $y, bool) ::: Does x need y?
Kind($x, bool) ::: Is x kind?

Facts:
Round(Bear, True) ::: The bear is round.
Likes(Bear, Cow, True) ::: The bear likes the cow.
Visits(Bear, Lion, True) ::: The bear visits the lion.
Big(Cow, True) ::: The cow is big.
Rough(Cow, True) ::: The cow is rough.
Needs(Lion, Bear, True) ::: The lion needs the bear.
Likes(Mouse, Lion, True) ::: The mouse likes the lion.

Rules:
Visits($x, Lion, True) >>> Round($x, True) ::: If someone visits the lion then they are round.
Needs($x, Bear, True) >>> Needs($x, Lion, True) ::: If someone needs the bear then they need the lion.
Needs($x, Lion, True) >>> Visits($x, Lion, True) ::: If someone needs the lion then they visit the lion.
Visits($x, Lion, True) && Round(Lion, True) >>> Likes($x, Lion, True) ::: If someone visits the lion and the lion is round then they like the lion.
Rough($x, True) >>> Needs($x, Lion, True) ::: If someone is rough then they need the lion.
Likes($x, Lion, True) && Needs($x, Lion, True) >>> Visits(Lion, Mouse, True) ::: If someone likes the lion and they need the lion then the lion visits the mouse.
Rough($x, True) >>> Visits($x, Lion, True) ::: If someone is rough then they visit the lion.
Likes($x, Lion, True) && Big(Lion, True) >>> Kind($x, True) ::: If someone likes the lion and the lion is big then they are kind.

Query:
Visits(Lion, Lion, True) ::: The lion visits the lion.
Shujie24 commented 1 year ago

I encountered the same issue, when solving ProntoQA using pyke. One typical example is the following. The problem is ProntoQA_10:

{
    "id": "ProntoQA_10",
    "context": "Every impus is earthy. Each impus is a jompus. Jompuses are small. Jompuses are rompuses. Rompuses are not amenable. Rompuses are wumpuses. Wumpuses are wooden. Wumpuses are zumpuses. Every zumpus is temperate. Every zumpus is a dumpus. Dumpuses are dull. Dumpuses are vumpuses. Every vumpus is not shy. Every yumpus is sweet. Vumpuses are numpuses. Numpuses are not sweet. Numpuses are tumpuses. Fae is a wumpus.",
    "question": "Is the following statement true or false? Fae is sweet.",
    "options": [
      "A) True",
      "B) False"
    ],
    "answer": "B",
    "explanation": [
      "Fae is a wumpus.",
      "Wumpuses are zumpuses.",
      "Fae is a zumpus.",
      "Every zumpus is a dumpus.",
      "Fae is a dumpus.",
      "Dumpuses are vumpuses.",
      "Fae is a vumpus.",
      "Vumpuses are numpuses.",
      "Fae is a numpus.",
      "Numpuses are not sweet.",
      "Fae is not sweet."
    ]
  }

And the formation from natural language to program is also correct:

fact1
    foreach
        facts.Impus($x, True)
    assert
        facts.Earthy($x, True)

fact2
    foreach
        facts.Impus($x, True)
    assert
        facts.Jompus($x, True)

fact3
    foreach
        facts.Jompus($x, True)
    assert
        facts.Small($x, True)

fact4
    foreach
        facts.Jompus($x, True)
    assert
        facts.Rompus($x, True)

fact5
    foreach
        facts.Rompus($x, True)
    assert
        facts.Amenable($x, False)

fact6
    foreach
        facts.Rompus($x, True)
    assert
        facts.Wumpus($x, True)

fact7
    foreach
        facts.Wumpus($x, True)
    assert
        facts.Wooden($x, True)

fact8
    foreach
        facts.Wumpus($x, True)
    assert
        facts.Zumpus($x, True)

fact9
    foreach
        facts.Zumpus($x, True)
    assert
        facts.Temperate($x, True)

fact10
    foreach
        facts.Zumpus($x, True)
    assert
        facts.Dumpus($x, True)

fact11
    foreach
        facts.Dumpus($x, True)
    assert
        facts.Dull($x, True)

fact12
    foreach
        facts.Dumpus($x, True)
    assert
        facts.Vumpus($x, True)

fact13
    foreach
        facts.Vumpus($x, True)
    assert
        facts.Shy($x, False)

fact14
    foreach
        facts.Yumpus($x, True)
    assert
        facts.Sweet($x, True)

fact15
    foreach
        facts.Vumpus($x, True)
    assert
        facts.Numpus($x, True)

fact16
    foreach
        facts.Numpus($x, True)
    assert
        facts.Sweet($x, False)

fact17
    foreach
        facts.Numpus($x, True)
    assert
        facts.Tumpus($x, True)

But after giving these to pyke, the output prediction is A rather than the correct answer B. Is there some problems with pyke? Is pyke giving the correct answer?