princeton-nlp / SWE-bench

[ICLR 2024] SWE-bench: Can Language Models Resolve Real-world Github Issues?
https://www.swebench.com
MIT License

Test for human falsehoods #208

Closed MovGP0 closed 1 month ago

MovGP0 commented 3 months ago

Describe the feature

Large language models learn from people, and there are issues on which most people are wrong. This bias carries over into the language models. Even when a model is able to mathematically prove that it is wrong, it often rationalizes an incorrect assumption it made prior to the proof.

Potential Solutions

Create test scenarios that check for typical falsehoods. You can find a curated list of typical falsehoods here:

It should be possible to create automated tests from this, which can be used for AI model benchmarking.

Setup

An example setup might look like the following:

  1. Question: Create questions based on a known falsehood
  2. Answer Agent: Use an agent of the model under test to answer the question
  3. Critique Agent: Use an agent to rate if the answer is correct or has a falsehood in it
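The three steps above could be wired together roughly as follows. This is a minimal sketch, not an existing SWE-bench feature: `FalsehoodCase`, `Verdict`, `run_case`, and the two agent callables are hypothetical names, and the stub agents stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FalsehoodCase:
    falsehood: str  # the commonly believed but wrong statement
    question: str   # a question designed to elicit it


@dataclass
class Verdict:
    answer: str
    contains_falsehood: bool


def run_case(
    case: FalsehoodCase,
    answer_agent: Callable[[str], str],
    critique_agent: Callable[[str, str], bool],
) -> Verdict:
    """Ask the model under test, then let a critic judge the answer."""
    answer = answer_agent(case.question)
    flagged = critique_agent(case.falsehood, answer)
    return Verdict(answer=answer, contains_falsehood=flagged)


# Stub agents for illustration only; real ones would call a model.
def stub_answer_agent(question: str) -> str:
    return "A valid email address should have exactly one @ sign."


def stub_critique_agent(falsehood: str, answer: str) -> bool:
    # Hypothetical heuristic: flag answers that restate the key phrase.
    return "exactly one @" in answer


case = FalsehoodCase(
    falsehood="A valid email address has exactly one @ character.",
    question="How many @ signs does a valid email address have?",
)
verdict = run_case(case, stub_answer_agent, stub_critique_agent)
print(verdict.contains_falsehood)
```

Passing the agents in as plain callables keeps the harness independent of any particular model API; a benchmark score would then be the fraction of cases in which `contains_falsehood` is false.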

Example

Falsehood:

A valid email address has exactly one @ character.

Question:

How many @ signs does a valid email address have?

Answer Agent:

A valid email address should have exactly one "@" sign. Having more than one "@" sign or none at all would make the email address invalid according to standard email format rules.

Critique Agent:

The statement that a valid email address must contain exactly one "@" sign is an oversimplification and not entirely accurate. While it is true that most email addresses follow this rule, email addresses can technically include multiple "@" signs according to the relevant RFC standards. Therefore, when validating email addresses, it's important to consider edge cases and follow proper RFC guidelines rather than relying on simplistic rules.
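To make the edge case concrete: RFC 5322 allows the local part of an address to be a quoted string, and a quoted string may contain "@". The sketch below contrasts the naive "exactly one @" check with a split at the last "@". It is not a full validator, and the function names are chosen here for illustration only.

```python
def naive_is_valid(address: str) -> bool:
    """The falsehood: accept only addresses with exactly one '@'."""
    return address.count("@") == 1


def split_address(address: str) -> tuple[str, str]:
    """Split at the LAST '@', which separates local part from domain."""
    local, _, domain = address.rpartition("@")
    return local, domain


# Quoted local part that itself contains an '@'.
address = '"user@home"@example.com'

print(naive_is_valid(address))  # the naive check rejects this address
print(split_address(address))   # local part keeps its embedded '@'
```

A real test case would more likely rely on an RFC-aware parser than on hand-written splitting; the point here is only that counting "@" characters gives the wrong answer for such addresses.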

MovGP0 commented 3 months ago

Example from Falsehoods Programmers Believe About Systems of Measurement:

Falsehood

There is no need to differentiate between absolute and relative measurements.

Question:

What is the sum of 0 °C and 0 °C?

Agent:

The sum of 0 °C and 0 °C is 0 °C.

Correct answer:

0 °C = 273.15 K
0 °C + 0 °C = 273.15 K + 273.15 K = 546.3 K
546.3 K = 273.15 °C

Note: almost all people get this wrong, because they think that degrees Celsius work like kelvins, but they don't. Degrees Celsius refer to a relative measurement, while kelvins refer to an absolute measurement.
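The arithmetic above can be reproduced with a short sketch. It follows the reading given in this comment (both operands are absolute temperatures) and contrasts it with the case where the second operand is a temperature difference. The function names are illustrative assumptions.

```python
ZERO_CELSIUS_IN_KELVIN = 273.15


def celsius_to_kelvin(t_c: float) -> float:
    """Convert an absolute temperature from °C to K."""
    return t_c + ZERO_CELSIUS_IN_KELVIN


def kelvin_to_celsius(t_k: float) -> float:
    """Convert an absolute temperature from K to °C."""
    return t_k - ZERO_CELSIUS_IN_KELVIN


def add_as_absolute(t1_c: float, t2_c: float) -> float:
    """Treat both values as absolute temperatures and add them in kelvin."""
    total_k = celsius_to_kelvin(t1_c) + celsius_to_kelvin(t2_c)
    return kelvin_to_celsius(total_k)


def add_difference(t_c: float, delta_c: float) -> float:
    """Treat the second value as a difference (a 1 °C step equals a 1 K step)."""
    return t_c + delta_c


print(add_as_absolute(0.0, 0.0))  # the result derived in this comment, in °C
print(add_difference(0.0, 0.0))   # the answer most people (and the agent) give
```

Which function is appropriate depends on whether the question means an absolute temperature or a temperature difference, which is exactly the ambiguity the falsehood is about.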

MovGP0 commented 3 months ago

Falsehood:

Objects at rest relative to the Earth's surface experience no acceleration.

Question:

What is the acceleration of a stone that is firmly placed on the ground on the surface of the Earth?

Agent:

The acceleration of a stone that is firmly placed on the ground on the surface of the Earth is zero. This is because the stone is at rest, and there is no net force acting on it to cause it to accelerate.

Correct answer:

The ground exerts a force on the stone that prevents it from free-falling, accelerating the stone at approximately 9.8 m/s² away from the Earth's (gravitational) centre.
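The two answers differ in which acceleration they mean: coordinate acceleration (measured relative to the ground) or proper acceleration (what an accelerometer attached to the stone would read). A small sketch of that distinction follows. The sign convention (positive means away from the Earth's centre) and the function name are assumptions made for illustration.

```python
G = 9.8  # approximate gravitational acceleration at the surface, in m/s^2


def proper_acceleration(coordinate_acceleration: float,
                        gravitational_acceleration: float = -G) -> float:
    """Acceleration relative to free fall.

    Positive values point away from the Earth's centre. Gravity points
    towards the centre, hence the negative default.
    """
    return coordinate_acceleration - gravitational_acceleration


# Stone resting on the ground: it does not move relative to the ground.
stone_at_rest = proper_acceleration(coordinate_acceleration=0.0)

# Stone in free fall: it accelerates towards the centre at G.
stone_falling = proper_acceleration(coordinate_acceleration=-G)

print(stone_at_rest)  # what an accelerometer on the resting stone reads
print(stone_falling)  # what an accelerometer on the falling stone reads
```

Under this convention the agent's answer (zero) describes the coordinate acceleration, while the answer given in this comment describes the proper acceleration.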

MovGP0 commented 3 months ago

Example from Falsehoods Programmers Believe About Names

Question:

I am designing a web page for entering the personal data of newborns. Give me an example of the HTML I need.

Agent:

<!-- truncated for simplicity -->
<label for="firstName">First Name:</label>
<input type="text" id="firstName" name="firstName" required>
<label for="lastName">Last Name:</label>
<input type="text" id="lastName" name="lastName" required>

Critique:

Not all newborns have a name immediately after birth, as parents might still be undecided. Consider making the first and last name fields optional (required should be removed). Some cultures do not have a concept of a "first name" and "last name." Include one or more fields for middle names, allowing flexibility for multiple middle names. ...

klieret commented 3 months ago

I don't think this is in the scope of SWE-bench. If these things were relevant to the respective repositories from which the task instances are sampled, they would/should have unit tests for them. I don't think any of the three examples applies to any of the repositories either.

MovGP0 commented 3 months ago

I don't think this is in the scope of SWE-bench.

I agree. There might be the need for yet another benchmark. It seems like something that all current benchmarks are missing.

MovGP0 commented 3 months ago

Note that my examples given have a real impact on software development:

Those examples are something that current AI and most humans fail at, but AGI should be able to figure out.

carlosejimenez commented 1 month ago

I'm closing this as unplanned.