Closed MovGP0 closed 1 month ago
Example from Falsehoods Programmers Believe About Systems of Measurement:
Falsehood
There is no need to differentiate between absolute and relative measurements.
Question:
What is the sum of 0 °C and 0 °C?
Agent:
The sum of 0 °C and 0 °C is 0 °C.
Correct answer:
0 °C = 273.15 K
0 °C + 0 °C = 273.15 K + 273.15 K = 546.3 K
546.3 K = 273.15 °C
Note: almost all people get this wrong because they assume that degrees Celsius work like kelvins, but they don't. Degrees Celsius refer to a relative measurement, while kelvins refer to an absolute measurement.
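The arithmetic in the correct answer can be written out as a small sketch. The helper names (`celsius_to_kelvin`, `kelvin_to_celsius`, `add_celsius_readings`) are made up for illustration, and the "sum" follows the reading used above: convert to kelvins, add, convert back.

```python
ZERO_CELSIUS_IN_KELVIN = 273.15


def celsius_to_kelvin(celsius: float) -> float:
    return celsius + ZERO_CELSIUS_IN_KELVIN


def kelvin_to_celsius(kelvin: float) -> float:
    return kelvin - ZERO_CELSIUS_IN_KELVIN


def add_celsius_readings(a: float, b: float) -> float:
    """Sum two Celsius readings on the absolute (Kelvin) scale, then convert back."""
    return kelvin_to_celsius(celsius_to_kelvin(a) + celsius_to_kelvin(b))


naive_sum = 0.0 + 0.0  # treats Celsius readings as if they were absolute
absolute_sum = add_celsius_readings(0.0, 0.0)
print(f"naive: {naive_sum} °C, via kelvins: {absolute_sum:.2f} °C")
```

The two results differ by exactly the offset between the scales, which is the point of the falsehood.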
Falsehood:
Objects at rest relative to the Earth's surface experience no acceleration.
Question:
What is the acceleration of a stone that is firmly placed on the ground on the surface of the Earth?
Agent:
The acceleration of a stone that is firmly placed on the ground on the surface of the Earth is zero. This is because the stone is at rest, and there is no net force acting on it to cause it to accelerate.
Correct answer:
The ground exerts a force on the stone that prevents it from falling freely, accelerating the stone at approximately 9.8 m/s² away from the Earth's (gravitational) centre.
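The correct answer describes what an accelerometer attached to the stone would report (the proper acceleration), which depends only on the non-gravitational forces. A rough sketch, assuming standard gravity of 9.80665 m/s² and a hypothetical 2 kg stone; the function name is invented for this example:

```python
STANDARD_GRAVITY = 9.80665  # m/s², assumed value


def proper_acceleration(contact_force_newtons: float, mass_kg: float) -> float:
    """Acceleration an accelerometer would report: contact force divided by mass."""
    return contact_force_newtons / mass_kg


mass_kg = 2.0  # hypothetical stone
# The ground pushes up with a force equal in magnitude to the stone's weight.
normal_force = mass_kg * STANDARD_GRAVITY

resting = proper_acceleration(normal_force, mass_kg)
free_fall = proper_acceleration(0.0, mass_kg)  # no contact force while falling
print(f"resting: {resting:.2f} m/s² upward, free fall: {free_fall:.2f} m/s²")
```

In this picture the resting stone is the one being accelerated, while the freely falling stone is not.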
Example from Falsehoods Programmers Believe About Names
Question:
I am designing a web page for entering the personal data of newborns. Give me an example of the HTML I need.
Agent:
<!-- truncated for simplicity -->
<label for="firstName">First Name:</label>
<input type="text" id="firstName" name="firstName" required>
<label for="lastName">Last Name:</label>
<input type="text" id="lastName" name="lastName" required>
Critique:
Not all newborns have a name immediately after birth, as parents might still be undecided. Consider making the first and last name fields optional (required should be removed). Some cultures do not have a concept of a "first name" and "last name." Include one or more fields for middle names, allowing flexibility for multiple middle names. ...
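On the data-model side, the three points spelled out in the critique could look roughly like this. The `NewbornRecord` class and its field names are invented for illustration and not taken from any real system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NewbornRecord:
    # All name parts are optional: the parents may still be undecided.
    given_names: List[str] = field(default_factory=list)  # zero, one, or many
    family_name: Optional[str] = None
    # Free-form name for cultures without a first/last name split.
    full_name: Optional[str] = None

    def display_name(self) -> str:
        if self.full_name:
            return self.full_name
        parts = self.given_names + ([self.family_name] if self.family_name else [])
        return " ".join(parts) if parts else "(not yet named)"
```

A record with no name at all is valid here, which is exactly the case the `required` attributes in the generated HTML would reject.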
I don't think this is in the scope of SWE-bench. If these things were relevant to the respective repositories from which the task instances are sampled, they would/should have unit tests for them. I don't think any of the three examples applies to any of the repositories either.
I don't think this is in the scope of SWE-bench.
I agree. There might be the need for yet another benchmark. It seems like something that all current benchmarks are missing.
Note that my examples given have a real impact on software development:
Those examples are something that current AI and most humans fail at, but an AGI should be able to figure them out.
I'm closing this as unplanned.
Describe the feature
Large language models learn from people, and there are issues on which most people are wrong. This bias carries over into the language models. Even when a model is able to prove mathematically that it is wrong, it often rationalizes an incorrect assumption it made prior to the proof.
Potential Solutions
Create test scenarios that check for typical falsehoods. You can find a curated list of typical falsehoods here:
It should be possible to create automated tests from this, which can be used for AI model benchmarking.
Setup
An example setup might look like the following:
Example
Falsehood:
Question:
Answer Agent:
Critique Agent:
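The template could be turned into an automated check along these lines. Everything here is a placeholder: `FalsehoodCase`, `critique`, and `run_case` are invented for this sketch, the stub agents stand in for real model calls, and the keyword match is only a crude substitute for an actual critique agent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FalsehoodCase:
    falsehood: str
    question: str
    # Phrases a correct answer is expected to contain.
    expected_phrases: List[str]


def critique(case: FalsehoodCase, answer: str) -> bool:
    """Pass only if every expected phrase appears in the answer (case-insensitive)."""
    lowered = answer.lower()
    return all(phrase.lower() in lowered for phrase in case.expected_phrases)


def run_case(case: FalsehoodCase, answer_agent: Callable[[str], str]) -> bool:
    return critique(case, answer_agent(case.question))


case = FalsehoodCase(
    falsehood="There is no need to differentiate between absolute and relative measurements.",
    question="What is the sum of 0 °C and 0 °C?",
    expected_phrases=["273.15 °C"],
)


def naive_agent(question: str) -> str:
    return "The sum of 0 °C and 0 °C is 0 °C."


def careful_agent(question: str) -> str:
    return "0 °C is 273.15 K, so the sum is 546.3 K, which is 273.15 °C."


print(run_case(case, naive_agent), run_case(case, careful_agent))
```

A benchmark score would then be the fraction of cases for which `run_case` returns `True`.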