nuprl / MultiPL-E

A multi-programming language benchmark for LLMs
https://nuprl.github.io/MultiPL-E/

Task HumanEval/092 has contradictory tests in Rust #142

Open geajack opened 5 months ago

geajack commented 5 months ago

The Rust version of HumanEval/092 contains the following lines:

assert_eq!(candidate(3.0, 4.0, 7.0), true);
assert_eq!(candidate(3.0, 4.0, 7.0), false);

(I think this is row 67 of the Hugging Face dataset for MultiPL-E, but I haven't checked.)

This obviously makes the tests unsatisfiable. It looks like a type-casting issue introduced when translating from Python, where the first argument is the int 3 in one test and the float 3.0 in the other. The original tests read:

assert candidate(3,4,7)==True, "This prints if this assert fails 9 (also good for debugging!)"
assert candidate(3.0,4,7)==False, "This prints if this assert fails 10 (also good for debugging!)"
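
To make the collapse concrete, here is a minimal sketch. It assumes the translation renders every numeric argument as an `f64`, which is what the Rust asserts above suggest. `PyNum` and `to_rust_arg` are hypothetical names for illustration, not MultiPL-E's actual translator code.

```rust
// Hypothetical model of a Python numeric literal as seen by a translator.
// This is NOT MultiPL-E's real code, only an illustration of the issue.
#[derive(Debug, Clone, Copy)]
enum PyNum {
    Int(i64),
    Float(f64),
}

// Assumed behaviour: every numeric argument is emitted as an f64.
fn to_rust_arg(n: PyNum) -> f64 {
    match n {
        PyNum::Int(i) => i as f64, // Python `3` is emitted as `3.0`
        PyNum::Float(f) => f,      // Python `3.0` stays `3.0`
    }
}
```

Under that assumption, `to_rust_arg(PyNum::Int(3))` and `to_rust_arg(PyNum::Float(3.0))` are the same value, so the int/float distinction the Python tests depend on is lost and the two asserts receive identical inputs.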

arjunguha commented 5 months ago

wow, thanks. yeah, we should make a decision on how to fix this. I'm going to guess that this affects other typed languages too.

arjunguha commented 5 months ago

have you seen HumanEval+ btw? Does that address this?

geajack commented 5 months ago

No, I haven't looked into HumanEval+.

arjunguha commented 5 months ago

The original Python problem barely makes sense in a typed language such as Rust:

https://github.com/nuprl/MultiPL-E/blob/main/datasets/originals/HumanEval_92_any_int.py

It's not clear to me if this should be fixed by changing the problem, removing the problem from MultiPL-E, or just left as something that fails.

Randl commented 5 months ago

I would read the problem as "the number is an integer" rather than "the type of variable is integer", i.e., I'd expect

assert candidate(3.0,4,7)==True

That, however, would mean the problem doesn't match the original HumanEval, so maybe it's better to just drop it.
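
A rough sketch of that "the number is an integer" reading in Rust, assuming the `f64` signature implied by the translated tests. The name `any_int_by_value` is hypothetical, and this is not the benchmark's reference solution.

```rust
// Sketch of the value-based reading: an argument counts as an integer
// when it has no fractional part, regardless of its static type.
// `any_int_by_value` is a hypothetical name used only for illustration.
fn any_int_by_value(x: f64, y: f64, z: f64) -> bool {
    // All three arguments must be finite and have a zero fractional part.
    let all_integral = [x, y, z]
        .iter()
        .all(|v| v.is_finite() && v.fract() == 0.0);
    // One of the numbers must equal the sum of the other two.
    all_integral && (x + y == z || x + z == y || y + z == x)
}
```

With this reading, `any_int_by_value(3.0, 4.0, 7.0)` is `true`, so the translated test expecting `false` for those inputs would have to be dropped or changed, which is the mismatch with the original HumanEval noted above.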