Closed Devy99 closed 4 months ago
Thanks for the report. Can you tell me a little about the hardware you're running on?
Sure, below are the details of the server hardware:
OS: Ubuntu
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7742 64-Core Processor
Memory:
MemTotal: 528 GB
This is some solid hardware. I'm surprised it doesn't "just work". Could you clarify what you meant by this:
"increasing the number of functions to test per problem"
Do you mean increasing the number of generations?
Exactly. Right now I am generating 200 completions for each problem and then running the test cases.
Looking at some examples with Julia code, it seems you also experienced some timeouts, though on a smaller scale. Could it be related to how the evaluation was implemented for Julia? For example, in Lua the evaluation is far faster and produces no timeouts.
Agreed. The first make_a_pile timeout really should be a StackOverflow. (I get the error in 2 seconds on replit.com.)
Would you try setting the --max-workers flag?
https://github.com/nuprl/MultiPL-E/blob/main/evaluation/src/main.py#L109
The file above is the entry point to the container:
https://github.com/nuprl/MultiPL-E/blob/main/evaluation/Dockerfile#L87
So, you should be able to pass --max-workers to the container from your CLI.
I think the issue is that the default value of --max-workers is too high for Julia:
https://github.com/nuprl/MultiPL-E/blob/main/evaluation/src/main.py#L129
Perhaps try half the number of allocated cores. (I think I recall doing this for the original paper.)
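The sizing heuristic above can be sketched in a few lines of Python. This is illustrative only, not MultiPL-E's actual pool setup; the helper name `pick_max_workers` and the toy workload are made up for the example:

```python
import os
from concurrent.futures import ProcessPoolExecutor


def pick_max_workers() -> int:
    """Use half the visible cores, as suggested above (minimum 1)."""
    cpus = os.cpu_count() or 1
    return max(1, cpus // 2)


def square(n: int) -> int:
    # Stand-in for running one test case.
    return n * n


if __name__ == "__main__":
    workers = pick_max_workers()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(square, range(8)))
    print(workers, results)
```

The idea is simply to leave headroom: with hyper-threading (2 threads per core, as in the lscpu output above), spawning one heavy worker per logical CPU oversubscribes the machine and pushes slow languages like Julia past the timeout.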
Passing --max-workers fixed the problem. Thanks!
Hello!
First, I would like to thank you for the time and effort you have invested in developing this tool. I am writing to report an issue that I have encountered while evaluating Julia and R code on the HumanEval dataset. I have noticed that these two languages are very "expensive" in terms of the resources required to run the test cases.
In particular, it appears that increasing the number of functions to test per problem also increases CPU utilization, as if a new process were launched for each function under test. To avoid this problem, I am running the Docker container with the option "--cpus 6". However, this leads to lots of timeouts, significantly impacting the final pass rate.
I also experimented with other languages, such as Lua, but found no specific issues.
Do you have any clues or suggestions on how I can fix this problem?
Thanks in advance!
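The behavior described above (one process per function under test, so 200 completions per problem contend for the CPUs granted by `--cpus 6`) can be sketched as follows. This is a simplified illustration, not MultiPL-E's actual harness; the timeout value and the use of a Python subprocess as the "test" are assumptions for the example:

```python
import subprocess
import sys


def run_completion(code: str, timeout_s: float = 5.0) -> str:
    """Run one generated completion in its own interpreter process.

    With N completions per problem, N such processes compete for the
    cores the container is allowed to use, so each one runs slower
    and becomes more likely to hit the timeout.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return "ok" if proc.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        return "timeout"


if __name__ == "__main__":
    print(run_completion("print(1 + 1)"))            # fast test passes
    print(run_completion("while True: pass", 0.5))   # slow test times out
```

Capping `--cpus` without also lowering `--max-workers` reproduces exactly this situation: the per-test wall-clock budget stays fixed while each test gets a smaller slice of CPU.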