nuprl / MultiPL-E

A multi-programming language benchmark for LLMs
https://nuprl.github.io/MultiPL-E/

Performance issue with Julia/R languages #143

Closed Devy99 closed 4 months ago

Devy99 commented 4 months ago

Hello!

First, I would like to thank you for your time and effort invested in developing this tool. I am writing to report an issue that I have encountered while evaluating Julia and R code on the HumanEval dataset. I have noticed that these two languages are very "expensive" in terms of the resources required to run the test cases.

In particular, it appears that increasing the number of functions to test per problem also increases CPU utilization, as if it launches a new process for each function to test. To avoid this problem, I am running the docker container with the option "--cpus 6". However, this leads to lots of timeouts, significantly impacting the final pass rate.

I also experimented with other languages, such as Lua, but found no specific issues.

Do you have any suggestions for how I can fix this problem?

Thanks in advance!

arjunguha commented 4 months ago

Thanks for the report. Can you tell me a little about the hardware you're running on?

Devy99 commented 4 months ago

Sure, here are the details of the server hardware:

OS: Ubuntu

CPU:

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
CPU(s):                             256
On-line CPU(s) list:                0-255
Thread(s) per core:                 2
Core(s) per socket:                 64
Socket(s):                          2
NUMA node(s):                       8
Vendor ID:                          AuthenticAMD
CPU family:                         23
Model:                              49
Model name:                         AMD EPYC 7742 64-Core Processor

Memory:

MemTotal:       528 GB

arjunguha commented 4 months ago

This is some solid hardware. I'm surprised it doesn't "just work". Could you clarify what you meant by this:

"increasing the number of functions to test per problem"

Do you mean increasing the number of generations?

Devy99 commented 4 months ago

Exactly. Right now I am generating 200 completions for each problem and then running the test cases.

Looking at some examples of Julia code, it seems you also experienced some Timeout results, though on a smaller scale. Could this be related to how evaluation is implemented for Julia? For example, evaluation in Lua is far faster and produces no timeouts.

arjunguha commented 4 months ago

Agreed. The first make_a_pile Timeout really should be a StackOverflow. (I get the error in 2 seconds on replit.com.)

Would you try setting the --max-workers flag:

https://github.com/nuprl/MultiPL-E/blob/main/evaluation/src/main.py#L109

The file above is the entry point to the container:

https://github.com/nuprl/MultiPL-E/blob/main/evaluation/Dockerfile#L87

So you should be able to pass --max-workers to the container from your command line.

I think the issue is that the default number of --max-workers is too high for Julia:

https://github.com/nuprl/MultiPL-E/blob/main/evaluation/src/main.py#L129
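
To illustrate why the worker count matters, here is a minimal Python sketch of a bounded worker pool that runs one subprocess per program. It is not MultiPL-E's actual code: `run_all`, `run_one`, and the trivial `python -c "pass"` child process are hypothetical stand-ins for "run one completion against its tests". The point is only that `max_workers` caps how many subprocesses compete for CPU at once, so a cap above the number of usable cores can push slow-starting runtimes past their timeout.

```python
import subprocess
import sys
import threading
from concurrent.futures import ThreadPoolExecutor


def run_all(num_programs: int, max_workers: int, timeout_s: float = 30.0):
    """Run num_programs trivial subprocesses, at most max_workers at a time.

    Returns (statuses, peak), where peak is the highest number of
    subprocesses that were in flight at the same moment.
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def run_one(_index: int) -> str:
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            # Hypothetical stand-in for executing one generated program.
            subprocess.run(
                [sys.executable, "-c", "pass"],
                timeout=timeout_s,
                check=True,
            )
            return "OK"
        except subprocess.TimeoutExpired:
            return "Timeout"
        except subprocess.CalledProcessError:
            return "Exception"
        finally:
            with lock:
                in_flight -= 1

    # The pool never runs more than max_workers tasks concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(run_one, range(num_programs)))
    return statuses, peak


if __name__ == "__main__":
    statuses, peak = run_all(num_programs=8, max_workers=3)
    print(statuses.count("OK"), "ok; peak concurrency:", peak)
```

With 200 completions per problem, the pool size rather than the number of completions determines how many processes run simultaneously.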

Perhaps try half the number of allocated cores. (I think I recall doing this for the original paper.)
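
As a rough sketch of the "half the allocated cores" rule of thumb, the helper below derives a worker count from the number of CPUs given to the container. The function name `suggested_max_workers` is hypothetical and not part of MultiPL-E; the allocation is passed in explicitly (for example, 6 for `--cpus 6`) rather than detected.

```python
import math


def suggested_max_workers(allocated_cpus: float) -> int:
    """Half the allocated cores, rounded down, but never fewer than one."""
    return max(1, math.floor(allocated_cpus / 2))


if __name__ == "__main__":
    # A container started with "--cpus 6" would get 3 workers.
    print(suggested_max_workers(6))
```

The resulting number is what would be passed as the value of --max-workers.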

Devy99 commented 4 months ago

Passing --max-workers fixed the problem. Thanks!