nuprl / MultiPL-E

A multi-programming language benchmark for LLMs
https://nuprl.github.io/MultiPL-E/

Citation for the LeetCode Dataset #140

Closed JJGO closed 5 months ago

JJGO commented 6 months ago

I'm curious what the citation for the LeetCode dataset is, and how the dataset was built.

cassanof commented 6 months ago

Hello Jose!

The current LeetCode dataset is a bit problematic. I sourced the solutions from a Hugging Face repo that no longer exists. After careful analysis, I found that some solutions were incorrect, which makes evaluation quite flaky.

Anyway, the process I used to convert the solutions into a MultiPL-E eval was the following:

  1. I identified the root function for each solution by generating a control graph and picking the function with no dependents. If for some reason there was more than one function with no dependents, I discarded the whole item.
  2. I transplanted all helper functions and made them local functions of the root function.
  3. I generated unit tests for the root function using GPT-4.
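
For readers who want to reproduce steps 1 and 2, here is a minimal sketch, not the script actually used for the dataset. It assumes the solutions are Python, that "control graph" means a call graph over top-level functions, and that "no dependents" means no other top-level function calls it. The names `find_root_function` and `inline_helpers` are hypothetical, and `ast.unparse` needs Python 3.9 or later.

```python
import ast
from typing import Optional


def find_root_function(source: str) -> Optional[str]:
    """Return the only top-level function that no other top-level
    function calls, or None if there is not exactly one."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}

    called = set()
    for name, node in funcs.items():
        for sub in ast.walk(node):
            if (isinstance(sub, ast.Call)
                    and isinstance(sub.func, ast.Name)
                    and sub.func.id in funcs
                    and sub.func.id != name):  # ignore direct recursion
                called.add(sub.func.id)

    roots = [name for name in funcs if name not in called]
    # Assumption: an ambiguous item (zero or several roots) is discarded.
    return roots[0] if len(roots) == 1 else None


def inline_helpers(source: str) -> Optional[str]:
    """Move every helper function into the body of the root function."""
    root_name = find_root_function(source)
    if root_name is None:
        return None
    tree = ast.parse(source)
    funcs = [n for n in tree.body if isinstance(n, ast.FunctionDef)]
    root = next(f for f in funcs if f.name == root_name)
    helpers = [f for f in funcs if f is not root]
    # Helpers become local definitions at the top of the root function.
    root.body = helpers + root.body
    tree.body = [n for n in tree.body if n not in helpers]
    return ast.unparse(ast.fix_missing_locations(tree))
```

Calling `inline_helpers` on a solution with one entry point and a few helpers returns source text with a single top-level function. Solutions written as methods on a class would need extra handling that this sketch does not attempt.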

Another problem with the dataset is that most of these problems are in the training data of models.

I'm working on a better LeetCode dataset on this branch: https://github.com/nuprl/MultiPL-E/tree/new-leetcode

It is based on LeetCode contest solutions that have been verified, sourced from: https://github.com/deepseek-ai/DeepSeek-Coder/blob/main/Evaluation/LeetCode/data/20240121-Jul.jsonl

These are all LeetCode problems released after Jan 2024. You can use it now if you'd like. I haven't merged it yet because I want to hand-verify the solutions myself.
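
If you want to inspect the source file before using it, a small sketch like the following may help. It assumes only that the file is in JSON Lines format (one JSON object per line), as the `.jsonl` extension suggests. It makes no assumption about field names, and `read_jsonl` and `count_keys` are hypothetical helper names.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_keys(records: Iterable[dict]) -> Dict[str, int]:
    """Count how often each top-level key appears across the records."""
    counts: Counter = Counter()
    for record in records:
        counts.update(record.keys())
    return dict(counts)
```

With a local copy of the file, `count_keys(read_jsonl(open("20240121-Jul.jsonl", encoding="utf-8")))` shows which fields are present in every record, which is a reasonable first check before hand-verifying solutions.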

Let me know if you have other questions.