Closed: JJGO closed this issue 5 months ago.
Hello Jose!
The current LeetCode dataset is a bit problematic. I sourced the solutions from a Hugging Face repo that no longer exists. After careful analysis, I found that some of the solutions were incorrect, which makes evaluation quite flaky.
Anyway, the process I used to convert the solutions into a MultiPL-E eval is the following:
Another problem with the dataset is that most of these problems are in the training data of models.
I'm working on a better LeetCode dataset on this branch: https://github.com/nuprl/MultiPL-E/tree/new-leetcode
It is based on LeetCode contest solutions that have been verified, sourced from: https://github.com/deepseek-ai/DeepSeek-Coder/blob/main/Evaluation/LeetCode/data/20240121-Jul.jsonl
These are all LeetCode problems released after Jan 2024. You can use the branch now if you'd like. I haven't merged it yet because I want to hand-verify the solutions myself.
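If you want to look at the source file before using the branch, here is a minimal sketch for loading it. It assumes you have downloaded the JSONL file locally, and it deliberately makes no assumption about the field names in each record: it only reports which top-level keys are present. The helper names `load_jsonl` and `summarize_fields` are made up for illustration and are not part of MultiPL-E.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize_fields(records):
    """Count how often each top-level key appears across all records."""
    counts = {}
    for record in records:
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts
```

For example, `summarize_fields(load_jsonl("20240121-Jul.jsonl"))` (the local path is an assumption) shows which fields every problem carries, which is a quick sanity check before hand-verifying solutions.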
Let me know if you have other questions.
I'm curious what the citation for the LeetCode dataset is and how the dataset was built.