hendrycks / apps

APPS: Automated Programming Progress Standard (NeurIPS 2021)
MIT License

Can this dataset be used to test ChatGPT (GPT-3.5)? #28

Closed syccxdr closed 1 year ago

syccxdr commented 1 year ago

hi there! I am currently doing my thesis on ChatGPT. The main aim is to evaluate the programming ability of ChatGPT, and I wonder if this dataset can be used in a conversational form, like giving a prompt to GPT and having it return code to evaluate.

xksteven commented 1 year ago

It'll take a little bit of formatting, but it can be done. The main concern would be data leakage: has ChatGPT been trained on any of the examples we use in our training or eval set?

syccxdr commented 1 year ago

thanks, but what does formatting mean? I don't exactly know whether this dataset is given in text form or otherwise.

syccxdr commented 1 year ago

thank you!

xksteven commented 1 year ago

Please see here for the formatting: https://huggingface.co/datasets/codeparrot/apps
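For reference, a minimal sketch of loading that mirror with the Hugging Face `datasets` library. The field names (`question`, `solutions`, `input_output`, ...) are taken from the codeparrot/apps dataset card, so double-check them there; depending on your `datasets` version you may also need to pass `trust_remote_code=True`.

```python
# Minimal sketch: load the codeparrot/apps mirror and inspect one problem.
# Assumption: field names follow the codeparrot/apps dataset card
# (problem_id, question, solutions, input_output, difficulty, ...).
from datasets import load_dataset

apps_test = load_dataset("codeparrot/apps", split="test")

sample = apps_test[0]
print(list(sample.keys()))        # inspect the available fields
print(sample["question"][:500])   # natural-language problem statement
```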

syccxdr commented 1 year ago

I really appreciate it! I downloaded the 1.3GB dataset linked in the README.md and found that you provide each problem separately as text. Thank you for your efforts. My question is whether the code generated by ChatGPT can be tested against the test samples in this dataset to produce test results, that is, whether the original solutions can be replaced.

xksteven commented 1 year ago

Here's how we normally test using the dataset.

Question text -> model -> generated code. We then evaluate the generated code by feeding inputs into the program and checking whether the program's outputs match the expected outputs.
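To make that concrete, here is a rough sketch of the checking step for a stand-alone (stdin/stdout) Python solution. This is not the repo's own evaluation harness; the file path, the timeout value, and the whitespace-stripping comparison are assumptions.

```python
# Sketch: run a generated Python program, feed one test input on stdin,
# and compare its stdout to the expected output.
import subprocess

def run_one_test(solution_path: str, test_input: str, expected_output: str,
                 timeout: float = 4.0) -> bool:
    """Return True if the program runs and its stdout matches the expected output."""
    try:
        result = subprocess.run(
            ["python", solution_path],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_output.strip()
```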

For training we also have code written by humans that we use to update the model.

My understanding of what you're trying to do is: Question text -> ChatGPT -> text + code

You'll need to extract the code from the output. Afterwards, run the code to see if it compiles, then pass in the inputs and compare the outputs to the correct outputs to gauge whether it actually solved the problem.
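A possible sketch of those two steps, extracting the code and syntax-checking it before running any tests. The regex and the "take the first fenced block" choice are assumptions about how the ChatGPT reply is formatted, not part of this repo.

```python
import re

# Three backticks, built up here so the pattern can be shown without
# breaking the surrounding formatting.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)

def extract_code(chat_reply: str) -> str:
    """Return the first fenced code block, or the whole reply if none is found."""
    blocks = CODE_BLOCK.findall(chat_reply)
    return blocks[0] if blocks else chat_reply

def compiles(code: str) -> bool:
    """Syntax-check the extracted code without executing it."""
    try:
        compile(code, "<generated>", "exec")
        return True
    except SyntaxError:
        return False
```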

Let me know if I'm misunderstanding your aim.

syccxdr commented 1 year ago

I see, your description is correct. So do you check the generated code manually (by humans)? What I am trying to do is: Question text -> ChatGPT -> text + code, and I want to know whether the code can be evaluated by some automatic scripts in your dataset; otherwise I have to check it myself.

syccxdr commented 1 year ago

btw, the paper proposes two metrics: Test Case Average and Strict Accuracy. I was a little bit confused about their definitions.

xksteven commented 1 year ago

"do you check the generated code manually (by humans)?"

No, we do not manually check. If you look at the input_output field, you'll see that it encodes both the inputs to the programs and the outputs the programs should produce.
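As a sketch, in the codeparrot/apps mirror the input_output field is a JSON string with "inputs" and "outputs" lists (call-based problems also carry an "fn_name" key); verify this against the dataset card, and note that some samples may leave the field empty.

```python
import json

sample = apps_test[0]          # from the loading sketch earlier
io_str = sample["input_output"]
if io_str:                     # guard: some samples have no tests recorded
    io = json.loads(io_str)
    for test_in, expected_out in zip(io["inputs"], io["outputs"]):
        print("input:", test_in)
        print("expected output:", expected_out)
```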

Test case average

Each problem has N tests. We count how many test cases the generated code passes for each problem, then average that fraction over all problems.

Strict Accuracy

A generated program is only considered correct if it passes all of its tests. Strict Accuracy is the fraction of problems the model gets completely correct.
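A small sketch of the two metrics as defined above. The `results` structure (problem id mapped to a list of per-test booleans) is illustrative, not the repo's own format.

```python
def test_case_average(results: dict[int, list[bool]]) -> float:
    """Mean, over problems, of the fraction of test cases passed."""
    per_problem = [sum(tests) / len(tests) for tests in results.values() if tests]
    return sum(per_problem) / len(per_problem)

def strict_accuracy(results: dict[int, list[bool]]) -> float:
    """Fraction of problems where every single test case passes."""
    solved = [all(tests) for tests in results.values() if tests]
    return sum(solved) / len(solved)

# Example: problem 0 passes 2/3 tests, problem 1 passes all of its tests.
results = {0: [True, True, False], 1: [True, True]}
print(test_case_average(results))  # (2/3 + 1) / 2 ≈ 0.83
print(strict_accuracy(results))    # 1 / 2 = 0.5
```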

syccxdr commented 1 year ago

thank you!