nickrosh / evol-teacher

Open Source WizardCoder Dataset
Apache License 2.0

Excuse me, how is your reproduction of StarCoder/WizardCoder going? Can this data reach the HumanEval score of 57 claimed in the paper? #1

ahong007007 opened 1 year ago

ahong007007 commented 1 year ago

Excuse me, how is your reproduction of StarCoder/WizardCoder going? Can this data reach the HumanEval score of 57 claimed in the paper?

Thank you very much!

nickrosh commented 1 year ago

One clarification: this is a recreation of the dataset used to train WizardCoder from StarCoder. The foundation model for WizardCoder, StarCoder, is 15B; I only had the resources to fine-tune Replit's 2.7B code model. ReplitLM has a HumanEval score of 21, and I was able to get it to 31 with the uncleaned version of this dataset. StarCoder out of the box already scores 34.

There is also the question of data quality. I did the evolution process, but there is no information on evolution pruning in the WizardCoder paper, only in the WizardLM paper. So even if I train StarCoder on this initial unpruned dataset, it may not hit 57.
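For concreteness, here is a rough sketch of what one evolution step looks like. The prompt wording below is illustrative rather than this repo's exact prompt, and it assumes the pre-1.0 `openai` Python client with `gpt-3.5-turbo`:

```python
# Rough sketch of one Evol-Instruct step. Assumes the pre-1.0 openai
# client and gpt-3.5-turbo; the prompt wording is illustrative, not
# the exact evolution prompt used in this repo.
import openai

EVOLVE_TEMPLATE = (
    "Please increase the difficulty of the given programming test question.\n"
    "You can, for example, add new constraints, require a less common "
    "algorithm, or propose higher time/space complexity requirements.\n\n"
    "#Given Question#:\n{instruction}\n\n"
    "#Rewritten Question#:"
)

def evolve_instruction(instruction: str) -> str:
    """Ask the model to rewrite an instruction into a harder variant."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": EVOLVE_TEMPLATE.format(instruction=instruction)}],
        temperature=1.0,
    )
    return response["choices"][0]["message"]["content"].strip()
```

Pruning would mean discarding evolutions that come out degenerate or unanswerable; that filtering step is what the WizardCoder paper leaves unspecified.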

Symbolk commented 1 year ago

> One clarification: this is a recreation of the dataset used to train WizardCoder from StarCoder. The foundation model for WizardCoder, StarCoder, is 15B; I only had the resources to fine-tune Replit's 2.7B code model. ReplitLM has a HumanEval score of 21, and I was able to get it to 31 with the uncleaned version of this dataset. StarCoder out of the box already scores 34.
>
> There is also the question of data quality. I did the evolution process, but there is no information on evolution pruning in the WizardCoder paper, only in the WizardLM paper. So even if I train StarCoder on this initial unpruned dataset, it may not hit 57.

Could the difference come from the drastic gap between GPT-3.5 and GPT-4? Given that researchers inside Microsoft should have had GPT-4 API access via Azure back in February or March of this year, I see no reason they would use GPT-3.5 instead of GPT-4.

FYI: as the just-leaked GPT-4 details show (threadreaderapp.com/thread/1678545170508267522.html), GPT-4 was trained for 2 epochs on text data and 4 epochs on code data!

nickrosh commented 1 year ago

The authors did not specify whether they used GPT-3.5 or GPT-4. It cost me around $100 using gpt-3.5-turbo. If one were to replicate my process with GPT-4, it would cost around $3,000. Including the LLM information-gain check that they did, which I left out, it would be closer to $4,500.
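As a back-of-envelope check on those figures (the per-call token counts and mid-2023 per-token prices below are assumptions, not measurements):

```python
# Back-of-envelope cost check, assuming mid-2023 pricing:
# gpt-3.5-turbo ~$0.002 per 1K tokens; gpt-4 $0.03/1K input tokens
# and $0.06/1K output tokens. Per-call token counts are guesses.
N_CALLS = 78_000                 # roughly one evolution call per sample
IN_TOK, OUT_TOK = 400, 500       # assumed average tokens per call

gpt35 = N_CALLS * (IN_TOK + OUT_TOK) / 1000 * 0.002
gpt4 = N_CALLS * (IN_TOK / 1000 * 0.03 + OUT_TOK / 1000 * 0.06)

print(f"gpt-3.5-turbo: ~${gpt35:,.0f}")  # ~$140
print(f"gpt-4:         ~${gpt4:,.0f}")   # ~$3,276
```

Both numbers land in the same ballpark as the estimates above, with GPT-4 roughly 20-30x the per-token price.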

I could see a performance uplift from using GPT-4. If someone wants to subsidize me, I would be happy to make the dataset!

ahong007007 commented 1 year ago

Please tell me, do you have a GPT-4 account or API access? What results can data generated by GPT-4 achieve now? I think we could work together.

nickrosh commented 1 year ago

I will have access to the GPT-4 API in a couple weeks.

epinnock commented 1 year ago

Hi @nickrosh, thanks for putting this dataset together. A quick question: what is the difference between the 8k dataset and the full 80k? Is it just the count, or has the 8k been filtered in some way? Also, do you have any ongoing work to filter this dataset? I was looking to augment it by having GPT-3.5 assess the quality of each sample and give it a rating; this seems to work fairly well with the 16k GPT-3.5 model (see the sketch below). I wanted to check with you first before proceeding on a larger scale.
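For what it's worth, here is a minimal sketch of the rating pass I have in mind. The rubric prompt, the 1-10 scale, the record field names, and the filename are all illustrative assumptions, and it assumes the pre-1.0 `openai` client:

```python
# Sketch of a GPT-3.5 quality-rating pass over the dataset, assuming
# the pre-1.0 openai client and the 16k-context model mentioned above.
# The prompt, 1-10 scale, field names ("instruction"/"response"), and
# filename are all hypothetical.
import json
import openai

RATE_PROMPT = (
    "Rate the following instruction/response pair from a code-instruction "
    "dataset on a 1-10 scale for clarity, correctness, and difficulty. "
    "Reply with only the integer.\n\n"
    "Instruction:\n{instruction}\n\nResponse:\n{response}"
)

def rate_sample(sample: dict) -> int:
    """Have gpt-3.5-turbo-16k score one dataset record."""
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        messages=[{"role": "user", "content": RATE_PROMPT.format(
            instruction=sample["instruction"],
            response=sample["response"])}],
        temperature=0,
    )
    return int(reply["choices"][0]["message"]["content"].strip())

# Keep only samples at or above an arbitrary cutoff.
with open("evol_dataset.json") as f:  # hypothetical filename
    data = json.load(f)
kept = [s for s in data if rate_sample(s) >= 7]
```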