nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath
9.19k stars, 712 forks

Add problem solving to evol-instruct #33

Open walking-octopus opened 1 year ago

walking-octopus commented 1 year ago

Description:

Evol-Instruct currently performs rather poorly on math, geometry, and physics problem solving. To improve WizardLM's overall reasoning and basic math capabilities, I believe the dataset should include more high-school-level physics, algebra, or geometry problems. GPT-4 does mostly fine on them, so the task seems doable. Since WizardLM is a smaller model, it would be quite interesting to see how far it can get.

This dataset, I found, didn't contain many physics questions in particular, which tracks well with the hallucinated formulas and the inability to reason step by step to find intermediate values.

nlpxucan commented 1 year ago

Thanks for your valuable suggestion. We found that the skills you mentioned improved when fine-tuning the larger LLaMA model (i.e., 13B). We will continue to think about new ideas to improve these skills.

walking-octopus commented 1 year ago

Thank you for the timely response. I'd be interested to see how well the 13B model performs on these questions. I can't test that myself, since I only have 8 GB of RAM and a pretty weak CPU, so I can only play with the model on Gradio or through llama.cpp.

Still, I find it fascinating to see how projects like this push the limits of what's possible with such a low parameter count, drawing the attention even of Google and Microsoft (referring to Google's "we have no moat" memo and Microsoft's TinyStories experiment). I wonder whether any meaningful results on this complex task can be achieved at just 7B, without training a model from scratch.

walking-octopus commented 1 year ago

The newly released WizardLM 13B, whose dataset included more physics questions, has finally started forming coherent reasoning chains, doing basic calculations correctly, rearranging equations, and solving simple problems as well as GPT-3.5, something Guanaco 65B couldn't achieve.

Interestingly, however, WizardLM 30B consistently hallucinates an incorrect reasoning chain, and the errors snowball into an incorrect final answer. Perhaps this can give us some insight into effective scaling and training settings for a given dataset and foundation model.