nlpxucan / WizardLM

LLMs built upon Evol Instruct: WizardLM, WizardCoder, WizardMath

Experiment with using a RepSet of 196k for EvolInstruct 1k #124

Open walking-octopus opened 1 year ago

walking-octopus commented 1 year ago

The new WizardLM 13B v1.1 was fine-tuned on a 1k instruction dataset, similar to the approach in the LIMA paper.

I wonder if making the 1k dataset more representative of the initial 100k distribution can boost performance on some tasks.

Google had an interesting paper, Extracting representative subset from extensive text data for training pre-trained language models, in which they applied subset selection to the Colossal Clean Crawled Corpus (C4) to test whether LLMs pre-trained on fewer tokens could still perform well, and it did improve performance.

Perhaps this can be of use for diverse instruction alignment too?
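For concreteness, here is a rough sketch of one way a distribution-representative 1k subset could be picked from the full instruction pool. This is just an embed-and-cluster heuristic, not the algorithm from the Google paper or anything WizardLM actually uses; the embedding model, cluster count, and function name are illustrative assumptions:

```python
# Minimal sketch: pick a subset that roughly mirrors the full distribution
# by clustering instruction embeddings and keeping the sample closest to
# each cluster centroid. Model name and k are illustrative, not from WizardLM.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def representative_subset(instructions, k=1000, seed=0):
    """Embed every instruction, cluster into k groups, and return the
    one sample nearest to each centroid as the representative subset."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder choice
    embeddings = model.encode(instructions, normalize_embeddings=True)

    kmeans = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(embeddings)

    subset_indices = []
    for c in range(k):
        members = np.where(kmeans.labels_ == c)[0]
        if len(members) == 0:
            continue
        # Distance of each cluster member to its centroid
        dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        subset_indices.append(members[np.argmin(dists)])
    return [instructions[i] for i in subset_indices]
```

Something facility-location or gradient-based (as in the paper) would likely be a better fit, but even this kind of clustering pass would spread the 1k picks across the modes of the 196k distribution instead of sampling them uniformly.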

ChiYeungLaw commented 1 year ago

Thank you for your suggestions. We will read this paper.