jquesnelle / yarn

YaRN: Efficient Context Window Extension of Large Language Models
MIT License

Testing yarn on practical tasks. #12

Closed ChenxinAn-fdu closed 10 months ago

ChenxinAn-fdu commented 10 months ago

Hello, this is Chenxin.

I am so excited to see the first open-source model with more than 100k context! This is undoubtedly very significant progress by the open-source community on LCLMs. I've noticed that the current version of YaRN only has PPL (perplexity) experiments, which do not always correlate with practical long-context understanding tasks. I would be glad 😁 to help test llama2-yarn-128k on LEval, but I do not have the resources to do SFT on top of llama2-yarn-128k. Would you mind providing an instruction-following version?

Thanks again for the great work!

bloc97 commented 10 months ago

Retrieval tasks are very solid across the entire context. We ran out of compute for the evals at the end, and are currently investigating how to SFT and instruction-tune a 128k-context model... There aren't any 128k instruction-task datasets...

LEval is also only ~20k tokens, if I'm not mistaken. It would not test 128k capabilities at all, but it would be a good start!
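For reference, the retrieval evals mentioned above are typically passkey-retrieval tests: a random key is hidden at some depth inside filler text and the model is asked to recall it. A minimal sketch of such a harness (hypothetical helper names, not code from this repo) might look like:

```python
import random

def make_passkey_prompt(context_len_words=2000, seed=0):
    """Build a passkey-retrieval prompt: a random 5-digit key hidden
    inside repetitive filler text at a random depth in the context.
    (Illustrative sketch only; real evals vary filler and depth.)"""
    rng = random.Random(seed)
    passkey = rng.randint(10000, 99999)
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    needle = f"The pass key is {passkey}. Remember it. "
    n_fill = context_len_words // 10   # rough sentence count for the target length
    insert_at = rng.randint(0, n_fill)  # random depth of the needle
    body = filler * insert_at + needle + filler * (n_fill - insert_at)
    return body + "What is the pass key? The pass key is", passkey

def check_answer(model_output: str, passkey: int) -> bool:
    """Score the model: did the generated text contain the key?"""
    return str(passkey) in model_output

prompt, key = make_passkey_prompt(seed=42)
print(check_answer(f" {key}", key))  # a correct completion scores True
```

Sweeping `context_len_words` and the needle depth over a grid is what produces the "solid across the entire context" claim above.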

ChenxinAn-fdu commented 10 months ago

Thank you for the reply. Testing on LEval may show that it improves over the baseline Llama2-chat version. However, I completely agree that directly comparing 100k models with 32k models on tasks of approximately 20k tokens is unfair. If you plan to test YaRN on LEval in the future, I'd be pleased to assist. Keep up the good work!

By the way, Claude-100k performs quite well on tasks with various token lengths, and it might have multiple versions available, such as 4k, 16k, 64k, and 100k 🤔 ?

PeiqinSun commented 6 months ago

Hi @bloc97. In your response, you conclude that SFT for Yarn-128k must be done on an instruction dataset where every sample is 128k tokens long. But in the paper, the authors show that the 128k model can be trained on 64k-length data. Is SFT any different in this respect?