openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

Any plans on 8192 context version? #72

Open imoneoi opened 1 year ago

imoneoi commented 1 year ago

StarCoderPlus was trained from StarCoder on the RefinedWeb dataset, but with a longer context length. Are there plans to release a version of OpenLLaMA with a longer context length, such as 8192?

Green-Sky commented 1 year ago

A simple fine-tune (LoRA is enough) with a stretched RoPE would suffice. See e.g. https://github.com/ggerganov/llama.cpp/discussions/1965
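
For reference, "stretched RoPE" here means rescaling the positions fed into the rotary embedding so that a longer sequence maps back into the range the model was pretrained on (linear position interpolation). A minimal `jax.numpy` sketch, with illustrative names and a hypothetical `scale` parameter rather than anything from the OpenLLaMA code:

```python
import jax.numpy as jnp

def rope_frequencies(head_dim: int, base: float = 10000.0) -> jnp.ndarray:
    """Inverse frequencies used by rotary position embeddings."""
    return 1.0 / (base ** (jnp.arange(0, head_dim, 2) / head_dim))

def rope_angles(positions: jnp.ndarray, head_dim: int, scale: float = 1.0) -> jnp.ndarray:
    """Rotation angles for each (position, frequency) pair.

    scale > 1.0 is linear position interpolation ("stretched RoPE"):
    e.g. scale=4.0 maps positions 0..8191 into the 0..2047 range that a
    2048-context model was pretrained on.
    """
    inv_freq = rope_frequencies(head_dim)
    return jnp.outer(positions / scale, inv_freq)

def apply_rope(x: jnp.ndarray, positions: jnp.ndarray, scale: float = 1.0) -> jnp.ndarray:
    """Rotate query/key features x of shape [seq_len, head_dim]."""
    angles = rope_angles(positions, x.shape[-1], scale)
    cos, sin = jnp.cos(angles), jnp.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = jnp.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return rotated.reshape(x.shape)
```

With `scale=1.0` this is ordinary RoPE; stretching is just picking `scale > 1.0` and then doing the brief (LoRA) fine-tune described above.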

imoneoi commented 1 year ago

@Green-Sky We observed that fine-tuning may still cause performance degradation. It would be better to have a model natively pretrained at 8192.

Green-Sky commented 1 year ago

Sounds like you are not using RoPE scaling. Some RoPE scaling variants can get away without fine-tuning.
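
One family of variants that can work without fine-tuning is NTK-aware scaling, which enlarges the RoPE base instead of compressing positions; whether that is the variant meant here is an assumption. A minimal sketch, same conventions as the snippet above:

```python
import jax.numpy as jnp

def ntk_scaled_frequencies(head_dim: int, scale: float, base: float = 10000.0) -> jnp.ndarray:
    """NTK-aware RoPE variant (sketch): enlarge the base so the lowest
    frequencies stretch to the longer context while the highest
    frequencies stay almost unchanged.  With scale=4.0, a model
    pretrained at 2048 tokens is extended toward 8192.
    """
    adjusted_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / (adjusted_base ** (jnp.arange(0, head_dim, 2) / head_dim))
```

These frequencies would simply replace `rope_frequencies` in the earlier sketch, leaving the positions themselves unscaled.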

syzymon commented 1 year ago

You can try LongLLaMA, which is a long-context (8192 and beyond) fine-tune of OpenLLaMA: https://github.com/CStanKonrad/long_llama and https://huggingface.co/syzymon/long_llama_3b

It uses a different method than PI (position interpolation); see https://arxiv.org/abs/2307.03170 for details. There is no degradation on short contexts compared to the original 3B checkpoint, and we are working to release larger models soon.

imoneoi commented 1 year ago

Thanks! How does it compare to base models natively pretrained with long context, such as StarCoder (8192)?

BTW, if we want an 8192-context version of OpenLLaMA, maybe we need a JAX FlashAttention kernel like this?
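
FlashAttention proper is a fused kernel, but the memory saving it targets can be sketched in plain JAX by streaming over key/value chunks with an online softmax. The sketch below is illustrative only (the function name, chunk size, and the single-head, non-causal setup are assumptions), not the kernel being referred to:

```python
import jax
import jax.numpy as jnp

def chunked_attention(q, k, v, chunk_size=1024):
    """Single-head attention computed over key/value chunks.

    q: [Tq, d], k: [Tk, d], v: [Tk, d].  Equivalent to
    softmax(q @ k.T / sqrt(d)) @ v, but never materialises the full
    [Tq, Tk] score matrix: a running max and running normaliser give a
    numerically stable online softmax, the same idea FlashAttention
    fuses into a single kernel.  Assumes Tk is divisible by chunk_size
    for brevity.
    """
    d = q.shape[-1]
    q = q / jnp.sqrt(d)
    num_chunks = k.shape[0] // chunk_size
    k_chunks = k.reshape(num_chunks, chunk_size, d)
    v_chunks = v.reshape(num_chunks, chunk_size, d)

    def step(carry, kv):
        m, l, acc = carry                 # running max, normaliser, weighted sum
        k_c, v_c = kv
        s = q @ k_c.T                     # [Tq, chunk_size] scores for this chunk
        m_new = jnp.maximum(m, s.max(axis=-1, keepdims=True))
        correction = jnp.exp(m - m_new)   # rescale previous accumulators
        p = jnp.exp(s - m_new)
        l_new = l * correction + p.sum(axis=-1, keepdims=True)
        acc_new = acc * correction + p @ v_c
        return (m_new, l_new, acc_new), None

    init = (
        jnp.full((q.shape[0], 1), -jnp.inf),
        jnp.zeros((q.shape[0], 1)),
        jnp.zeros_like(q),
    )
    (m, l, acc), _ = jax.lax.scan(step, init, (k_chunks, v_chunks))
    return acc / l
```

A real kernel would additionally fuse these steps, handle causal masking and multiple heads, and tile the query dimension, but the memory behaviour is the point: only one `[Tq, chunk_size]` score block is live at a time.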