Pretrained models are supporting larger and larger sequence lengths. In Gemma's case this is a particularly nasty gotcha, as very few people have the compute resources to actually train on 8000-token-long sequences.

I think the more user-friendly approach might be to lower our default sequence length to 1024. This won't prevent users from setting a longer sequence length, but it will lessen the unpleasant gotcha of suddenly using a ton of VRAM when training.

We should still almost always show setting the sequence length in our code examples, as it's something the user should generally think about when fine-tuning or generating.
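For example, a minimal sketch of what that could look like in an example, assuming a KerasNLP-style Gemma workflow (the preset name and exact classes here are illustrative, not prescriptive):

```python
import keras_nlp

# Explicitly cap the sequence length instead of relying on the model's full
# context window, to keep VRAM usage manageable during fine-tuning.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en",         # preset name used for illustration
    sequence_length=1024,  # pad/truncate inputs to 1024 tokens
)
model = keras_nlp.models.GemmaCausalLM.from_preset(
    "gemma_2b_en",
    preprocessor=preprocessor,
)
```

Keeping that one argument visible in every example makes the memory/length trade-off an explicit user decision rather than a surprise.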