ParadoxZW opened 1 year ago
What is the `max_seq_len` (or `max_position_embeddings`) of Mistral-7B-v0.1 when training? The official code says it is 128_000 (https://github.com/mistralai/mistral-src/blob/147c4e68279b90eb61b19bdea44e16f5539d5a5d/mistral/model.py#L201C69-L201C69).
The config file on Hugging Face says it is 32768 (https://huggingface.co/mistralai/Mistral-7B-v0.1/blob/main/config.json).
And the official blog mentions 16k.
And the paper claims an attention span of 131K tokens (Section 2 on "Architectural details" → "Sliding Window Attention").
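For reference, the value the released checkpoint actually advertises can be read straight from the Hub config. A minimal sketch, assuming the `transformers` library and the public `mistralai/Mistral-7B-v0.1` repo id:

```python
from transformers import AutoConfig

# Pull the published config.json for the checkpoint from the Hugging Face Hub.
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

# Maximum positions the config advertises (32768 in the published config.json).
print(config.max_position_embeddings)

# Sliding-window attention size (4096), which bounds each layer's attention span.
print(config.sliding_window)
```

But printing these fields doesn't settle which sequence length was actually used during training, which is what I'm asking.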