Open RevolGMPHL opened 9 months ago
It was trained with seqlen=2k for an apples-to-apples comparison with Pythia. It seems to extrapolate to around 3k context length, but beyond that the quality is much worse.
If I train on a longer sequence training set, will it improve max token length? Does it have anything to do with the size of the model?
Yes, training on longer context (e.g. 4k or 8k) should help improve max token length. I think this is a general property of most sequence models (e.g. Transformers should be similar).
How should I understand Table 2 in Mamba's paper, which shows great extrapolation ability?🤔 As your paper shows, Mamba can train at seqlen = 10^3 and test at seqlen = 10^6 with good performance.🤔
That extrapolation was for a simple synthetic task (induction head). For language modeling it remains to be seen.
Language models based on the Transformer architecture can extrapolate beyond the training context by adjusting the position encoding, which may also require fine-tuning on longer documents. There are also techniques that mitigate the performance degradation during context extrapolation by filtering the KV cache.
I would like to understand the model structure and design of the Mamba S6, and whether there are similar technical solutions suitable for context extrapolation. Thank you.
You can also finetune Mamba on long documents. Regarding "context extrapolation" without fine-tuning, the short answer is ... I don't know. The architecture is new and different from the Transformer, and there are still lots of interesting research questions.
Thanks very much.
I am currently not familiar with the inner details of the Mamba SSM module. May I ask if there are parameters whose shapes are tied to a preset context length?
There's no restriction, e.g. you can just pass in a sequence of length 8k to finetune.
@tridao Does Mamba support passing state between multiple forward passes (or blocks of tokens) during training?
No that's not supported right now.
@tridao one more question about dataset processing during pretraining of the mamba-2.8b models.
As gpt3 paper said, "During training we always train on sequences of the full nctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency.".
Did the released Mamba models use the same packing trick for datasets? Thanks.
Yes, we do exactly the same thing (which is now standard in several libraries): tokenize all documents, append an "eos" token to the end of each document, concatenate all of them, then split into chunks of size 2048.
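The packing recipe above can be sketched in a few lines. This is a minimal illustration, not the actual training code; the `tokenize` callable and `eos_id` are placeholders for whatever tokenizer you use.

```python
# Minimal sketch of document packing for causal LM pretraining:
# tokenize, append eos per document, concatenate, split into fixed chunks.
def pack_documents(docs, tokenize, eos_id, chunk_len=2048):
    stream = []
    for doc in docs:
        stream.extend(tokenize(doc))
        stream.append(eos_id)  # mark the document boundary
    # split the concatenated stream into fixed-size training chunks,
    # dropping the final partial chunk
    return [stream[i:i + chunk_len]
            for i in range(0, len(stream) - chunk_len + 1, chunk_len)]
```

Because documents are concatenated, a single chunk may span a document boundary; the eos token is what tells the model where one document ends and the next begins.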
@tridao one more question please.
How do you set the number of layers and model dim for a roughly 7B Mamba model, and are there design rules for scaling model size?
Thanks.
We just follow GPT-3, e.g. for 7B you can use 64 layers (2 Mamba layers have the same number of params as MLP + attn) and d_model = 4096.
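The "2 Mamba layers ≈ one MLP + attn block" rule of thumb can be checked with a rough parameter count. This sketch assumes the default expand factor of 2 and ignores the small SSM/conv parameters, so the numbers are approximate.

```python
# Rough parameter-count sketch (expand factor 2, ignoring small SSM terms).
def mamba_layer_params(d_model, expand=2):
    d_inner = expand * d_model
    in_proj = d_model * 2 * d_inner    # projects to the x and z branches
    out_proj = d_inner * d_model
    return in_proj + out_proj          # ~6 * d_model**2 for expand=2

def transformer_block_params(d_model):
    attn = 4 * d_model * d_model       # q, k, v, o projections
    mlp = 2 * d_model * 4 * d_model    # up and down projections (4x hidden)
    return attn + mlp                  # 12 * d_model**2

d = 4096
print(2 * mamba_layer_params(d))       # two Mamba layers
print(transformer_block_params(d))     # one attn + MLP block: same count
```

With d_model = 4096 both come out to ~2e8 params per block pair, so 64 Mamba layers land near the GPT-3 6.7B configuration (32 transformer blocks).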
Thanks.
@tridao could you release a mamba-1.4b intermediate checkpoint trained to around 100B tokens?
I have trained mamba-1.4b from scratch on a zh-en corpus. If a checkpoint at around 100B tokens is provided, I can check the metrics to validate my process.
Thanks
Unfortunately we only have the fully trained weights.
Thanks for your reply.
@tridao When scaling up max length for language modeling pretraining from scratch,
could you please give us some advice on setting hyperparameters like lr, warmup, global batch size, etc.?
Thank you.
The paper describes the hyperparameters we used. When increasing sequence length we decrease batch size (i.e. keeping the total number of tokens in the batch the same) and keep other hparams the same. I'm not sure that's optimal, but it's what I've been using.
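The constant-token-budget rule above is a one-liner; this small helper (an illustration, not from the repo) makes the arithmetic explicit.

```python
# Keep tokens-per-batch constant when increasing sequence length (a sketch).
def scale_batch_size(base_batch, base_seqlen, new_seqlen):
    tokens_per_batch = base_batch * base_seqlen
    assert tokens_per_batch % new_seqlen == 0, "pick a seqlen that divides the token budget"
    return tokens_per_batch // new_seqlen

# e.g. 256 sequences of 2k tokens -> 64 sequences of 8k tokens
print(scale_batch_size(256, 2048, 8192))
```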
@tridao what would it take to support passing state between forward passes? I can see it's possible to do this via inference_params, where's the catch?
inference_params supports moving the state forward by 1 step (i.e. recurrence). If you want to pass states across chunks longer than 1 token, you'd need to change the parallel scan (in selective_scan) to handle that.
Mamba can be used as a drop-in replacement module in some frameworks.
Megatron-LM is designed only for Transformer blocks. How can we integrate Mamba into it? Could you give some advice? Thanks.
Sorry to bother you both. @tridao @albertfgu
In Megatron-LM you'd want to replace `ParallelTransformerLayer` with a Mamba layer. Should work if you don't use TensorParallel / Pipeline Parallel in Megatron-LM.
Thanks a lot. Without TensorParallel / Pipeline Parallel, there's no need to use Megatron-LM for model-size scaling anyway.
@tridao If there is no causal_conv1d_fn, how does the normal conv1d behave causally? Thanks
https://github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba_simple.py#L168
As the code shows, it constructs nn.Conv1d with padding=3 (for a conv of width 4), does the convolution, then removes the last 3 elements.
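The pad-then-trim trick is equivalent to left-padding with k-1 zeros. Here is a small numpy check (sliding-window dot products stand in for nn.Conv1d, which also computes cross-correlation); both functions are illustrative, not from the repo.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal cross-correlation: left-pad with k-1 zeros, so the output
    at time t depends only on x[0..t]."""
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])
    return np.array([xp[t:t + k] @ w for t in range(len(x))])

def conv_then_trim(x, w):
    """The trick in the code: pad both sides with k-1 (padding=3 for k=4),
    convolve, then drop the last k-1 outputs."""
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x, np.zeros(k - 1)])
    full = np.array([xp[t:t + k] @ w for t in range(len(x) + k - 1)])
    return full[:len(x)]  # removing the last k-1 elements leaves len(x)

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 0.0, 0.0, 1.0])
print(np.allclose(causal_conv1d(x, w), conv_then_trim(x, w)))  # True
```

Symmetric padding produces k-1 extra outputs at the end that peek at right-side zeros; trimming them leaves exactly the causal outputs.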
What is the max token length that this model can support? Can it support more than 10k?