state-spaces / mamba

Mamba SSM architecture
Apache License 2.0

About max token length #7

Open RevolGMPHL opened 9 months ago

RevolGMPHL commented 9 months ago

What is the max token length that this model can support? Can it support more than 10k?

tridao commented 9 months ago

It was trained with seqlen=2k for an apples-to-apples comparison with Pythia. It seems to extrapolate to around 3k context length, but after that the quality is much worse.

RevolGMPHL commented 9 months ago

If I train on a dataset with longer sequences, will that improve the max token length? Does it depend on the size of the model?

tridao commented 9 months ago

Yes, training on longer context (e.g. 4k or 8k) should help improve the max token length. I think this is a general property of most sequence models (e.g. Transformers should behave similarly).

EricLina commented 9 months ago

How should we understand Table 2 in Mamba's paper, which shows great extrapolation ability? 🤔 As the paper shows, Mamba can be trained at seqlen = 10^3 and tested at seqlen = 10^6 with good performance. 🤔

tridao commented 9 months ago

That extrapolation was for a simple synthetic task (induction head). For language modeling it remains to be seen.

ftgreat commented 8 months ago

Language models based on the Transformer architecture can extrapolate beyond the training context by adjusting the position encoding, which may also require fine-tuning on longer documents. There are also techniques that mitigate the performance degradation during context extrapolation by filtering the KV cache.

I would like to understand the model structure and design of Mamba's S6 module, and whether there are similar techniques suitable for context extrapolation. Thank you.

tridao commented 8 months ago

You can also finetune Mamba on long documents. Regarding "context extrapolation" without fine-tuning, the short answer is ... I don't know. The architecture is new and different from Transformers, and there are still lots of interesting research questions.

ftgreat commented 8 months ago

Thanks very much.

I am currently not familiar with the inner details of the Mamba SSM module. May I ask if there are any parameters whose shapes are tied to a preset context length?

tridao commented 8 months ago

There's no restriction; e.g. you can just pass in sequences of length 8k to finetune.
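
For instance, the standalone Mamba block from the repo README accepts whatever sequence length you feed it; the 8k length below is just illustrative, and a CUDA device is assumed:

```python
import torch
from mamba_ssm import Mamba

batch, seqlen, d_model = 2, 8192, 1024   # 8k-long sequences, no config change needed
x = torch.randn(batch, seqlen, d_model, device="cuda")
block = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2).to("cuda")
y = block(x)                             # (2, 8192, 1024), same shape as the input
```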

sentialx commented 8 months ago

@tridao Does Mamba support passing state between multiple forward passes (or blocks of tokens) during training?

tridao commented 8 months ago

No that's not supported right now.

ftgreat commented 8 months ago

@tridao One more question about dataset processing when pretraining the mamba-2.8b models.

As the GPT-3 paper said, "During training we always train on sequences of the full n_ctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency.".

Did the released Mamba models use the same packing trick for their datasets? Thanks.


tridao commented 8 months ago

Yes, we do exactly the same thing (which is now standard in several libraries): tokenize all documents, append an "eos" token to the end of each document, concatenate all of them, then split into chunks of size 2048.
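
A minimal sketch of that packing scheme (pack_documents and the toy token lists below are illustrative placeholders, not the actual training pipeline):

```python
from itertools import chain

def pack_documents(tokenized_docs, eos_id, chunk_size=2048):
    """Append EOS to each document, concatenate everything, then split
    into fixed-size chunks (the trailing remainder is simply dropped here)."""
    stream = list(chain.from_iterable(doc + [eos_id] for doc in tokenized_docs))
    n_chunks = len(stream) // chunk_size
    return [stream[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]

# Toy usage with already-tokenized documents and chunk_size=4 for readability:
chunks = pack_documents([[5, 17, 9], [42, 7]], eos_id=0, chunk_size=4)
# -> [[5, 17, 9, 0]]  (the leftover tokens [42, 7, 0] don't fill a chunk)
```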

ftgreat commented 8 months ago

@tridao one more question please.

How should the number of layers and the model dimension be set for a roughly 7B Mamba model, and are there design rules for these settings when scaling the model size?

Thanks.

tridao commented 8 months ago

We just follow GPT-3; e.g. for 7B you can use 64 layers (2 Mamba layers have the same number of params as MLP + attn) and d_model = 4096.
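
As a rough sanity check of that recipe, here is a back-of-the-envelope parameter count, assuming the default block settings (expand=2, d_state=16, d_conv=4, dt_rank=ceil(d_model/16)) and a GPT-NeoX-sized vocabulary; the exact numbers from the library may differ slightly:

```python
def approx_mamba_params(d_model, n_layer, vocab_size=50288, d_state=16, d_conv=4):
    """Rough parameter count for a Mamba LM; not an exact accounting."""
    d_inner = 2 * d_model                      # expand = 2
    dt_rank = -(-d_model // 16)                # ceil(d_model / 16)
    per_layer = (
        d_model * 2 * d_inner                  # in_proj
        + d_inner * d_conv + d_inner           # depthwise conv1d (+ bias)
        + d_inner * (dt_rank + 2 * d_state)    # x_proj
        + dt_rank * d_inner + d_inner          # dt_proj (+ bias)
        + d_inner * d_state + d_inner          # A_log and D
        + d_inner * d_model                    # out_proj
        + d_model                              # per-block norm
    )
    embedding = vocab_size * d_model           # tied with the LM head
    return n_layer * per_layer + embedding

print(f"{approx_mamba_params(4096, 64) / 1e9:.2f}B")   # ~= 7B
print(f"{approx_mamba_params(2560, 64) / 1e9:.2f}B")   # ~= 2.8B (cf. mamba-2.8b)
```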

ftgreat commented 8 months ago

> We just follow GPT-3; e.g. for 7B you can use 64 layers (2 Mamba layers have the same number of params as MLP + attn) and d_model = 4096.

Thanks.

ftgreat commented 8 months ago

@tridao Could you release a mamba-1.4b intermediate checkpoint trained on around 100B tokens?

I have trained mamba-1.4b from scratch on a Chinese-English corpus. If a checkpoint at around 100B tokens were provided, I could check the metrics to validate my training process.

Thanks

tridao commented 8 months ago

Unfortunately we only have the fully trained weights.

ftgreat commented 8 months ago

> Unfortunately we only have the fully trained weights.

Thanks for your reply.

ftgreat commented 8 months ago

@tridao When scaling up the max length for language-modeling pretraining from scratch, could you please give us some advice on how to set hyperparameters like lr, warmup, global batch size, etc.?

Thank you.

tridao commented 8 months ago

The paper describes the hyperparameters we used. When increasing the sequence length we decrease the batch size (i.e. keeping the total number of tokens in the batch the same) and keep the other hparams the same. I'm not sure that's optimal, but it's what I've been using.
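
As a small illustration of that rule of thumb (the ~1M-token global budget below is a placeholder, not the paper's exact setting):

```python
tokens_per_batch = 512 * 2048                 # placeholder global token budget
for seqlen in (2048, 4096, 8192):
    batch_size = tokens_per_batch // seqlen   # shrink the batch as seqlen grows
    print(f"seqlen={seqlen:5d} -> global batch size {batch_size}")
# seqlen= 2048 -> global batch size 512
# seqlen= 4096 -> global batch size 256
# seqlen= 8192 -> global batch size 128
```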

sentialx commented 8 months ago

@tridao what would it take to support passing state between forward passes? I can see it's possible to do this via inference_params, where's the catch?

tridao commented 8 months ago

> @tridao what would it take to support passing state between forward passes? I can see it's possible to do this via inference_params, where's the catch?

inference_params supports moving the state forward by 1 step (i.e. recurrence). If you want to pass in states together with chunks of length more than 1, you'd need to change the parallel scan (in selective_scan) to deal with that.
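
For reference, a rough sketch of driving that one-step recurrence through inference_params (assuming the InferenceParams dataclass in mamba_ssm.utils.generation, a CUDA device, and a small pretrained checkpoint; this is the inference-time mechanism, not the training-time state passing asked about):

```python
import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from mamba_ssm.utils.generation import InferenceParams

model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m",
                                         device="cuda", dtype=torch.float16)
prompt = torch.randint(0, 50277, (1, 32), device="cuda")   # dummy prompt tokens

inference_params = InferenceParams(max_seqlen=64, max_batch_size=1)
logits = model(prompt, inference_params=inference_params).logits  # fills conv/ssm states
inference_params.seqlen_offset = prompt.shape[1]

next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
for _ in range(8):                                          # one token per forward pass
    logits = model(next_token, inference_params=inference_params).logits
    inference_params.seqlen_offset += 1
    next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
```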

ftgreat commented 8 months ago

Mamba is a module that can serve as a drop-in replacement in some frameworks.

Megatron-LM is designed only for Transformer blocks. How can we integrate Mamba into it? Could you give some advice? Thanks.

Sorry to bother you both. @tridao @albertfgu

tridao commented 8 months ago

You'd want to replace ParallelTransformerLayer in Megatron-LM with a Mamba layer. That should work if you don't use tensor parallelism / pipeline parallelism in Megatron-LM.
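
As an illustration (not Megatron-compatible code), a pre-norm residual wrapper around the Mamba module from mamba_ssm that plays the role of one Transformer layer might look like this; the class name and the LayerNorm choice are assumptions:

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class MambaLayer(nn.Module):
    """Pre-norm residual block standing in for a single Transformer layer."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seqlen, d_model) -> (batch, seqlen, d_model)
        return hidden_states + self.mixer(self.norm(hidden_states))

layer = MambaLayer(d_model=1024).to("cuda")
x = torch.randn(2, 2048, 1024, device="cuda")
y = layer(x)   # same shape in and out, so it can slot in where a Transformer layer sat
```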

ftgreat commented 8 months ago

> You'd want to replace ParallelTransformerLayer in Megatron-LM with a Mamba layer. That should work if you don't use tensor parallelism / pipeline parallelism in Megatron-LM.

Thanks a lot. Without tensor parallelism / pipeline parallelism, there's no need to use Megatron-LM for scaling the model size.

ftgreat commented 7 months ago

@tridao If causal_conv1d_fn is not available, how does the normal conv1d behave causally? Thanks

https://github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba_simple.py#L168

tridao commented 7 months ago

As the code shows, it constructs nn.Conv1d with padding=3 (when the conv has width 4), does the convolution, then removes the last 3 elements of the output.
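
A minimal sketch of that fallback (standalone, not the repo code) showing why the padding plus truncation makes the convolution causal:

```python
import torch
import torch.nn as nn

d_model, width, seqlen = 8, 4, 16
# Depthwise conv; padding=width-1 adds 3 zeros on both ends of the sequence.
conv = nn.Conv1d(d_model, d_model, kernel_size=width, padding=width - 1, groups=d_model)

x = torch.randn(1, d_model, seqlen)   # (batch, channels, seqlen)
y = conv(x)[..., :seqlen]             # drop the last width-1 outputs
# Each y[..., t] now depends only on x[..., max(0, t-3):t+1] -- no future leakage.
```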