state-spaces / mamba

Mamba SSM architecture

I guess mamba.step could be deleted if selective_scan_fn can accept ssm_state as an input param. #233

Open agiwave opened 5 months ago

agiwave commented 5 months ago

Maybe there are some benefits:

1. The code could be simpler.
2. Inference could be faster.
3. Inference could accept multiple tokens at once this way.

There is some reference code here: https://github.com/agiwave/Mamba/blob/main/Mamba.py
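To make the suggestion concrete, below is a minimal, non-fused sketch of a selective scan that takes an optional ssm_state and returns the updated state, so the same call covers both multi-token prefill and single-token decoding. The function name, signature, and shapes are illustrative assumptions, not the actual selective_scan_fn API in mamba_ssm.

```python
import torch

def selective_scan_with_state(x, delta, A, B, C, D, ssm_state=None):
    # Reference (non-fused) selective scan that also accepts an initial state.
    # Illustrative shapes (assumed, not the real kernel's layout):
    #   x, delta:  (batch, d_inner, seqlen)
    #   A:         (d_inner, d_state)
    #   B, C:      (batch, d_state, seqlen)
    #   D:         (d_inner,)
    #   ssm_state: (batch, d_inner, d_state) or None
    batch, d_inner, seqlen = x.shape
    d_state = A.shape[1]
    if ssm_state is None:
        ssm_state = x.new_zeros(batch, d_inner, d_state)

    ys = []
    for t in range(seqlen):
        dt = delta[:, :, t]                                    # (batch, d_inner)
        dA = torch.exp(dt.unsqueeze(-1) * A)                   # discretized A
        dBx = dt.unsqueeze(-1) * B[:, :, t].unsqueeze(1) * x[:, :, t].unsqueeze(-1)
        ssm_state = ssm_state * dA + dBx                       # h_t = A_bar * h_{t-1} + B_bar * x_t
        y = (ssm_state * C[:, :, t].unsqueeze(1)).sum(dim=-1)  # y_t = C * h_t
        ys.append(y + D * x[:, :, t])                          # skip connection
    return torch.stack(ys, dim=-1), ssm_state                  # outputs and final state
```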

tridao commented 5 months ago

Yep, in some sense step is a specialized version of forward that accepts ssm_state and only moves by 1 step.
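Continuing the sketch above (again, names and shapes are hypothetical), step then reduces to the seqlen = 1 case of the same call, with the cached state threaded through:

```python
import torch

batch, d_inner, d_state = 2, 64, 16
A = -torch.rand(d_inner, d_state)   # toy parameters standing in for the
D = torch.ones(d_inner)             # real projections inside the block

def toy_inputs(seqlen):
    return (torch.randn(batch, d_inner, seqlen),   # x
            torch.rand(batch, d_inner, seqlen),    # delta
            torch.randn(batch, d_state, seqlen),   # B
            torch.randn(batch, d_state, seqlen))   # C

# Prefill: scan the whole prompt and keep the final state.
x, delta, B, C = toy_inputs(seqlen=8)
y_prompt, ssm_state = selective_scan_with_state(x, delta, A, B, C, D)

# Decode: each new token is a length-1 scan that reuses the cached state,
# which is what step specializes today.
x, delta, B, C = toy_inputs(seqlen=1)
y_tok, ssm_state = selective_scan_with_state(x, delta, A, B, C, D, ssm_state=ssm_state)
```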

agiwave commented 5 months ago

Mamba is really great work on controlling model size. On that point, here is another thought. The Mamba vocab embedding is 50280 × 1024 (for the 370m model) = about 50M parameters. But does it really need d_model = 1024 to encode a word? My assumption is: the vocab dim could be reduced to 128 (or even 64) to encode a word, while the latent dim (d_model) stays at 1024 or more to hold information beyond a single word. So there could be an in_proj and an out_proj mapping between vocab_dim and d_model. Maybe this also gives us:

1. Sharing the vocab embedding between different-size models: 370m, 790m, ...
2. The vocab embedding is cut from 50280 × 1024 to 50280 × 128, or maybe 250000 × 128 would be better.
3. Training could be a little faster, because out_proj can focus on the word-encoding information and drop the high-level encoding information (which is noise for the training loss at the last step).

I can't experiment with LLMs due to resource limits, so this can only be an assumption. Feel free to ignore it :)
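For concreteness, here is a rough sketch of the factorized-embedding idea described above (similar in spirit to ALBERT-style embedding factorization). The class, names, and weight-tying scheme are assumptions for illustration, not part of the mamba_ssm codebase. With vocab_dim = 128, the embedding drops from roughly 50280 × 1024 ≈ 51.5M to 50280 × 128 ≈ 6.4M parameters, plus two small projections.

```python
import torch.nn as nn

class FactorizedVocabEmbedding(nn.Module):
    # Hypothetical sketch: embed tokens into a small vocab_dim, project up to
    # d_model for the backbone, and project back down for tied output logits.
    def __init__(self, vocab_size=50280, vocab_dim=128, d_model=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, vocab_dim)        # 50280 x 128 instead of 50280 x 1024
        self.in_proj = nn.Linear(vocab_dim, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, vocab_dim, bias=False)

    def encode(self, input_ids):
        # (batch, seqlen) -> (batch, seqlen, d_model)
        return self.in_proj(self.embed(input_ids))

    def decode(self, hidden_states):
        # (batch, seqlen, d_model) -> (batch, seqlen, vocab_size),
        # tying the logits to the small input embedding matrix.
        return self.out_proj(hidden_states) @ self.embed.weight.t()
```

A shared embedding of this kind could, in principle, be reused across the 370m and 790m models, since only in_proj and out_proj depend on d_model.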

agiwave commented 4 months ago

Maybe the version above can help a lot on this point. I don't know whether the repo owner will take it into consideration, so I tried it myself :)

AnaRhisT94 commented 1 month ago

OK, I finally made it work: https://github.com/state-spaces/mamba/pull/477 (with GPU support)