Closed hesingh closed 5 months ago
Right now we just use a pretrained tokenizer (from GPT-NeoX) and nn.Embedding. I'm not sure what changes you're suggesting.
See Table 1 and Figure 3 in this paper: https://arxiv.org/pdf/2312.14125.pdf.
Would you like to contribute? We welcome PRs.
I'd be happy to once I understand the code better.
How do I change Mamba to keep the pretrained GPT-NeoX tokenizer for text input, but use a different tokenizer for images and video, and yet another for audio?
Section 3.1 of the paper describes a vocabulary made up of 256 special tokens, 262k image and video tokens, and 4,096 audio tokens. How do I set up a vocabulary like that?
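One common way to build such a vocabulary is to keep a single `nn.Embedding` and carve the ID space into contiguous ranges, shifting each modality's raw codebook indices by an offset. A minimal sketch, using the token counts from the question; the layout order, the helper names, and the model dimension are my assumptions, not something the paper or the Mamba repo prescribes:

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary layout (order is an assumption):
#   [0, 256)                        special tokens
#   [256, 256 + 262144)             image/video tokens
#   [256 + 262144, + 4096)          audio tokens
N_SPECIAL = 256
N_VISUAL = 262_144
N_AUDIO = 4_096
VISUAL_OFFSET = N_SPECIAL
AUDIO_OFFSET = N_SPECIAL + N_VISUAL
VOCAB_SIZE = N_SPECIAL + N_VISUAL + N_AUDIO


def visual_to_token(codes: torch.Tensor) -> torch.Tensor:
    """Shift raw visual-codebook indices into the shared ID space."""
    return codes + VISUAL_OFFSET


def audio_to_token(codes: torch.Tensor) -> torch.Tensor:
    """Shift raw audio-codebook indices into the shared ID space."""
    return codes + AUDIO_OFFSET


# One embedding table covers all modalities; d_model=1024 is a placeholder.
embedding = nn.Embedding(VOCAB_SIZE, 1024)
```

The model itself never needs to know which range an ID came from; only the tokenization/detokenization code does the offset arithmetic.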
If you look at Mistral's code (mistral.ai), they ship a standalone tokenizer.py; modular code like that makes it easier to port Transformer apps to Mamba. I want to add new BOS and EOS tokens for the different data types, e.g., audio, image, video, etc.
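Per-modality BOS/EOS tokens can be appended after the base text vocabulary, so existing text token IDs stay valid. A rough sketch; the base vocabulary size and the token string format are assumptions (check `len(tokenizer)` for the actual GPT-NeoX tokenizer you load), and the embedding would then need to be resized to cover the new IDs:

```python
# Assumed base size of the pretrained text tokenizer; verify with len(tokenizer).
BASE_VOCAB_SIZE = 50_277
MODALITIES = ["text", "image", "video", "audio"]

# Assign fresh IDs for a hypothetical <bos_*>/<eos_*> pair per modality.
special_tokens: dict[str, int] = {}
next_id = BASE_VOCAB_SIZE
for modality in MODALITIES:
    for kind in ("bos", "eos"):
        special_tokens[f"<{kind}_{modality}>"] = next_id
        next_id += 1

# The model's embedding (and output head) must then grow to next_id rows, e.g.:
#   model.backbone.embedding = nn.Embedding(next_id, d_model)
```

With a Hugging Face tokenizer you could instead register these via `tokenizer.add_special_tokens(...)` and resize the embedding to the new `len(tokenizer)`; the manual version above just makes the ID bookkeeping explicit.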