sentialx asked (issue closed 10 months ago):

Hi, I couldn't find any info in the paper on the possibility of exposure bias. In transformers, training uses a causal mask that always conditions next-token prediction on the correct tokens to the left, which is called teacher forcing. Inference is therefore slightly out of distribution for transformers: a single erroneous token can derail the entire generation. Does Mamba work in a similar way, or are the model's own outputs somehow fed back as inputs during training? Would it be possible to train non-autoregressive models, such as diffusion models?
Mamba is similar to a (causal) transformer in that it is a sequence-to-sequence mapping: given input vectors [x_1, x_2, ...], it maps them to output vectors [y_1, y_2, ...] such that y_i depends only on x_1, ..., x_i. We train it the same way you would train a transformer, with a next-token prediction objective (i.e. teacher forcing). Generation also works as in transformers: given prompt tokens p_1, ..., p_k, the model predicts the distribution over the next token, we sample from it, (conceptually) append the sample to the prompt, and repeat.
Mamba is just a change in architecture, not a change in the training or inference procedure.
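To make the parallel with transformers concrete, here is a minimal PyTorch sketch of both procedures. `model`, `training_step`, and `generate` are hypothetical names for illustration, not this repo's actual API; I assume a generic causal sequence model mapping token ids of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab):

```python
import torch
import torch.nn.functional as F

# --- Training: teacher forcing / next-token prediction ---
# `tokens` is a batch of ground-truth sequences, shape (batch, seq_len).
# The model sees tokens[:, :-1] and must predict tokens[:, 1:]; because
# the architecture is causal, position i only sees positions <= i, and
# the inputs are always the *correct* previous tokens (teacher forcing).
def training_step(model, tokens, optimizer):
    logits = model(tokens[:, :-1])            # (batch, seq_len-1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten positions
        tokens[:, 1:].reshape(-1),            # shifted targets
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# --- Generation: autoregressive sampling, same as a transformer ---
# Given prompt tokens p_1, ..., p_k, predict the next-token distribution,
# sample from it, append the sample, and repeat.
@torch.no_grad()
def generate(model, prompt, max_new_tokens):
    tokens = prompt                           # shape (1, k)
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1, :]      # logits for the next token
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```

The sampling loop re-feeds the full sequence for clarity ("conceptually append"); an actual Mamba implementation can instead carry a recurrent state forward and process only the newest token at each step.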
Thank you for the clarification!