Hello,
I have been training models with Mamba (v1) and I'm enjoying it. I would like to use MuTransfer with Mamba. Should I just scale the width parameters (matrix dimensions and conv dimension), or are there other constants that need to be scaled, as with attention_scores in Transformers?