Hi, thank you for your excellent work!
While reading the training code (train.py), I noticed that you use the AdamW optimizer with betas=(0.9, 0.98). What was your reason for choosing these values, and have you tried other settings?
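For reference, this is the kind of setup I mean (a minimal sketch, not your actual train.py; the model here is just a placeholder):

```python
import torch

# Placeholder module standing in for the real model.
model = torch.nn.Linear(16, 16)

# AdamW with beta2 = 0.98 instead of the PyTorch default 0.999:
# a shorter horizon for the second-moment estimate, which is a
# common choice in transformer-style training recipes.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.98))

print(optimizer.defaults["betas"])  # (0.9, 0.98)
```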
I have also been training a Mamba-based model recently, and your work has inspired me a lot. Would you mind sharing your thoughts?