Thank you for your work! I've recently been reading the SmartMoE source code and noticed that `update_expert_mapping` in `layer.py` does not transfer the optimizer state of the experts when they are remapped. Could this cause issues with gradient updates?
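To make the concern concrete, here is a minimal sketch (not SmartMoE's actual code; the function and field names below are hypothetical) of why per-expert optimizer state, e.g. Adam's first and second moments, would need to follow the expert parameters during a remapping step. If only the weights move, the receiving slot keeps stale moments from a different expert:

```python
# Hypothetical illustration, NOT SmartMoE's real implementation:
# when expert parameters are permuted/migrated during a remapping step,
# the per-expert optimizer state (e.g. Adam's exp_avg / exp_avg_sq)
# should be permuted with them; otherwise the next optimizer step
# applies another expert's stale moments to the received weights.

def remap_experts(params, opt_state, mapping, move_state=True):
    """Permute expert parameters by `mapping` (new_slot -> old_slot).

    If move_state is True, each expert's optimizer state follows its
    parameters; otherwise the state stays in place (the suspected bug).
    """
    new_params = {new: params[old] for new, old in mapping.items()}
    if move_state:
        new_state = {new: opt_state[old] for new, old in mapping.items()}
    else:
        new_state = dict(opt_state)  # state left behind -> mismatched
    return new_params, new_state

params = {0: "w_expert_A", 1: "w_expert_B"}
state = {0: {"exp_avg": "m_A"}, 1: {"exp_avg": "m_B"}}
mapping = {0: 1, 1: 0}  # swap the two experts

# Correct: moments travel with their weights.
p_ok, s_ok = remap_experts(params, state, mapping, move_state=True)
assert p_ok[0] == "w_expert_B" and s_ok[0]["exp_avg"] == "m_B"

# Suspected bug: weights move but slot 0 still holds expert A's moments.
p_bad, s_bad = remap_experts(params, state, mapping, move_state=False)
assert p_bad[0] == "w_expert_B" and s_bad[0]["exp_avg"] == "m_A"
```

If my reading of the code is right and the state is indeed not moved, the symptom would be a transient accuracy hit after each remapping rather than an outright crash, since Adam would warm up again from mismatched moments.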