tiankongdeguiji opened 4 days ago
Hi @henrylhtsang @IvanKobzarev @joshuadeng @PaulZhang12, could you take a look at this problem? I think it may be related to the code here: https://github.com/pytorch/torchrec/blob/release/v0.8.0/torchrec/distributed/batched_embedding_kernel.py#L472
Hi @sarckk @TroyGarden, could you take a look at this problem?
We can reproduce this problem with the following command:

```shell
torchrun --master_addr=127.0.0.1 --master_port=1234 --nnodes=1 --nproc-per-node=1 --node_rank=0 test_optimizer_state.py --sharding_type $SHARDING_TYPE
```

using the environment torchrec==0.8.0+cu121, torch==2.4.0+cu121, fbgemm-gpu==0.8.0+cu121.
When SHARDING_TYPE=row_wise, it will print:
When SHARDING_TYPE=data_parallel, it will print:

```
xxx.weight.table_0.momentum1 -> xxx.weight.exp_avg
xxx.weight.table_0.exp_avg_sq -> xxx.weight.exp_avg_sq
```
We may load the model to continue training on clusters of a different scale, which can produce a different sharding plan and, as a result, the optimizer's state is not loaded correctly.
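To illustrate the mismatch, here is a minimal, hypothetical sketch (plain Python, no torchrec API) of remapping the fused-kernel key names shown in the output above (`momentum1`, with a `table_0` shard qualifier) to the dense Adam names (`exp_avg`). The remap table and helper are assumptions for illustration, not a torchrec utility:

```python
# Hypothetical mapping from fused-kernel suffixes to dense Adam suffixes.
# "momentum1" -> "exp_avg" is taken from the printed output above;
# "exp_avg_sq" already matches in both layouts.
FUSED_TO_DENSE = {
    "momentum1": "exp_avg",
    "exp_avg_sq": "exp_avg_sq",
}

def remap_optimizer_keys(state_dict):
    """Return a copy of state_dict with fused suffixes renamed and
    shard qualifiers like "table_0" dropped, so keys match the dense layout."""
    remapped = {}
    for key, value in state_dict.items():
        parts = key.split(".")
        suffix = FUSED_TO_DENSE.get(parts[-1], parts[-1])
        prefix = [p for p in parts[:-1] if not p.startswith("table_")]
        remapped[".".join(prefix + [suffix])] = value
    return remapped

print(remap_optimizer_keys({
    "xxx.weight.table_0.momentum1": 1,
    "xxx.weight.table_0.exp_avg_sq": 2,
}))
# -> {'xxx.weight.exp_avg': 1, 'xxx.weight.exp_avg_sq': 2}
```

A real fix would need to handle resharding of the state tensors themselves, not just the key names; this only shows why the two naming schemes fail to line up when loading.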
test_optimizer_state.py