🐛 Describe the bug
I call load_checkpoint to resume training of the vit_1d_tp2_pp2 model, but it fails with this error:
Traceback (most recent call last):
  File "train_with_trainer.py", line 143, in <module>
    load_checkpoint('./checkpoints/checkpoint0002.pth', model, optimizer, lr_scheduler)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 234, in load_checkpoint
    train_imagenet()
  File "train_with_trainer.py", line 96, in train_imagenet
    model_state = partition_pipeline_parallel_state_dict(model, model_state)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 133, in partition_pipeline_parallel_state_dict
    _send_state_dict(state_dict, gpc.get_next_global_rank(ParallelMode.PIPELINE), ParallelMode.PIPELINE)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 99, in _send_state_dict
    load_checkpoint('./checkpoints/checkpoint0002.pth', model, optimizer, lr_scheduler)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 234, in load_checkpoint
    state_tensor, state_size = dist.distributed_c10d._object_to_tensor(state_dict)
TypeError: _object_to_tensor() missing 1 required positional argument: 'device'
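The TypeError suggests that the installed PyTorch's private helper _object_to_tensor expects a device argument that checkpointing.py does not pass. Below is a minimal sketch of one possible compatibility shim, not a confirmed fix: it inspects the callee's signature and passes only the optional arguments it declares. The helper name call_with_supported_args and the two stand-in functions are hypothetical and only imitate the old and new call shapes so the sketch runs without PyTorch.

```python
import inspect


def call_with_supported_args(fn, obj, **optional):
    """Call fn(obj, ...) and pass only the optional keyword arguments fn declares.

    Hypothetical helper: meant to tolerate a private function whose
    signature differs between library versions.
    """
    params = inspect.signature(fn).parameters
    accepted = {name: value for name, value in optional.items() if name in params}
    return fn(obj, **accepted)


# Stand-ins (assumptions) that imitate the two call shapes seen in the error.
def old_object_to_tensor(obj):
    # Older shape: only the object is required.
    return ("tensor", obj)


def new_object_to_tensor(obj, device):
    # Newer shape: a device argument is also required.
    return ("tensor", obj, device)


state_dict = {"weight": 1}
old_result = call_with_supported_args(old_object_to_tensor, state_dict, device="cpu")
new_result = call_with_supported_args(new_object_to_tensor, state_dict, device="cpu")
```

In the real call site, fn would be dist.distributed_c10d._object_to_tensor and device a torch.device; that substitution is an assumption and has not been tested against any particular PyTorch or ColossalAI release.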
Environment
No response