hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

how to resume vit_1d_tp2_pp2 model #3720

Open stonewjf opened 1 year ago

stonewjf commented 1 year ago

🐛 Describe the bug

I call load_checkpoint to resume the vit_1d_tp2_pp2 model, but get this error:

Traceback (most recent call last):
  File "train_with_trainer.py", line 143, in <module>
    load_checkpoint('./checkpoints/checkpoint0002.pth', model, optimizer, lr_scheduler)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 234, in load_checkpoint
    train_imagenet()
  File "train_with_trainer.py", line 96, in train_imagenet
    model_state = partition_pipeline_parallel_state_dict(model, model_state)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 133, in partition_pipeline_parallel_state_dict
    _send_state_dict(state_dict, gpc.get_next_global_rank(ParallelMode.PIPELINE), ParallelMode.PIPELINE)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 99, in _send_state_dict
    load_checkpoint('./checkpoints/checkpoint0002.pth', model, optimizer, lr_scheduler)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 234, in load_checkpoint
    state_tensor, state_size = dist.distributed_c10d._object_to_tensor(state_dict)
TypeError: _object_to_tensor() missing 1 required positional argument: 'device'
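The TypeError says that the installed `_object_to_tensor` requires a `device` argument that the call in checkpointing.py does not pass. As a rough illustration only, here is a minimal sketch of a signature-tolerant call. It assumes the mismatch comes from two versions of the same function. The helper `call_with_supported_args` and both stand-in functions are hypothetical and are not part of ColossalAI or PyTorch.

```python
import inspect


def call_with_supported_args(fn, *args, **optional):
    """Call fn(*args), forwarding each optional keyword only if fn declares it.

    Hypothetical helper for illustration; not a ColossalAI or PyTorch API.
    """
    params = inspect.signature(fn).parameters
    accepts_any_kwarg = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    kwargs = {
        name: value
        for name, value in optional.items()
        if accepts_any_kwarg or name in params
    }
    return fn(*args, **kwargs)


# Hypothetical stand-ins for an older signature (no `device`) and a newer
# signature (requires `device`). They only mimic the argument lists.
def object_to_tensor_old(obj):
    return ("tensor", obj)


def object_to_tensor_new(obj, device):
    return ("tensor", obj, device)


state_dict = {"w": 1}

# `device` is dropped for the old signature and forwarded for the new one.
old_result = call_with_supported_args(object_to_tensor_old, state_dict, device="cpu")
new_result = call_with_supported_args(object_to_tensor_new, state_dict, device="cpu")
```

The same pattern could in principle wrap the real call, but `_object_to_tensor` is a private function whose argument list may differ again in other versions, so matching the ColossalAI and PyTorch versions is likely the safer route.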

Environment

No response

JThh commented 1 year ago

@stonewjf, can you please try this way of resuming from checkpoints?