enesmsahin opened 2 weeks ago
Thanks, definitely will try and take a look at it!

`load_state()` also fails when resuming from a checkpoint:
```
[rank3]: File "/opt/conda/envs/flux_cn/lib/python3.10/site-packages/deepspeed/runtime/state_dict_factory.py", line 168, in check_ckpt_list
[rank3]: assert len(self.ckpt_list) > 0
```
I am currently circumventing the issue by wrapping the `load_state()` and `save_state()` calls as follows:
```python
# Temporarily hide frozen models (whose engine has no checkpoint_engine)
# from the accelerator, then restore the full list afterwards.
acc_models = accelerator._models
accelerator._models = [model for model in acc_models if model.checkpoint_engine is not None]
# <load_state() or save_state() here>
accelerator._models = acc_models
```
System Info
Information
Tasks
One of the `no_trainer` scripts in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
src/accelerate/test_utils/scripts/external_deps/test_ds_multiple_model.py
Error Message:
This happens for frozen inference models. Their optimizer type is `DummyOptim`, so they are initialized with `DeepSpeedZeRoOffload`. As a result, `checkpoint_engine` is never assigned for them and it stays `None`: https://github.com/microsoft/DeepSpeed/blob/8cded575a94e296fee751072e862304676c95316/deepspeed/runtime/engine.py#L340

Expected behavior
The `accelerator` should handle the case when `self._models` contains frozen models here.

cc: @muellerzr