microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Universal checkpoint conversion - "Cannot find layer_01* files in there" #5776

Open exnx opened 1 month ago

exnx commented 1 month ago

I am trying to use the universal checkpoint conversion script, python ds_to_universal.py, but it fails with an error saying it cannot find a layer number. I'm not sure why, but layers 01 and 16 are missing: my code skips creating them when saving the checkpoint. The DeepSpeed checkpoint conversion expects them and therefore breaks. Does that sound familiar to anyone? Thanks in advance!

I am using the GPT-NeoX codebase and have DeepSpeed 0.14.4 installed.

Error:

.../global_step4 seems a bogus DeepSpeed checkpoint folder: Cannot find layer_01* files in there.

Here are the files in my save directory:

bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt  layer_04-model_00-model_states.pt  layer_09-model_00-model_states.pt  layer_14-model_00-model_states.pt
configs                                         layer_05-model_00-model_states.pt  layer_10-model_00-model_states.pt  layer_15-model_00-model_states.pt
layer_00-model_00-model_states.pt               layer_06-model_00-model_states.pt  layer_11-model_00-model_states.pt  layer_17-model_00-model_states.pt
layer_02-model_00-model_states.pt               layer_07-model_00-model_states.pt  layer_12-model_00-model_states.pt  mp_rank_00_model_states.pt
layer_03-model_00-model_states.pt               layer_08-model_00-model_states.pt  layer_13-model_00-model_states.pt
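
To make the gap in the listing above explicit, here is a minimal sketch that reports which layer indices are absent from a checkpoint folder. It assumes the layer files follow the layer_NN-model_MM-model_states.pt naming seen in the listing. The helper names (layer_indices, missing_layer_indices, missing_in_dir) are hypothetical and are not part of DeepSpeed.

```python
import re
from pathlib import Path

# Assumed naming pattern, taken from the file listing in this issue.
LAYER_FILE_RE = re.compile(r"^layer_(\d+)-model_(\d+)-model_states\.pt$")


def layer_indices(filenames):
    """Return the sorted layer indices found among the given file names."""
    found = set()
    for name in filenames:
        match = LAYER_FILE_RE.match(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(found)


def missing_layer_indices(filenames):
    """Return indices missing between the lowest and highest layer present."""
    present = layer_indices(filenames)
    if not present:
        return []
    expected = set(range(present[0], present[-1] + 1))
    return sorted(expected - set(present))


def missing_in_dir(ckpt_dir):
    """Convenience wrapper (hypothetical): scan a checkpoint folder on disk."""
    return missing_layer_indices(p.name for p in Path(ckpt_dir).iterdir())
```

Running missing_layer_indices over the file names listed above should flag layers 1 and 16, which matches what the converter complains about for layer_01.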

xylian86 commented 1 month ago

@exnx What is your DeepSpeed configuration, and can you share the stack trace of the error?

exnx commented 7 hours ago

I found the issue: the conversion looks for layer_01, but that layer has no weights in my model, so it is never saved.

So I had to hack the DeepSpeed library: I removed the assertions that check for layer_01 and made it look for layer_02 instead. DS looks for layer_01 so that it can figure out the model parallel size from that layer.
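
The idea behind the workaround above can be sketched without patching the library: rather than hard-coding layer_01, pick the first layer index at or after a preferred one that actually has files, and count its model_MM shards. This is only an illustration under the assumption that the number of layer_NN-model_MM files for one layer equals the model parallel size. infer_mp_size and preferred_layer are hypothetical names, and this is not DeepSpeed's actual code.

```python
import re
from collections import defaultdict

# Assumed naming pattern, taken from the file listing in this issue.
LAYER_FILE_RE = re.compile(r"^layer_(\d+)-model_(\d+)-model_states\.pt$")


def infer_mp_size(filenames, preferred_layer=1):
    """Return (layer_index, shard_count) for the first usable reference layer.

    The reference layer is the lowest layer index >= preferred_layer that
    has at least one file; its distinct model_MM ranks are counted.
    """
    shards = defaultdict(set)
    for name in filenames:
        match = LAYER_FILE_RE.match(name)
        if match:
            shards[int(match.group(1))].add(int(match.group(2)))

    candidates = sorted(idx for idx in shards if idx >= preferred_layer)
    if not candidates:
        raise ValueError(f"no layer files found at or after layer {preferred_layer:02d}")

    reference = candidates[0]
    return reference, len(shards[reference])
```

With the files from this issue, the function would skip the missing layer_01 and fall back to layer_02, which has a single model_00 shard.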