microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Universal checkpoint conversion - "Cannot find layer_01* files in there" #5776

Open exnx opened 3 months ago

exnx commented 3 months ago

I am trying to use the universal checkpoint conversion code, python ds_to_universal.py, but I get an error saying it cannot find a layer number. I'm not sure why, but I am missing layers 01 and 16; my code just skips creating them when saving the checkpoint. The DeepSpeed checkpoint conversion expects them and therefore breaks. Does that sound familiar to anyone? Thanks in advance!

I am using the GPT-NeoX codebase and have DeepSpeed 0.14.4 installed.
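
For reference, I'm invoking the conversion roughly like this (a sketch; the folder paths are placeholders for my actual run):

python deepspeed/checkpoint/ds_to_universal.py \
    --input_folder .../global_step4 \
    --output_folder .../global_step4_universal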

Error:

.../global_step4 seems a bogus DeepSpeed checkpoint folder: Cannot find layer_01* files in there.

Here are the files in my save directory:

bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt  layer_04-model_00-model_states.pt  layer_09-model_00-model_states.pt  layer_14-model_00-model_states.pt
configs                                         layer_05-model_00-model_states.pt  layer_10-model_00-model_states.pt  layer_15-model_00-model_states.pt
layer_00-model_00-model_states.pt               layer_06-model_00-model_states.pt  layer_11-model_00-model_states.pt  layer_17-model_00-model_states.pt
layer_02-model_00-model_states.pt               layer_07-model_00-model_states.pt  layer_12-model_00-model_states.pt  mp_rank_00_model_states.pt
layer_03-model_00-model_states.pt               layer_08-model_00-model_states.pt  layer_13-model_00-model_states.pt
xylian86 commented 3 months ago

@exnx What is your DeepSpeed configuration, and can you share the stack trace of the error?

exnx commented 1 month ago

I found the issue: the conversion looks for layer_01, but that layer has no weights in my model, so it is never saved.

So I had to hack the DeepSpeed library and change the assertions that check for layer_01 to look for layer_02 instead. DeepSpeed looks for layer_01 so that it can figure out the model-parallel size from that layer.
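
Roughly, the hack looks like this (a sketch, not my exact diff; I'm assuming the check lives in the folder-validation code in deepspeed/checkpoint/deepspeed_checkpoint.py, which is where the error string comes from):

# deepspeed/checkpoint/deepspeed_checkpoint.py (sketch of my change)
# The validation builds a list of required file prefixes and asserts that each
# prefix matches at least one file in the checkpoint folder.
if pipeline_parallel:
    # before: insisted on a layer_01* file, which my model never saves
    # file_prefix_list.extend([LAYER_FILE_PREFIX, f'{LAYER_FILE_PREFIX}01'])
    # after: point the check at a layer my checkpoint actually contains
    file_prefix_list.extend([LAYER_FILE_PREFIX, f'{LAYER_FILE_PREFIX}02'])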

tjruwase commented 1 month ago

@exnx, thanks for debugging this issue. Your analysis is correct. The purpose of that assertion is to confirm the existence of at least one layer_* file when using pipeline parallelism. There is nothing special about layer_01; it was just a convenient choice for the model used during development. For example, a more robust (but less efficient) validation would be to check that _get_layer_keys() is not empty.
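
Something like this sketch (illustrative only, not the actual implementation; the glob pattern assumes the layer_NN-model_MM-model_states.pt naming shown in the listing above):

import glob
import os

def validate_layer_files(ckpt_dir):
    # Prefix-agnostic validation: accept any layer_* file instead of requiring
    # layer_01 specifically, so models that skip saving weightless layers still pass.
    layer_files = glob.glob(os.path.join(ckpt_dir, 'layer_*-model_*-model_states.pt'))
    assert len(layer_files) > 0, \
        f'{ckpt_dir} seems a bogus DeepSpeed checkpoint folder: Cannot find layer_* files in there.'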