Resuming training size mismatch

afiaka87 commented 3 years ago

getting size mismatches on the entire checkpoint. This sort of thing.

        size mismatch for transformer.layers.blocks.12.g.net.fn.fn.net.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([8192, 1024]).
        size mismatch for transformer.layers.blocks.12.g.net.fn.fn.net.3.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
        size mismatch for transformer.layers.blocks.13.f.net.fn.fn.to_qkv.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1536, 1024]).
        size mismatch for transformer.layers.blocks.13.f.net.fn.fn.to_out.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 512]).
        size mismatch for transformer.layers.blocks.13.g.net.fn.fn.net.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([8192, 1024]).
        size mismatch for transformer.layers.blocks.13.g.net.fn.fn.net.3.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
        size mismatch for transformer.layers.blocks.14.f.net.fn.fn.to_qkv.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1536, 1024]).
        size mismatch for transformer.layers.blocks.14.f.net.fn.fn.to_out.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 512]).
        size mismatch for transformer.layers.blocks.14.g.net.fn.fn.net.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([8192, 1024]).
        size mismatch for transformer.layers.blocks.14.g.net.fn.fn.net.3.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
        size mismatch for transformer.layers.blocks.15.f.net.fn.fn.to_qkv.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1536, 1024]).
        size mismatch for transformer.layers.blocks.15.f.net.fn.fn.to_out.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 512]).
        size mismatch for transformer.layers.blocks.15.g.net.fn.fn.net.0.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([8192, 1024]).
        size mismatch for transformer.layers.blocks.15.g.net.fn.fn.net.3.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
        size mismatch for to_logits.1.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([50688, 1024]).
Killing subprocess 676

whenever I resume from a checkpoint. Were the checkpoint keys change recently?

afiaka87 commented 3 years ago

seems my checkpoint wasn't being saved properly. Maybe this is related to deepspeed requiring you to use their methods to load and save pytorch models?