Open Anditty opened 1 year ago
I found that if I use the `--deepspeed ds_config.json` option, then `print(trainer.model.state_dict()['model.layers.30.mlp.gate_proj.weight'])` prints `tensor([], device='cuda:0', dtype=torch.float16)`.
Also, the README.md says that FSDP full_shard mode is used, but FSDP and DeepSpeed should not be used at the same time.
I followed the steps in the README, but I still get an empty state dict. Here are the code and the output.
code:
output:
tensor([[ 1.5984e-03, -1.6602e-02, -1.6460e-03, ..., -1.6632e-02, -1.9989e-02, 1.1383e-02],
        [ 9.5062e-03, 3.3356e-02, 5.6343e-03, ..., -3.6743e-02, -3.2074e-02, 2.6810e-02],
        [ 1.1917e-02, -2.1515e-02, -2.6352e-02, ..., 2.7328e-02, -4.0550e-03, 1.5320e-02],
        ...,
        [-2.8503e-02, 1.5316e-03, -1.8753e-02, ..., 2.9846e-02, -1.9440e-02, 2.6703e-02],
        [ 5.6505e-05, -4.5898e-02, 2.0660e-02, ..., -6.5689e-03, -3.2043e-02, 1.8005e-02],
        [-7.1106e-03, -7.1487e-03, -4.5624e-03, ..., 1.3138e-02, -4.3060e-02, -1.5869e-02]])
training
tensor([], device='cuda:0', dtype=torch.float16)
trained
tensor([], device='cuda:0', dtype=torch.float16)
saved
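One thing that may be worth checking, as an assumption since the contents of `ds_config.json` are not shown here: if the config enables ZeRO stage 3, DeepSpeed partitions each parameter across ranks, and `state_dict()` then shows empty placeholder tensors like the ones above. A minimal sketch for reading the stage out of a config follows; the helper names `zero_stage` and `load_zero_stage` are hypothetical, and the example dict is illustrative rather than the config used in this issue.

```python
import json


def zero_stage(config):
    """Return the ZeRO stage from a DeepSpeed config dict (0 if unset)."""
    zero = config.get("zero_optimization", {})
    return zero.get("stage", 0)


def load_zero_stage(path):
    """Read a DeepSpeed JSON config file and return its ZeRO stage."""
    with open(path) as f:
        return zero_stage(json.load(f))


if __name__ == "__main__":
    # Illustrative config only; not the ds_config.json from this issue.
    example = {"zero_optimization": {"stage": 3}, "fp16": {"enabled": True}}
    if zero_stage(example) == 3:
        print("ZeRO stage 3: parameters are partitioned across ranks")
    else:
        print("ZeRO stage", zero_stage(example))
```

If the stage does turn out to be 3, the full weights can be materialized temporarily by wrapping the parameter access in DeepSpeed's `deepspeed.zero.GatheredParameters` context manager before printing them.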