pytorch / torchtitan

A native PyTorch Library for large model training

checkpoint.model_weights_only doesn't make any difference #336

Closed TJ-Solergibert closed 1 month ago

TJ-Solergibert commented 1 month ago

Hi,

With the llama3-8B config, setting model_weights_only = true or false in the .toml file (or passing the --checkpoint.model_weights_only flag) produces exactly the same checkpoints, with the same size, even across multiple runs.
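For reference, this is the config section I'm toggling. The model_weights_only field and the --checkpoint.model_weights_only flag are the ones in question; the surrounding field names are from the llama3-8B train config as I have it and may differ between versions:

```toml
[checkpoint]
enable_checkpoint = true
folder = "checkpoint"
interval = 500
# Expected: when true, the saved checkpoint contains only the model weights,
# not the optimizer/training state. In practice it makes no difference here.
model_weights_only = true
export_dtype = "bfloat16"
```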

[Screenshot: directory listing of the produced checkpoints, all identical in size]

Also, the HF Llama3-8B checkpoints are ~16 GB, compared to the ~90 GB torchtitan is producing. I'm running with DP = 4, PP = 1, and TP = 1.
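A back-of-envelope guess at the gap (my own estimate, assuming the larger files also carry fp32 weights plus Adam's two fp32 moment buffers, which I haven't verified against the checkpoint contents):

```python
n_params = 8e9                              # Llama3-8B parameter count, approx.
weights_bf16 = n_params * 2 / 1e9           # weights only, bf16: what HF ships
full_state = n_params * (4 + 4 + 4) / 1e9   # fp32 weights + Adam exp_avg + exp_avg_sq
print(f"weights only: ~{weights_bf16:.0f} GB, full training state: ~{full_state:.0f} GB")
# -> weights only: ~16 GB, full training state: ~96 GB (close to the ~90 GB observed)
```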

fegin commented 1 month ago

model_weights_only only takes effect for the very last checkpoint, after all training is done. We cannot apply model_weights_only to the other checkpoints, as those checkpoints are also used for fault tolerance, and resuming a run requires more than just the model weights.
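In other words, only the final save drops the training state. A minimal sketch of that branching, with hypothetical names (torchtitan actually saves via torch.distributed.checkpoint, not torch.save; this just illustrates the logic described above):

```python
import torch

def save_checkpoint(step: int, last_step: int, model, optimizer,
                    model_weights_only: bool) -> None:
    state = {"model": model.state_dict()}
    is_final = step == last_step
    if not (is_final and model_weights_only):
        # Intermediate checkpoints always keep the optimizer state (and, in
        # practice, lr-scheduler/dataloader state) so a crashed run can resume
        # exactly where it left off. Only the final checkpoint can be slimmed
        # down to weights only.
        state["optimizer"] = optimizer.state_dict()
    torch.save(state, f"checkpoint/step-{step}.pt")
```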