pytorch / torchtitan

A native PyTorch Library for large model training

Convergence testing best practices #648

Open · gnadathur opened this issue 1 month ago

gnadathur commented 1 month ago

The config for convergence testing is unclear. We should create a best-practice config in torchtitan for convergence testing.
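For illustration, a hypothetical TOML sketch of what such a best-practice config might pin down. The section and key names below are assumptions for illustration only, not torchtitan's actual schema:

```toml
# Hypothetical convergence-testing config; all section/key names are
# illustrative assumptions, not torchtitan's actual schema.
[training]
steps = 3000           # long enough to see the loss trend well past warmup
seed = 42              # fixed seed so reruns are comparable
deterministic = false  # see the determinism discussion below

[optimizer]
name = "AdamW"
lr = 3e-4

[lr_scheduler]
warmup_steps = 200     # see the LR warmup discussion below
```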

lessw2020 commented 1 month ago

Two core aspects here:

1. Ensure determinism (assuming these runs are meant to compare various dtypes, etc.). utils.set_determinism should handle this, but we need to confirm the cuBLAS setting has landed from Wei.
2. LR warmup is not being set properly. A basic schedule optimization is to pre-run while tracking L1 gradient norms, and feed those to Aaron's refined scheduler based on https://arxiv.org/abs/2310.07831 (a sketch of both pieces follows this list).
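For reference, a minimal sketch of what such a setup typically involves in stock PyTorch. This uses standard PyTorch APIs only and is not necessarily what utils.set_determinism or the refined scheduler actually do:

```python
import os
import torch

def set_determinism_sketch(seed: int = 42) -> None:
    """Minimal determinism setup (a sketch; utils.set_determinism may differ)."""
    # cuBLAS is only deterministic if this env var is set before kernels launch.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.manual_seed(seed)                   # seeds CPU and all CUDA devices
    torch.use_deterministic_algorithms(True)  # error out on nondeterministic ops
    torch.backends.cudnn.benchmark = False    # disable cuDNN autotuning variability

def linear_warmup(optimizer: torch.optim.Optimizer, warmup_steps: int):
    """Plain linear warmup to the base LR, then constant; a basic stand-in
    for a refined schedule like the one in https://arxiv.org/abs/2310.07831."""
    return torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
    )
```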

awgu commented 1 month ago

@lessw2020 I am not sure convergence testing should have determinism enabled, because ultimately that is not the setting you will train with (and whose convergence you care about).

Determinism is super useful for debugging and understanding where numeric differences come from, but for convergence testing, running with "prod" settings makes more sense to me.

fduwjj commented 1 month ago

I guess for torchtitan we really cannot call it "convergence", right? We never wait for training to finish until full convergence is reached, nor do we evaluate on a test benchmark to validate the model's accuracy; to me, that is what "convergence" means. Currently we only show "hey, the loss curve is going down".

tianyu-l commented 1 month ago

> we really cannot call it "convergence" right?

Technically yes. Maybe we can call them "loss-converging" tests. I don't think that would cause confusion, so it should be fine as long as we don't claim something incorrect ("convergence" would be).

wconstab commented 2 weeks ago

Action: document the guidelines for doing runs that are used to show a config is working correctly.

And how can a user reproduce our results? (What are our "results": a loss curve? What are the configs and settings they need to use?)

-> We should provide a table (in the README or nearby) showing the configs we tested, along with the loss curves, benchmark results, etc. (a possible skeleton is sketched below).

From @wz337: we can put the README in this folder so users would see it when they click into train_configs: https://github.com/pytorch/torchtitan/tree/main/train_configs
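For illustration, a possible skeleton for that table; every entry below is a placeholder, not a measured result:

| Config (train_configs/*.toml) | Parallelism settings | Loss curve | Benchmark results |
|---|---|---|---|
| `<config>.toml` | e.g. FSDP degree, TP degree | link to curve | link or TBD |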