Opened by gnadathur.
Two core aspects here:
1. Ensure determinism (assuming these runs are meant to compare various dtypes, etc.). `utils.set_determinism` should handle this, but we need to confirm that Wei's cuBLAS setting has landed.
2. LR warmup is not being set properly. A basic schedule optimization is to do a pre-run while tracking L1 gradient norms, then feed those into Aaron's refined scheduler based on https://arxiv.org/abs/2310.07831. (Rough sketch of both pieces below.)
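For concreteness, here is a minimal sketch of what these two pieces could look like using standard PyTorch APIs. This is not torchtitan's actual `utils.set_determinism` nor the gradient-norm-informed scheduler from the paper; names like `set_determinism_sketch` and the fixed warmup length are placeholders.

```python
import os
import torch
from torch.optim.lr_scheduler import LambdaLR

def set_determinism_sketch(seed: int = 0) -> None:
    """Hypothetical sketch of what a determinism helper typically pins down."""
    torch.manual_seed(seed)                             # seed CPU and CUDA RNGs
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"   # cuBLAS reproducibility (set before CUDA init)
    torch.use_deterministic_algorithms(True)            # error out on nondeterministic ops
    torch.backends.cudnn.benchmark = False              # avoid autotuned kernels changing run-to-run

def linear_warmup_lambda(warmup_steps: int):
    """Simple linear warmup factor; a stand-in for a gradient-norm-informed schedule."""
    def factor(step: int) -> float:
        return min(1.0, (step + 1) / warmup_steps)
    return factor

# Usage sketch: model/optimizer are placeholders, not torchtitan's training loop.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_lambda(warmup_steps=200))
```

The real version would derive the warmup length from the gradient norms tracked in the pre-run rather than a fixed constant.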
@lessw2020 I am not sure convergence testing should have determinism enabled, because ultimately that is not the setting you will train with (and care about converging in).
I think determinism is super useful for debugging and understanding where numeric differences come from, but for convergence testing, running with "prod" settings makes more sense to me.
I guess for torchTitan we really cannot call it "convergence," right? We never wait for training to finish until full convergence is achieved and then test on a benchmark to validate the model's accuracy; to me, that is what "convergence" means. Currently we only show "hey, the loss curve is going down."
> we really cannot call it "convergence" right?
Technically yes. Maybe we can call them "loss converging" tests. I don't think that would cause confusion, so it should be fine as long as we don't say anything wrong ("convergence" would be wrong).
Action: Document the guidelines for doing runs that are used to show a 'config' is working correctly.
And how can a user reproduce our results? (What are our 'results': a loss curve? What configs and settings do they need to use?)
-> We should provide a table (in the README or nearby) showing the configs we tested, along with the loss curves, benchmark results, etc.
From @wz337: we can put the README in this folder so users see it when they click into train_configs: https://github.com/pytorch/torchtitan/tree/main/train_configs
The config for convergence testing is unclear. Action: create a best-practice config in torchtitan for convergence testing (rough sketch of the knobs below).
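As a strawman for that best-practice config, here are the kinds of settings such a guideline would need to pin down, written as a plain Python dict. The section and key names are illustrative only, not torchtitan's actual train_config schema.

```python
# Illustrative only: these keys mirror the kinds of settings a convergence-testing
# guideline should fix so runs are comparable; they are not torchtitan's real schema.
convergence_test_config = {
    "model": {"name": "llama", "flavor": "debugmodel"},  # which model/size was tested
    "training": {
        "steps": 1000,              # long enough for the loss curve to be meaningful
        "batch_size": 8,
        "seq_len": 2048,
        "mixed_precision": "bf16",  # the dtype being compared
        "seed": 0,                  # record the seed even with "prod" (non-deterministic) settings
    },
    "optimizer": {"name": "AdamW", "lr": 3e-4, "warmup_steps": 200},
    "reporting": {
        "log_freq": 10,                # logging cadence, so loss curves are comparable
        "reference_loss_curve": None,  # link to the curve published in the README table
    },
}
```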