I trained a tiny Llama for 1500 steps on this PR and on a reference repo (before the refactor) and compared the losses: they are identical down to the last decimal place. The PR also passes and fails exactly the same tests as before the refactoring [link].
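For reference, the loss comparison above can be sketched as a simple script. The loss values below are made-up placeholders, not the actual numbers from the runs; comparing the logged strings directly means any difference in the last decimal place counts as a mismatch.

```python
# Placeholder loss logs from the two runs (hypothetical values).
ref_losses = ["10.8125", "9.4375", "8.9062"]  # reference repo (pre-refactor)
new_losses = ["10.8125", "9.4375", "8.9062"]  # this PR

# Require the runs to cover the same number of steps.
assert len(ref_losses) == len(new_losses), "step count mismatch"

# Collect any steps whose logged losses differ, even in the last digit.
mismatches = [
    (step, a, b)
    for step, (a, b) in enumerate(zip(ref_losses, new_losses))
    if a != b
]
print(f"{len(mismatches)} mismatching steps out of {len(ref_losses)}")
```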
Config of the trained model
Command:

```
FI_PROVIDER=efa USE_FAST=1 CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --rdzv-backend=c10d --nproc_per_node=4 run_train.py --config-file examples/config_phuc_tiny_llama.yaml
```