Closed · moussaKam closed this issue 2 days ago
Hi @moussaKam,
I'll open a PR with a fix soon, but meanwhile you can try applying the following commit yourself
https://github.com/swiss-ai/nanotron/commit/664c09aa48204b8a45756e15fac9cc6bf0b38ccf
Toni
Thanks @TJ-Solergibert
If I have already trained a model for one week and then run into this issue, is it still possible to resume? That way I wouldn't need to retrain the model for another week.
Just solved it with a hardcoded workaround. Never mind.
Nice! I was going to suggest training with the fixed PR for a single iteration, storing one checkpoint after that iteration, and then copying the values from the SINGLE original .pt checkpoint into the new .pt files (see the sketch below). You can't directly duplicate the original file because it will most likely have a different size/shape in each and every PP rank.
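A rough sketch of that idea, with hypothetical paths and file names; the actual layout of a nanotron checkpoint may differ. The point is to copy only the scalar progress counters from the single original lr-scheduler file into each new per-PP-rank file, leaving the per-rank lists untouched:

```python
import torch

# Hypothetical paths: the single original scheduler file and the per-PP-rank files
# produced after one iteration with the fixed code.
original = torch.load("old_ckpt/lr_scheduler.pt", map_location="cpu")

new_rank_files = [
    "new_ckpt/lr_scheduler_pp-rank-0.pt",
    "new_ckpt/lr_scheduler_pp-rank-1.pt",
]
for path in new_rank_files:
    state = torch.load(path, map_location="cpu")
    # Copy only scalar progress values; keep per-rank lists (lr_lambdas, base_lrs)
    # as-is, since their length differs across PP ranks.
    for key in ("last_epoch", "_step_count"):
        if key in original and key in state:
            state[key] = original[key]
    torch.save(state, path)
```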
Thanks. What I did was hardcode the handling of the current lr scheduler:
```python
if self.init_checkpoint_path is not None:
    try:
        load_lr_scheduler(
            lr_scheduler=self.lr_scheduler,
            root_folder=self.init_checkpoint_path,
        )
    except (IndexError, RuntimeError) as e:
        logger.warning(f"Failed to load lr_scheduler state: {e}. Initializing new scheduler.")
        # Calculate the correct learning rate based on progress
        checkpoint_metadata = load_meta(
            parallel_context=self.parallel_context,
            root_folder=self.init_checkpoint_path,
        )
        assert isinstance(checkpoint_metadata.metas, TrainingMetadata)
        current_step = checkpoint_metadata.metas.last_train_step
        # Fast-forward the scheduler to the current step
        for _ in range(current_step):
            self.lr_scheduler.step()
```
This at least solves the problem for now. I will use the new codebase with your PR for future training.
Thank you for opening the issue @moussaKam!
The issue happens because LambdaLR creates as many lr_lambdas as we have param_groups. Whereas we previously had a single param_group containing all the parameters, we recently opted for a single param per param_group, which is what created this issue: every process has a different number of params = param_groups = lr_lambdas.
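For illustration only (plain PyTorch, not nanotron code), a minimal example of that behaviour:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# LambdaLR expands a single lr_lambda into one entry per param_group, so the saved
# scheduler state depends on how many param_groups (here: one per parameter) the
# process owns.
params = [torch.nn.Parameter(torch.zeros(1)) for _ in range(3)]
optimizer = torch.optim.SGD([{"params": [p]} for p in params], lr=1e-3)
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: 1.0)

print(len(scheduler.lr_lambdas))                   # 3, one per param_group
print(len(scheduler.state_dict()["lr_lambdas"]))   # 3 as well
```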
Nonetheless @alexchen4ai, fixing this is easy: you can just load the lr_lambdas for a single param_group and duplicate it (using deepcopies), assuming of course that you want all your parameters to follow the same lr schedule, which is the default in nanotron!
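A minimal sketch of that approach, assuming the loaded state follows PyTorch's LambdaLR state-dict format; `loaded_state` and `lr_scheduler` are placeholders, and the real fix lives in the linked nanotron commit:

```python
from copy import deepcopy

def fit_lr_lambdas_to_param_groups(loaded_state: dict, lr_scheduler) -> dict:
    # Before calling lr_scheduler.load_state_dict, resize the saved "lr_lambdas" list
    # to match the number of param_groups on *this* process.
    n_groups = len(lr_scheduler.optimizer.param_groups)
    saved_lambda = loaded_state["lr_lambdas"][0]
    # All param_groups follow the same schedule (nanotron's default), so one saved
    # entry can stand in for all of them.
    loaded_state["lr_lambdas"] = [deepcopy(saved_lambda) for _ in range(n_groups)]
    return loaded_state
```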
After running the toy example, I ran it again to resume training and I'm getting an error, but only if PP > 1.
Here's the config:
Here's the error: