Pclanglais opened this issue 1 month ago (status: Open)
Hey, thanks for opening the issue. Can you add the error message that you get and the log?
Here they are (switched to .txt due to GitHub constraints):
Hi! About the "0 steps remaining" in this issue: there seems to be a bug here: https://github.com/huggingface/nanotron/blob/97c13b0d45212eab8443f6ee3c496934f2f5bbaa/src/nanotron/helpers.py#L694-L698. It reports that 0 steps are remaining whenever the current training step is larger than the first step of the stage, but the stage may not be finished yet: in our case the stage has 100,000 steps in total and we are trying to restart from step 42501.
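For illustration, here is a minimal sketch of the remaining-steps logic being described (all names are hypothetical, not the actual helpers.py code); the point is that returning 0 as soon as the current step is past the stage start is wrong while the stage is still in progress:

```python
def remaining_steps_in_stage(current_step: int, stage_start_step: int, stage_total_steps: int) -> int:
    # Hypothetical sketch, not the actual nanotron helpers.py code.
    stage_end_step = stage_start_step + stage_total_steps
    if current_step >= stage_end_step:
        return 0                          # stage really is finished
    if current_step < stage_start_step:
        return stage_total_steps          # stage has not started yet
    return stage_end_step - current_step  # stage in progress: count what is left

# With the numbers from this report (a 100,000-step stage, resuming at step 42501),
# this should report roughly 57,500 remaining steps, not 0.
```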
cc @zzhhjjj, could you take a look at this? (I attached a screenshot of the part that shows two different LRs despite having the same lr_schedule in the config and resuming from the checkpoint.)
I think you are correct. I'll take a look. I remember seeing the same issue before. A temporary bypass would be to modify the metafile by hand.
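In case it helps, a purely hypothetical sketch of what "modify the metafile by hand" could look like; the file name and field names below are assumptions about how the checkpoint stores its training metadata, not confirmed nanotron internals:

```python
import json
from pathlib import Path

# Hypothetical sketch: the file name and keys are assumptions; check your own checkpoint layout.
meta_path = Path("checkpoints/42500/checkpoint_metadata.json")
meta = json.loads(meta_path.read_text())
print(meta)  # inspect which step / consumed-samples fields the resume logic reads

# e.g. adjust whatever step counter the scheduler rebuild relies on, then re-save:
# meta["last_train_step"] = 42500
# meta_path.write_text(json.dumps(meta, indent=2))
```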
Running into the same issue. After resuming training with Nanoset, the LR takes a wrong value and stays constant.
Added some extra logging to see whether the LR scheduler is initialized correctly:
[0434:0]:11/01/2024 11:05:18 [INFO|DP=0|PP=0|TP=0|lrdn0434]: Learning rate scheduler state: {'base_lrs': [0.0003, 0.0003, ..., 0.0003], 'last_epoch': 12000, 'verbose': False, '_step_count': 12001, '_get_lr_called_within_step': False, '_last_lr': [3.11549437145356e-05, 3.11549437145356e-05, ..., 3.11549437145356e-05], 'lr_lambdas': [{}, {}, ..., {}]}
[0434:0]:11/01/2024 11:05:18 [INFO|DP=0|PP=0|TP=0|lrdn0434]: iteration: 12001 / 17500 | consumed_tokens: 25.2G | elapsed_time_per_iteration_ms: 28.2K | tokens_per_sec: 74.5K | tokens_per_sec_per_gpu: 2.33K | global_batch_size: 1.02K | lm_loss: 2.3 | lr: 0.000289 | model_tflops_per_gpu: 26.7 | hardware_tflops_per_gpu: 26.7 | grad_norm: 0.15 | cuda_memory_allocated: 8.86G | cuda_max_memory_reserved: 34.1G | hd_total_memory_tb: 9.25G | hd_used_memory_tb: 8.21G | hd_free_memory_tb: 1.03G
[0434:0]:11/01/2024 11:05:26 [INFO|DP=0|PP=0|TP=0|lrdn0434]: Learning rate scheduler state: {'base_lrs': [0.0003, 0.0003, ..., 0.0003], 'last_epoch': 12001, 'verbose': False, '_step_count': 12002, '_get_lr_called_within_step': False, '_last_lr': [0.00028892609384716615, 0.00028892609384716615, ..., 0.00028892609384716615], 'lr_lambdas': [{}, {}, ..., {}]}
The LR scheduler is initialized correctly before the first self.lr_scheduler.step() is called. It has the correct _last_lr from the previous training run (3.11549437145356e-05). However, this gets overwritten in the next iteration by a number that doesn't make sense (0.0002889). I'm having a hard time identifying where this would be happening in the code. I'll try and step through it line by line some other day and take a closer look.
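As a standalone PyTorch example (not nanotron code) of a mechanism that can produce exactly this behavior: load_state_dict() restores _last_lr, but the next step() recomputes the LR from whatever lr_lambda the resumed run rebuilt, so if that lambda was rebuilt with the wrong schedule parameters (for instance from a wrong remaining-steps count), the LR jumps on the first step after resume. A minimal sketch, assuming a LambdaLR-style scheduler:

```python
import torch

model = torch.nn.Linear(4, 4)

# Original run: linear decay over 17500 steps, advance to step 12000 and "checkpoint".
opt = torch.optim.SGD(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda step: max(0.0, 1.0 - step / 17500))
for _ in range(12000):
    sched.step()
state = sched.state_dict()

# Resumed run: same base LR, but the lambda is rebuilt with a different horizon
# (standing in for "the schedule was reconstructed with the wrong number of steps").
opt2 = torch.optim.SGD(model.parameters(), lr=3e-4)
sched2 = torch.optim.lr_scheduler.LambdaLR(opt2, lambda step: max(0.0, 1.0 - step / 1_000_000))
sched2.load_state_dict(state)

print(sched2.get_last_lr())  # restored value, matches the old run
sched2.step()
print(sched2.get_last_lr())  # recomputed from the new lambda -> LR jumps
```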
Could you be explicit about what your temporary bypass would look like, @zzhhjjj? What needs to be edited?
Regarding the other bug with the remaining-steps calculation: I don't think it has any effect, as that information is not used anywhere by Nanoset as far as I can see.
On our side the problem resolved itself (it hasn't happened again for weeks now). As far as I remember we re-updated the repo just before, so it may be worth trying.
I'm resuming from a checkpoint saved at step 12000. Training has only a single data stage. Total train steps in config is set to 17500.
I have been using the latest commit 51ca40bc5e1b1f5dcb55eaeb0b6f86dda03f3979 of the repo when running these experiments.
Retraining from a checkpoint works perfectly with on-the-fly tokenization, but breaks when using Nanoset: training restarts with a different LR, which does not match the one stored in lr_schedule.pt.
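One quick way to confirm the mismatch (a sketch; the exact path of lr_schedule.pt inside the checkpoint directory is an assumption, as is the assumption that it is a plain scheduler state dict saved with torch.save) is to load the saved scheduler state and compare its last_epoch and _last_lr with what the resumed run logs:

```python
import torch

# Path is illustrative; point it at the lr_schedule.pt inside your checkpoint.
state = torch.load("checkpoints/12000/lr_schedule.pt", map_location="cpu")
print("last_epoch:", state.get("last_epoch"))
print("_last_lr:", state.get("_last_lr", [None])[0])
```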
We also have two additional issues that are likely connected:
Training tested with this configuration: