lessw2020 / Ranger21

Ranger deep learning optimizer rewrite to use newest components
Apache License 2.0

optimizer = Ranger21(params=model.parameters(), lr=learning_rate) File "/mnt/Drive1/florian/msblob/Ranger21/ranger21/ranger21.py", line 179, in __init__ self.total_iterations = num_epochs * num_batches_per_epoch TypeError: unsupported operand type(s) for *: 'NoneType' and 'NoneType' #12

Open neuronflow opened 3 years ago

neuronflow commented 3 years ago

I get the following error when starting my training:

Traceback (most recent call last):
  File "tr_baseline.py", line 75, in <module>
    optimizer = Ranger21(params=model.parameters(), lr=learning_rate)
  File "/mnt/Drive1/florian/msblob/Ranger21/ranger21/ranger21.py", line 179, in __init__
    self.total_iterations = num_epochs * num_batches_per_epoch
TypeError: unsupported operand type(s) for *: 'NoneType' and 'NoneType'

I'm initializing Ranger with:

# ranger:
optimizer = Ranger21(params=model.parameters(), lr=learning_rate)
saruarlive commented 3 years ago

Have you tried it as shown below?

from ranger21 import Ranger21 
optimizer = Ranger21(model.parameters(), lr = 1e-02, num_epochs = epochs, num_batches_per_epoch = len(train_loader))
lessw2020 commented 3 years ago

Hi @neuronflow, @saruarlive is correct - the issue is that we need to know how many epochs and how many iterations per epoch there are in order to auto-compute the lr schedule. Clearly our error handling should be improved to make the issue obvious (I thought we were checking for this case), but the error listed above is basically saying that num_epochs=None and num_batches_per_epoch=None, so it can't do any math with them. I'll leave this open until I verify and add better error handling, but the core issue is that you need to pass in the total epochs and the number of iterations per epoch (and we need to document this better).
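For illustration, the guard in __init__ could look something like this (a hedged sketch, not the actual code in the repo):

# hypothetical validation sketch, not the actual Ranger21 source
if num_epochs is None or num_batches_per_epoch is None:
    raise ValueError(
        "Ranger21 pre-computes its lr schedule, so both num_epochs and "
        "num_batches_per_epoch must be passed, e.g. "
        "Ranger21(model.parameters(), lr=1e-3, num_epochs=epochs, "
        "num_batches_per_epoch=len(train_loader))"
    )
self.total_iterations = num_epochs * num_batches_per_epoch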

neuronflow commented 3 years ago

Thank you, with the above the training starts, but then crashes with this error:

  File "/mnt/Drive3/florian/multi_patch_blob_loss/neuronflow/training/epoch/trainEpoch.py", line 77, in train_epoch
    optimizer.step()
  File "/home/florian/miniconda3/envs/msblob/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/Drive1/florian/msblob/Ranger21/ranger21/ranger21.py", line 570, in step
    self.agc(p)
  File "/mnt/Drive1/florian/msblob/Ranger21/ranger21/ranger21.py", line 398, in agc
    p_norm = self.unit_norm(p).clamp_(self.agc_eps)
  File "/mnt/Drive1/florian/msblob/Ranger21/ranger21/ranger21.py", line 382, in unit_norm
    raise ValueError(
ValueError: unit_norm:: adaptive gclipping: unable to process len of 5 - currently must be <= 4

If I understand it correctly, Ranger21 contains an lr scheduler, so it does not make sense to combine it with cosine annealing and warm restarts?

lessw2020 commented 3 years ago

Hi @neuronflow, the ValueError above comes from parameters with more than 4 dimensions, e.g. 3D convolutions. If you pull the latest version that I posted last week, the adaptive gradient clipping will handle any number of dimensions, so that is resolved.
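For intuition, a dimension-agnostic unit norm along those lines could be sketched like this (an illustration only, not the exact code in ranger21.py):

import torch

def unit_norm(x: torch.Tensor) -> torch.Tensor:
    """Per-output-unit norm over all trailing dimensions, for any tensor rank."""
    if x.ndim <= 1:
        # biases / scalars: fall back to elementwise magnitude
        return x.abs()
    # reduce over every dimension except the first (the output-channel dim),
    # keeping dims so the result broadcasts back against x
    dims = tuple(range(1, x.ndim))
    return x.norm(p=2, dim=dims, keepdim=True)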

To your other point: by default Ranger21 handles the lr scheduling internally for you, so you would not want to combine it with cosine annealing or other lr schedulers. You can of course turn off the internal lr scheduling if you want to compare Ranger21's internal schedule against your own scheduler. I wouldn't recommend it, since there's a lot of validation behind the schedule Ranger21 sets, but you can certainly test it out. You can turn off scheduling by disabling the warmup (use_warmup=False) and the warmdown (warmdown_active=False).

I can see that it might be simpler if there were a single use_lr_scheduling=True/False flag, so I think I'll add that soon. But for now, turning warmup and warmdown off will have Ranger21 operate as a plain optimizer with no scheduling, and then you can drive the lr with your own schedule. Hope that helps!
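As a rough sketch of that setup (assuming the use_warmup and warmdown_active kwargs named above; model, train_loader, num_epochs, and compute_loss are placeholders for your own code):

import torch
from ranger21 import Ranger21

# disable Ranger21's internal warmup/warmdown and drive the lr externally
optimizer = Ranger21(
    model.parameters(),
    lr=1e-3,
    num_epochs=num_epochs,
    num_batches_per_epoch=len(train_loader),
    use_warmup=False,
    warmdown_active=False,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)

for epoch in range(num_epochs):
    for batch in train_loader:
        loss = compute_loss(model, batch)  # placeholder training step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()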

neuronflow commented 3 years ago

Thank you once again for the fast and detailed response. With the latest update it seems to work! :)

neuronflow commented 3 years ago

One further question: I have a training setup where I use multiple training data loaders with different batch lengths... is it possible to apply Ranger21 in this context?
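One possible way to handle this (a sketch under the assumption that a single epoch iterates over all of the loaders; loader_a/loader_b are hypothetical names) is to pass the summed loader lengths as the iterations per epoch:

from ranger21 import Ranger21

# if one epoch walks through several loaders, the iterations per epoch
# is the sum of their lengths
train_loaders = [loader_a, loader_b]  # hypothetical loaders
num_batches_per_epoch = sum(len(dl) for dl in train_loaders)

optimizer = Ranger21(
    model.parameters(),
    lr=1e-3,
    num_epochs=num_epochs,
    num_batches_per_epoch=num_batches_per_epoch,
)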