lixilinx / psgd_torch

Pytorch implementation of preconditioned stochastic gradient descent (affine group preconditioner, low-rank approximation preconditioner and more)

Training instability #2

Closed tlh24 closed 4 months ago

tlh24 commented 4 months ago

Hello, I'm trying out your optimizer and running into training instability (NaN issues). Do you have any advice for debugging this?

The setup code is copied from your MNIST LeNet5 example (which converges without problem); see below for the drop-in. The model in question is a modified transformer with L1 instead of dot-product attention; it converges with AdamW, SGD, and AdaDelta (at varying speeds).

Posting an issue as this may affect other users.

loss = torch.sum((yp - y)**2) # sum-of-squares error
if use_AdamW: 
    loss.backward()
    optimizer.step() 
else: 
    # gradients with create_graph=True so a Hessian-vector product can be taken
    grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    # random probe vectors and their Hessian-vector products
    vs = [torch.randn_like(W) for W in model.parameters()]
    Hvs = torch.autograd.grad(grads, model.parameters(), vs) 
    with torch.no_grad():
        # update the Kronecker-factored preconditioners, then precondition the gradients
        Qs = [psgd.update_precond_kron(Qlr[0], Qlr[1], v, Hv) for (Qlr, v, Hv) in zip(Qs, vs, Hvs)]
        pre_grads = [psgd.precond_grad_kron(Qlr[0], Qlr[1], g) for (Qlr, g) in zip(Qs, grads)]
        # clip by the global norm of the preconditioned gradients, then apply the update
        grad_norm = torch.sqrt(sum([torch.sum(g*g) for g in pre_grads]))
        lr_adjust = min(grad_norm_clip_thr/grad_norm, 1.0)
        [W.subtract_(lr_adjust*lr*g) for (W, g) in zip(model.parameters(), pre_grads)]
yaroslavvb commented 4 months ago

Have you tried lowering the learning rate?


opooladz commented 4 months ago

If you are NaNing out, you can reduce the lr of the preconditioner to 0.001 and see if that helps. We run an SGD to fit the curvature estimate. For standard CNNs an lr of 0.01 is usually sufficient; for the attention layers (which tend to be sparse) we reduce the lr to de-sparsify the grad space. For a transformer example you can consider my nanoGPT experiment: https://github.com/opooladz/Preconditioned-Stochastic-Gradient-Descent/tree/main/nanoGPT-PSGD
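
In case it helps, here is a rough sketch of that tweak applied to your drop-in above, assuming the old-style update_precond_kron accepts a step keyword for the preconditioner fitting rate (please check the signature in your copy of psgd.py):

# same loop as in your snippet, but with a smaller fitting rate for the preconditioner
Qs = [psgd.update_precond_kron(Qlr[0], Qlr[1], v, Hv, step=0.001)
      for (Qlr, v, Hv) in zip(Qs, vs, Hvs)]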

Here is a Google Colab example of PSGD working well with L1 attention on toy synthetic data, finding a solution with loss 0. If you post your full code I can take a look.

tlh24 commented 4 months ago

Right, I've tried scaling the lr parameter down, and it doesn't seem to help. Starting from a known-good parameter setting (courtesy of AdamW) makes it NaN faster, surprisingly.

Let me have a look at your repo...


opooladz commented 4 months ago

We have a new API for the optimizers; the one you posted was a bit outdated. XMat is the cross-diagonal (X-shaped) approximation of the Hessian, LRA is a low-rank approximation of the Hessian, Newton is the dense fitting, and Affine is the Kronecker factorization.

For black-box use I would recommend LRA or XMat. The Google Colab above shows both of these APIs on the L1 attention. In this new API there are two learning rates: lr_params is the learning rate for the weights of the NN, and lr_preconditioner is the lr for learning the curvature. Reducing lr_preconditioner should help reduce divergence. There is also a gradient clipping argument (grad_clip_max_norm); you can try setting that to 100 or 1.
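
To make that concrete, here is a minimal sketch of one training step with the new API. It assumes the closure-style step() interface of the new optimizer classes and the LRA argument names used elsewhere in this thread (model, x, y stand in for your network and batch; double-check everything against your copy of psgd.py):

import torch
import psgd  # from lixilinx/psgd_torch

opt = psgd.LRA(model.parameters(),
               lr_params=0.01,          # lr for the NN weights
               lr_preconditioner=0.01,  # lr for fitting the curvature
               grad_clip_max_norm=1.0)  # gradient clipping; try 100 or 1

def closure():
    # step() evaluates the loss through this closure and handles the backward pass
    yp = model(x)
    return torch.sum((yp - y)**2)

loss = opt.step(closure)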

tlh24 commented 4 months ago

Thanks for the quick responses! The new API demonstrated in the Colab notebook works great; it fixed the instability.

AdamW, L1 attention (CUDA implementation), zero init: [loss plot]
PSGD, L1 attention (CUDA implementation, finite-difference Hessian approximation), zero init: [loss plot]
More loss plots here, including a comparison to dot-product attention: https://docs.google.com/document/d/1Pp3UdY1LfegG93POeqCEuoULyWhCKAq7vOlqqEP__ws/edit?usp=sharing

Final question before we mark this as closed: is weight regularization best done through the loss? Or is there a parameter for that, à la weight decay?

opooladz commented 4 months ago

I took a look at your document. I would set lr_params=0.1 or 0.01; 0.001 is a bit too low for PSGD. We have a normalized learning rate, so lr_params=0.01 should be fine for most things. Technically having an lr > 1 doesn't make sense for PSGD, but empirically I've seen it work well even up to 10. As you increase lr_params you can consider reducing grad_clip_max_norm from 100 to 1 if you are seeing divergence.

There is another parameter, lr_preconditioner, that also defaults to 0.01, but one can reduce or increase it to get different performance as well (I see you have adjusted this). If you increase that lr it will update the curvature information faster and converge quicker, but there are times it won't find the optimal solution and will usually act more like Adam or SGD.

Also, you can (and probably should) reduce both lrs on the fly; here is an example of this.
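
If the linked example is not handy, the idea is roughly this, assuming the optimizer exposes lr_params and lr_preconditioner as plain mutable attributes (check your copy of psgd.py):

# inside the training loop, e.g. once per epoch (the 0.9 decay factor is a placeholder)
opt.lr_params *= 0.9
opt.lr_preconditioner *= 0.9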

If your NN is small enough you can use the Newton preconditioner for strong results. If you want to use just the diagonal preconditioner, you can set the rank (rank_of_approximation) to 0 in the LRA version of PSGD.
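
For example, a sketch reusing the LRA signature from this thread (rank_of_approximation is the rank argument):

# rank-0 low-rank approximation, i.e. a purely diagonal preconditioner
opt = psgd.LRA(model.parameters(), lr_params=0.01, lr_preconditioner=0.01,
               rank_of_approximation=0)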

On weight decay: I think you are asking if we have some decoupled Adam-style weight decay. We do not. We find that this style of wd really only helps regret-style optimizers and doesn't add much benefit to PSGD, or really to SGD, although I did not test it extensively on transformers, which is where decoupled weight decay is king. We have an implicit regularization on the weights: the delta in the gradients and the delta in the parameters of the NN are balanced. I suspect this is why that style of weight decay doesn't really help in PSGD, as the weights and gradients are already balanced. You can read more about this in PSGD (Xilin, 2015), criterion 3. We find that doing explicit weight decay as part of the loss works just fine; we also find that adding some randomness to the decay value tends to break some symmetries and find a better solution.
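
Here is a minimal sketch of that kind of loss-side decay, assuming a plain L2 penalty with a randomized coefficient (the 1e-4 base value and the jitter range are placeholders):

import torch

def regularized_loss(model, yp, y, base_decay=1e-4):
    loss = torch.sum((yp - y)**2)                    # task loss, as in the snippet above
    decay = base_decay * (0.5 + torch.rand(()))      # randomized decay value to break symmetries
    l2 = sum(torch.sum(p*p) for p in model.parameters())
    return loss + decay * l2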

Thank you for your interest in PSGD.

tlh24 commented 4 months ago

Very helpful, thank you.
Training with lr_params and lr_preconditioner both raised to 0.01, with gradient clipping set at 5, seems to have solved any remaining instability.

optimizer = psgd.LRA(model.parameters(), lr_params=0.01, lr_preconditioner=0.01,
                     momentum=0.9, preconditioner_update_probability=0.1,
                     exact_hessian_vector_product=False, rank_of_approximation=10,
                     grad_clip_max_norm=5)

Yes, I was asking about decoupled weight decay; I'll add L2 weight regularization with randomness as you suggest.