jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Apache License 2.0
1.24k stars 131 forks

linalg.svd: The algorithm failed to converge #26

Closed Blueman2 closed 3 months ago

Blueman2 commented 3 months ago

I have converted the GaLore code to C++ (libtorch) and am currently running into an issue where large layers (specifically the initial embedding layer) fail at SVD.

The layer is 31618x2624, and I am already running with full_matrices set to false.

[W BatchLinearAlgebraLib.cpp:703] Warning: torch.linalg.svd: During SVD computation with the selected cusolver driver, batches 0 failed to converge. A more accurate method will be used to compute the SVD as a fallback. Check doc at https://pytorch.org/docs/stable/generated/torch.linalg.svd.html (function operator ()) [08:32:34.6859554] Projection failed: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values (error code: 2623).
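A common way to deal with this class of failure is to validate the input before calling SVD and to fall back to a more robust path when the fast driver does not converge. The sketch below uses NumPy rather than libtorch for illustration; the function name `safe_svd_projection` and the jitter-based fallback are assumptions, not part of the GaLore codebase.

```python
import numpy as np

def safe_svd_projection(grad, rank):
    """Compute a rank-r left projection matrix from the gradient via SVD,
    guarding against the non-finite inputs that typically make the fast
    cusolver driver fail to converge. Illustrative sketch only."""
    if not np.all(np.isfinite(grad)):
        # NaN/Inf gradients should never reach SVD; skip the step instead.
        raise ValueError("gradient contains NaN/Inf; skip this step")
    try:
        u, s, vt = np.linalg.svd(grad, full_matrices=False)
    except np.linalg.LinAlgError:
        # Fallback: retry in float64 with tiny jitter to break repeated
        # singular values, loosely mirroring PyTorch's slower fallback.
        g64 = grad.astype(np.float64)
        g64 += 1e-10 * np.random.default_rng(0).standard_normal(g64.shape)
        u, s, vt = np.linalg.svd(g64, full_matrices=False)
    return u[:, :rank]

rng = np.random.default_rng(0)
g = rng.standard_normal((64, 32)).astype(np.float32)
p = safe_svd_projection(g, rank=8)
print(p.shape)  # (64, 8)
```

In libtorch the same pattern applies: check `torch::isfinite` on the gradient first, and catch the exception from `torch::linalg_svd` to retry on CPU or in double precision.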

Is this a known issue, and are there any workarounds?

matthewdouglas commented 3 months ago

@Blueman2 Apply GaLore only to linear layers.
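In practice this means routing only the 2-D weight matrices of attention and MLP projections to the GaLore optimizer, and leaving embeddings, norms, and biases on the regular path. The sketch below shows one way to do that split by parameter name; the module names follow common LLaMA-style conventions and are assumptions, not prescribed by GaLore.

```python
# Illustrative sketch: partition parameters so that only linear-layer
# weights (attention and MLP projections) get low-rank treatment.
TARGET_SUBSTRINGS = ("q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj")

def split_params(named_params):
    """named_params: iterable of (name, ndim) pairs.
    Returns (galore_names, regular_names)."""
    galore, regular = [], []
    for name, ndim in named_params:
        # Only 2-D matrices in targeted modules; embeddings/norms excluded.
        if ndim == 2 and any(t in name for t in TARGET_SUBSTRINGS):
            galore.append(name)
        else:
            regular.append(name)
    return galore, regular

params = [("model.embed_tokens.weight", 2),
          ("model.layers.0.self_attn.q_proj.weight", 2),
          ("model.layers.0.mlp.up_proj.weight", 2),
          ("model.layers.0.input_layernorm.weight", 1)]
galore_names, regular_names = split_params(params)
print(galore_names)
# ['model.layers.0.self_attn.q_proj.weight', 'model.layers.0.mlp.up_proj.weight']
```

With real PyTorch modules, the same filter would run over `model.named_parameters()`, checking `p.ndim == 2` instead of a stored integer.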

Blueman2 commented 3 months ago

False alarm: the first step had NaN gradients due to gradient scaling for fp16.
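With fp16 loss scaling, the first few steps can legitimately produce Inf/NaN gradients until the scaler settles, so the fix is to check the unscaled gradients and skip the step (and the SVD) when they are not finite. A minimal NumPy sketch of that check, assuming a hypothetical `step_is_safe` helper rather than any real GaLore or GradScaler API:

```python
import numpy as np

def step_is_safe(grads, inv_scale):
    """Unscale fp16 gradients in place and report whether they are all
    finite. Mirrors the skip-step behaviour of mixed-precision gradient
    scalers; illustrative sketch, not the GaLore API."""
    for g in grads:
        g *= inv_scale                    # unscale in place
        if not np.all(np.isfinite(g)):
            return False                  # overflow: skip step, lower scale
    return True

scale = 2.0 ** 16
good = [np.ones((4, 4), dtype=np.float32) * scale]
bad = [np.array([[np.inf]], dtype=np.float32)]
safe_good = step_is_safe(good, 1.0 / scale)
safe_bad = step_is_safe(bad, 1.0 / scale)
print(safe_good, safe_bad)  # True False
```

In PyTorch this is what `GradScaler.step` does internally; in a hand-rolled libtorch loop the check has to be explicit before the projection runs.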

pedramrst commented 2 months ago

@Blueman2 How did you solve that? I encountered this problem while training the Gemma-7B model. I applied GaLore only to the self-attention and MLP layers, just as in the provided scripts. However, the loss value is too high and has not converged after the first step.