jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Apache License 2.0

Zero Loss: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values #58

Open akjindal53244 opened 3 months ago

akjindal53244 commented 3 months ago

Hi GaLore Team, congratulations on the interesting work!

I am trying to fine-tune the Llama-3 8B model using GaLore but getting this error: `torch._C._LinAlgError: linalg.svd: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated singular values.`
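A common workaround for this `LinAlgError` (not specific to GaLore's own code) is to upcast the gradient to float32 before running the SVD, since `torch.linalg.svd` is much more likely to fail on bf16 inputs. A minimal sketch, where `low_rank_projector` is a hypothetical helper, not an actual GaLore function:

```python
import torch

def low_rank_projector(grad: torch.Tensor, rank: int) -> torch.Tensor:
    # Hypothetical helper: compute a GaLore-style projection matrix.
    # Upcasting to float32 before the SVD is a common workaround when
    # bf16 gradients make torch.linalg.svd fail to converge.
    g32 = grad.to(torch.float32)
    U, S, Vh = torch.linalg.svd(g32, full_matrices=False)
    # Project onto the top-`rank` left singular vectors.
    return U[:, :rank]

grad = torch.randn(16, 8, dtype=torch.bfloat16)
P = low_rank_projector(grad, rank=4)
assert P.shape == (16, 4)
```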

```yaml
datasets:

dataset_prepared_path:
val_set_size: 0.01
output_dir: ./outputs/galore-out

sequence_len: 2048
sample_packing: false
eval_sample_packing: true
pad_to_sequence_len: true

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 4
optimizer: galore_adamw_8bit_layerwise
lr_scheduler: cosine
learning_rate: 0.000001

optim_target_modules:

train_on_inputs: false
group_by_length: false
bf16: true
tf32: false

bfloat16: true

logging_steps: 4
flash_attention: true
```

BaohaoLiao commented 2 months ago

The first batch should always give a normal loss, because the SVD is computed from the gradients produced by that loss.
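A tiny illustration (not from the GaLore codebase) of why a collapsed loss makes the projection degenerate: if the gradient is all zeros, every singular value is zero, so the SVD carries no directional information for the projector.

```python
import torch

# Hypothetical illustration: a zero gradient (loss already collapsed)
# yields all-zero singular values, so the low-rank subspace is meaningless.
grad = torch.zeros(8, 8)
U, S, Vh = torch.linalg.svd(grad)
assert torch.all(S == 0)  # no informative subspace to project onto
```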

Is it possible for you to share code to reproduce the issue?