[Open] kyleliang919 opened 6 months ago
To replicate the results above, run the command from the README. Machine configuration: A100 80GB, CUDA 11.8; other dependencies were installed following the repo's recommendations.
```
# LLaMA-7B, 8-bit GaLore-Adam, single GPU, activation checkpointing
# bsz=16, 22.8G
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_7b.json \
    --lr 0.005 \
    --galore_scale 0.25 \
    --rank 1024 \
    --update_proj_gap 500 \
    --batch_size 16 \
    --total_batch_size 512 \
    --activation_checkpointing \
    --num_training_steps 150000 \
    --warmup_steps 15000 \
    --weight_decay 0 \
    --grad_clipping 1.0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --single_gpu \
    --optimizer galore_adamw8bit_per_layer
```
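For context, the `galore_adamw8bit_per_layer` option steps a separate optimizer per parameter as soon as that parameter's gradient is ready, so gradients and optimizer state need not all be resident at once. The sketch below is a toy, plain-Python illustration of that per-layer update pattern, not the repo's actual implementation; `PerLayerSGD`, `grad_ready_hook`, and scalar "parameters" are stand-ins invented here for illustration.

```python
# Toy sketch of a per-layer optimizer step (assumption: this mirrors the
# general pattern behind galore_adamw8bit_per_layer, where each parameter
# has its own optimizer stepped in a post-gradient hook).

class PerLayerSGD:
    """Hypothetical stand-in optimizer: plain SGD on a single scalar."""
    def __init__(self, lr):
        self.lr = lr

    def step(self, value, grad):
        return value - self.lr * grad

params = {"layer0.weight": 1.0, "layer1.weight": -2.0}
grads = {"layer0.weight": 0.5, "layer1.weight": -0.25}

# One optimizer instance per parameter.
optimizers = {name: PerLayerSGD(lr=0.1) for name in params}

def grad_ready_hook(name):
    # Step immediately when this parameter's gradient is ready,
    # then release the gradient -- memory is freed layer by layer.
    params[name] = optimizers[name].step(params[name], grads[name])
    grads[name] = None

for name in list(params):
    grad_ready_hook(name)

print(params)  # {'layer0.weight': 0.95, 'layer1.weight': -1.975}
```

The point of the pattern is that no single moment holds the full set of gradients plus full optimizer state, which is where the memory savings reported above come from.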
@kyleliang919 This may be related to the issue I just posted (#45).