jquesnelle / yarn

YaRN: Efficient Context Window Extension of Large Language Models

OOM error of distributed training on 80GB GPUs with Mistral-7b #59

Open TracyPlus opened 2 months ago

TracyPlus commented 2 months ago

I ran the following train.sh on Mistral-7B:

accelerate launch finetune.py \
    --output-dir output/yarn-mistral-7b-64k \
    --model mistralai/Mistral-7B-v0.1 \
    --architecture mistral \
    --scaling-factor 8 \
    --max-position-embeddings 4096 \
    --dataset emozilla/yarn-train-tokenized-16k-mistral \
    --sliding-window-attention-schedule 65536 \
    --lr-schedule constant \
    --learning-rate 0.000001 \
    --max-train-steps 1000

with accelerate config as:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 2,3,4,5,6,7
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

but I encountered an OutOfMemory error on my 80 GB A800s:

[screenshots of the OutOfMemory error, 2024-04-06]

I don't know if there's something wrong with my distributed training configuration 🥺 I hope someone can help me 🙏🙏
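For reference, a rough back-of-the-envelope estimate (a sketch only; it assumes bf16 weights and gradients plus fp32 AdamW moments, and the exact layout depends on how finetune.py loads the model) shows why plain `distributed_type: MULTI_GPU` (DDP) is tight on 80 GB cards: every GPU keeps a full replica of the model, its gradients, and the optimizer state, before any activation memory for long sequences.

```python
# Back-of-the-envelope per-GPU memory for full fine-tuning under plain DDP
# (MULTI_GPU replicates everything on each GPU -- nothing is sharded).
# Illustrative numbers only; actual usage depends on dtype choices,
# gradient checkpointing, and activation memory for the long sequences.

PARAMS = 7.24e9            # approximate parameter count of Mistral-7B
GIB = 1024 ** 3

weights = PARAMS * 2       # bf16 weights                        (2 bytes/param)
grads   = PARAMS * 2       # bf16 gradients                      (2 bytes/param)
adam    = PARAMS * 8       # fp32 exp_avg + exp_avg_sq for AdamW (8 bytes/param)

static = weights + grads + adam
print(f"model + grads + optimizer per GPU: {static / GIB:.0f} GiB")  # ~81 GiB
```

That is already at the 80 GB limit before any activations for long-context batches, which is consistent with the OOM; sharding the parameters and optimizer state across the GPUs (for example DeepSpeed ZeRO or FSDP via `accelerate config`) is the usual way to bring the per-GPU footprint down.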

YL-9 commented 1 month ago

I also encountered this problem. Have you solved it yet? @TracyPlus

Kwen-Chen commented 1 month ago

I also encountered this problem when I used YaRN with Llama 2.