Please try using the Docker image in the repo.
Also, please share the full accelerate command and DeepSpeed config you used to start the training.
ds config:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 128
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
cmd:

```bash
accelerate launch \
  --config_file ds_zero2.yaml \
  peft_lora_embedding_semantic_search.py \
  --dataset_name similarity_dataset_tmp \
  --max_length 4096 \
  --model_name_or_path /model \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --learning_rate 0.0001 \
  --weight_decay 0.01 \
  --max_train_steps 10000 \
  --gradient_accumulation_steps 1 \
  --lr_scheduler_type linear \
  --num_warmup_steps 100 \
  --output_dir trained_model \
  --use_peft
```
Same error in Docker with the given image.
It seems only max_length=1024 is supported; with max_length=2048, the GPU runs out of memory.
Also noticing this. I cannot get above max_length=1536 with an 80 GB A100, and I need at least 6k context for my training data.
Even though the model supports up to 32K tokens, the length recommended by the authors is 4K, and the model fine-tuning was done with a max length of 512 tokens.
Also, the gradient_accumulation_steps in the DeepSpeed config and in the accelerate launch command should match; in the setup above they are 128 and 1, respectively.
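If it helps, here is a minimal sanity-check sketch for that, assuming the accelerate config is the ds_zero2.yaml from the command above; the constants are placeholders for the values you actually use:

```python
# Check that the accelerate/DeepSpeed YAML and the CLI flag agree on gradient accumulation.
import yaml  # PyYAML

ACCELERATE_CONFIG = "ds_zero2.yaml"  # file passed via --config_file (placeholder)
CLI_GRAD_ACCUM = 1                   # value passed via --gradient_accumulation_steps (placeholder)

with open(ACCELERATE_CONFIG) as f:
    cfg = yaml.safe_load(f)

ds_grad_accum = cfg["deepspeed_config"]["gradient_accumulation_steps"]
if ds_grad_accum != CLI_GRAD_ACCUM:
    raise ValueError(
        f"gradient_accumulation_steps mismatch: {ds_grad_accum} in the DeepSpeed config "
        f"vs {CLI_GRAD_ACCUM} on the command line"
    )
```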
> Even though the model supports up to 32K tokens, the length recommended by the authors is 4K, and the model fine-tuning was done with a max length of 512 tokens.
I read the paper again, and indeed, that's the case.
In addition, every 100 steps (not optimization steps) take about 3 minutes. Is that normal for a 7B model with batch size 1 and max_length 1024 on a single A100-80G? It seems very slow.
- train batch size: 1
- gradient accumulation: 128
- ds stage: 2/3
- model: llama
- transformers_version: 4.35.2
- pytorch: 2.0.1
- cuda: 11.7
error log:

```
Traceback (most recent call last):
  File "/path/peft_lora_embedding_semantic_search.py", line 615, in <module>
    main()
  File "/path/peft_lora_embedding_semantic_search.py", line 542, in main
    positiveembs = model(**{k.replace("positive", ""): v for k, v in batch.items() if "positive" in k})
  File "/env_path/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/env_path/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/env_path/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1818, in forward
    loss = self.module(*inputs, **kwargs)
  File "/env_path/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/env_path/lib/python3.10/site-packages/peft/peft_model.py", line 1674, in forward
    return self.base_model(
  File "/env_path/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/path/peft_lora_embedding_semantic_search.py", line 271, in forward
    transformer_outputs = self.model(
  File "/env_path/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/env_path/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
    layer_outputs = decoder_layer(
  File "/env_path/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/env_path/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/env_path/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/env_path/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 390, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 79.35 GiB total capacity; 77.15 GiB already allocated; 421.19 MiB free; 77.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 937121)
```
Only 45,088,768 params are trainable, so it should not run out of memory on an A100-80G.
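For a rough sense of scale, here is a back-of-envelope sketch of the attention-score tensor built in the last traceback frame (attn_weights = torch.matmul(query_states, key_states.transpose(2, 3))), assuming LLaMA-7B with 32 attention heads, fp16 activations, and batch size 1, and ignoring every other activation buffer:

```python
# Size of the (batch, heads, seq, seq) attention-score tensor materialized by the
# matmul in the last traceback frame. Assumes LLaMA-7B (32 heads), fp16, batch size 1;
# all other activation memory is ignored, so this is only a lower bound.
num_heads = 32
bytes_per_elem = 2  # fp16

for seq_len in (512, 1024, 2048, 4096):
    attn_bytes = num_heads * seq_len * seq_len * bytes_per_elem
    print(f"max_length={seq_len:5d}: {attn_bytes / 2**20:7.1f} MiB per layer for attn_weights")
```

At max_length=4096 this single tensor is already 1024 MiB per layer, which is consistent with the failed 1024.00 MiB allocation in the log, and it grows quadratically with sequence length, so the trainable LoRA parameter count is not what limits memory here.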