Please try using the Docker image in the repo.
Also, please share the full accelerate command and DeepSpeed config you used to start the training.
ds config:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 128
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
cmd:

```bash
accelerate launch \
  --config_file ds_zero2.yaml \
  peft_lora_embedding_semantic_search.py \
  --dataset_name similarity_dataset_tmp \
  --max_length 4096 \
  --model_name_or_path /model \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --learning_rate 0.0001 \
  --weight_decay 0.01 \
  --max_train_steps 10000 \
  --gradient_accumulation_steps 1 \
  --lr_scheduler_type linear \
  --num_warmup_steps 100 \
  --output_dir trained_model \
  --use_peft
```
Same error in Docker with the given image.
It seems only max_length=1024 is supported; with max_length=2048, the GPU runs out of memory.
Also noticing this. I cannot get above max_length=1536 with an 80 GB A100, and I need at least 6k context for my training data.
Even though the model supports up to 32K tokens, the length recommended by the authors is 4K, and the model fine-tuning was done with a max length of 512 tokens.
Also, the gradient_accumulation_steps in the DeepSpeed config and in the accelerate launch command should match; in the setup above they are 128 and 1, respectively.
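If it helps, here is a minimal sanity-check sketch for that, assuming the accelerate config is the ds_zero2.yaml from the command above; the constants are placeholders for the values you actually use:

```python
# Check that the accelerate/DeepSpeed YAML and the CLI flag agree on gradient accumulation.
import yaml  # PyYAML

ACCELERATE_CONFIG = "ds_zero2.yaml"  # file passed via --config_file (placeholder)
CLI_GRAD_ACCUM = 1                   # value passed via --gradient_accumulation_steps (placeholder)

with open(ACCELERATE_CONFIG) as f:
    cfg = yaml.safe_load(f)

ds_grad_accum = cfg["deepspeed_config"]["gradient_accumulation_steps"]
if ds_grad_accum != CLI_GRAD_ACCUM:
    raise ValueError(
        f"gradient_accumulation_steps mismatch: {ds_grad_accum} in the DeepSpeed config "
        f"vs {CLI_GRAD_ACCUM} on the command line"
    )
```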
> Even though the model supports up to 32K tokens, the length recommended by the authors is 4K, and the model fine-tuning was done with a max length of 512 tokens.
I read the paper again, and indeed, that's the case.
In addition, every 100 steps (not optimization steps) take about 3 minutes. Is that normal for a 7B model with batch size 1 and max_length 1024 on a single A100-80G? It seems very slow.
- train batch size: 1
- gradient accumulation: 128
- ds stage: 2/3
- model: llama
- transformers_version: 4.35.2
- pytorch: 2.0.1
- cuda: 11.7
error log:

```
Traceback (most recent call last):
  File "/path/peft_lora_embedding_semantic_search.py", line 615, in <module>
    main()
  File "/path/peft_lora_embedding_semantic_search.py", line 542, in main
    positiveembs = model(**{k.replace("positive", ""): v for k, v in batch.items() if "positive" in k})
  File "/env_path/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/env_path/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/env_path/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1818, in forward
    loss = self.module(*inputs, **kwargs)
  File "/env_path/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/env_path/lib/python3.10/site-packages/peft/peft_model.py", line 1674, in forward
    return self.base_model(
  File "/env_path/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/path/peft_lora_embedding_semantic_search.py", line 271, in forward
    transformer_outputs = self.model(
  File "/env_path/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/env_path/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
    layer_outputs = decoder_layer(
  File "/env_path/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/env_path/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/env_path/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/env_path/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 390, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 79.35 GiB total capacity; 77.15 GiB already allocated; 421.19 MiB free; 77.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 937121)
```
Only 45,088,768 params are trainable, so it should not run out of memory on an A100-80G.
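For a rough sense of scale, here is a back-of-envelope sketch of the attention-score tensor built in the last traceback frame (attn_weights = torch.matmul(query_states, key_states.transpose(2, 3))), assuming LLaMA-7B with 32 attention heads, fp16 activations, and batch size 1, and ignoring every other activation buffer:

```python
# Size of the (batch, heads, seq, seq) attention-score tensor materialized by the
# matmul in the last traceback frame. Assumes LLaMA-7B (32 heads), fp16, batch size 1;
# all other activation memory is ignored, so this is only a lower bound.
num_heads = 32
bytes_per_elem = 2  # fp16

for seq_len in (512, 1024, 2048, 4096):
    attn_bytes = num_heads * seq_len * seq_len * bytes_per_elem
    print(f"max_length={seq_len:5d}: {attn_bytes / 2**20:7.1f} MiB per layer for attn_weights")
```

At max_length=4096 this single tensor is already 1024 MiB per layer, which is consistent with the failed 1024.00 MiB allocation in the log, and it grows quadratically with sequence length, so the trainable LoRA parameter count is not what limits memory here.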