hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

llama SFT with context extended to 32k, inference raises an error #1615

Closed: zhanglv0209 closed this issue 9 months ago

zhanglv0209 commented 9 months ago

SFT command:

/mnt/nvme0n1/zhanglv/venv/llama_etuning/bin/deepspeed --include localhost:0,1,2,3 --master_port=9888 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage sft \
    --model_name_or_path /mnt/nvme0n1/zhanglv/model/llama2-Chinese-7b-Chat/ \
    --do_train \
    --dataset sft_test_llama_kuo_32k \
    --ddp_timeout 36000 \
    --template llama2 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --output_dir /mnt/nvme1n1/zhanglv/model/out/sft/llama2-Chinese-7b-Chat-qlore-20231122-llmfactory \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --bf16 \
    --flash_attn \
    --rope_scaling linear \
    --shift_attn \
    --cutoff_len 32768 \
    --preprocessing_num_workers 15 \
    --cache_path /mnt/nvme1n1/zhanglv/model/out/sft/llama2-Chinese-7b-Chat-qlore-20231122-llmfactory-tokenize

After obtaining the fine-tuned model, I run inference:

instruction = """[INST] <>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

        If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n{} [/INST]"""

prompt = instruction.format("你好," * 6000)
generate_ids = model.generate(
    tokenizer(prompt, return_tensors='pt').input_ids.cuda(),
    max_new_tokens=32768,
)
output_text = tokenizer.decode(generate_ids[0], skip_special_tokens=True)
output_text
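
The snippet above assumes `model` and `tokenizer` already exist. A minimal loading sketch under that assumption, re-using the base model path and the LoRA `--output_dir` from the training command (these paths are taken from the command and may not match the actual deployment), might look like:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Paths assumed from the training command above (base model and --output_dir).
base_path = "/mnt/nvme0n1/zhanglv/model/llama2-Chinese-7b-Chat/"
adapter_path = "/mnt/nvme1n1/zhanglv/model/out/sft/llama2-Chinese-7b-Chat-qlore-20231122-llmfactory"

tokenizer = AutoTokenizer.from_pretrained(base_path)
model = AutoModelForCausalLM.from_pretrained(
    base_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the LoRA adapter produced by SFT.
# Note: since training used --rope_scaling linear, the model config may also need
# a matching rope_scaling entry for prompts beyond the base context length.
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()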

Error:

    165 output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File /mnt/nvme0n1/zhanglv/venv/small_project/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:389, in LlamaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, padding_mask)
    386     attn_weights = attn_weights + attention_mask
    388     # upcast attention to fp32
--> 389     attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
    390     attn_output = torch.matmul(attn_weights, value_states)
    392 if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):

OutOfMemoryError: CUDA out of memory. Tried to allocate 19.61 GiB (GPU 0; 79.19 GiB total capacity; 60.74 GiB already allocated; 15.55 GiB free; 62.37 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I did SFT with a 32k context, and inference raises this error. How should I handle it?

hiyouga commented 9 months ago

It looks like FlashAttention was not actually enabled.

zhanglv0209 commented 9 months ago

It looks like FlashAttention was not actually enabled.

Do you mean during inference, or during SFT?

hiyouga commented 9 months ago

During inference.

zhanglv0209 commented 9 months ago

FlashAttention

Sorry, how do I enable this at inference time? Could you give an example? Much appreciated.

zhanglv0209 commented 9 months ago

Found it: use_flash_attention_2=True
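
For reference, a minimal sketch of applying that flag at load time, assuming the same base model path as above and that the flash-attn package is installed:

import torch
from transformers import AutoModelForCausalLM

# use_flash_attention_2=True asks transformers to use the FlashAttention-2 kernels
# instead of the eager softmax path that ran out of memory above.
# Newer transformers versions spell this attn_implementation="flash_attention_2".
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/nvme0n1/zhanglv/model/llama2-Chinese-7b-Chat/",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    use_flash_attention_2=True,
)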