shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Train a medical large language model, with support for incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

Does LoRA incremental pretraining support qwen-14b and qwen-72b? #282

Closed zhuxiaobin closed 9 months ago

zhuxiaobin commented 10 months ago

Two questions: 1. Does LoRA incremental pretraining support qwen-14b and qwen-72b? 2. Also, does LoRA incremental pretraining work on an already-quantized model such as Qwen-72B-Chat-Int4, or can it only pretrain Qwen-72B-Chat, with quantization done separately afterwards?

zhuxiaobin commented 10 months ago

I trained with Qwen-14B-Chat on 3x A100 and got the following error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.50 GiB (GPU 0; 79.20 GiB total capacity; 74.65 GiB already allocated; 1.83 GiB free; 75.96 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The command I ran is:

CUDA_VISIBLE_DEVICES=0,1,2 torchrun --nproc_per_node 3 pretraining.py \
    --model_type auto \
    --model_name_or_path qwen/Qwen-14B-Chat \
    --train_file_dir ../book \
    --validation_file_dir ../book \
    --qlora True \
    --load_in_4bit True \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --seed 42 \
    --max_train_samples -1 \
    --max_eval_samples -1 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --block_size 2048 \
    --output_dir outputs-pt-qwen-14b \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --optim paged_adamw_32bit \
    --bf16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache

Is there a problem with the parameters I set?
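(For context: the allocator hint in the error message refers to the PYTORCH_CUDA_ALLOC_CONF environment variable, which can be exported before launching. A minimal sketch; the 128 MiB value is an illustrative assumption, not something recommended in this thread, and it only reduces fragmentation rather than freeing memory:

# Cap the caching allocator's split block size to reduce fragmentation,
# as suggested by the OOM message; 128 is an assumed example value.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# ...then relaunch the same training command.
)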

shibing624 commented 10 months ago

Run it as CUDA_VISIBLE_DEVICES=0,1,2 python pretraining.py, and use --per_device_train_batch_size 2. Once that runs successfully, scale the batch size back up.
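A minimal sketch of the adjusted launch under that suggestion: a single-process python launch lets --device_map auto shard the model across the three visible GPUs, instead of torchrun starting one DDP rank per GPU that each tries to hold the model. Only the launcher and batch sizes change here; all other arguments are assumed to stay as in the original command above.

# Single process; device_map auto places layers across GPUs 0-2.
CUDA_VISIBLE_DEVICES=0,1,2 python pretraining.py \
    --model_type auto \
    --model_name_or_path qwen/Qwen-14B-Chat \
    --qlora True \
    --load_in_4bit True \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --device_map auto \
    # ...remaining arguments unchanged from the command quoted above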