Closed scotlandowl closed 4 months ago
Reproduction

# Run command

```bash
CUDA_VISIBLE_DEVICES=0,1 FORCE_TORCHRUN=1 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml
```

# llama3_lora_sft_ds3.yaml

```yaml
### model
model_name_or_path: /gemini/Qwen1.5-14B-Chat

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: llama3_law
template: qwen
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/Qwen-14B/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```

# Terminal output

```
[2024-06-14 03:48:29,187] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
06/14/2024 03:48:35 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:24175
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2024-06-14 03:48:51,302] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
[2024-06-14 03:48:51,999] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
[2024-06-14 03:48:58,181] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-14 03:48:58,784] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-14 03:48:58,784] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Using config file: /etc/orion/env/env.conf
Using config file: /etc/orion/env/env.conf
```

# GPU status

```
+--------------------------------------------------------------------------------------------+
| ORION-SMI 1.0                  Time: 2024-06-14 03:59:46              CUDA Version: N/A    |
+-----------------------------------------------+----------------------+---------------------+
| IP         vGPU Name          Persistence-M   | Bus-Id        Disp.A | Volatile Uncorr. ECC|
| pGPU vGPU  Physical GPU Name                  | Memory-Usage         | GPU-Util  Compute M.|
|===============================================+======================+=====================|
| 10.169.5.3 Orion vGPU                     Off | N/A              Off |                 N/A |
| 2    0     B1.gpu.xlarge                      | 20006MiB / 24258MiB  | 99%         Default |
+--------------------------------------------------------------------------------------------+
| 10.169.5.3 Orion vGPU                     Off | N/A              Off |                 N/A |
| 6    0     B1.gpu.xlarge                      | 20006MiB / 24258MiB  | 0%          Default |
+--------------------------------------------------------------------------------------------+

+--------------------------------------------------------------------------------------------+
| Processes:                                                                 vGPU Memory     |
|  IP          pGPU  vGPU  PID   Type  Process name                          Usage           |
|============================================================================================|
|  10.169.5.3  2     0     3397  C     python                                20006MiB        |
|  10.169.5.3  6     0     3396  C     python                                20006MiB        |
+--------------------------------------------------------------------------------------------+
```
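Not part of the original report: one way to sanity-check the ~20 GiB per-GPU usage shown above is the ZeRO-3 sharding arithmetic. Under DeepSpeed ZeRO stage 3 the fp16 base-model weights are partitioned across ranks, so each of the two vGPUs holds roughly half the model. A minimal sketch, where the ~14.2e9 parameter count for Qwen1.5-14B-Chat and the even split are assumptions, and activations, gradients, optimizer states, and CUDA buffers are ignored:

```python
def zero3_weight_shard_gib(num_params: float, num_gpus: int, bytes_per_param: int = 2) -> float:
    """Approximate per-GPU memory (GiB) for model weights sharded by ZeRO-3.

    Assumes fp16 weights (2 bytes/param) and an even split across ranks;
    ignores activations, gradients, optimizer states, and framework overhead.
    """
    return num_params * bytes_per_param / num_gpus / 2**30

# ~14.2e9 parameters (assumption) across the 2 vGPUs in the report
shard = zero3_weight_shard_gib(14.2e9, num_gpus=2)
print(f"{shard:.1f} GiB per GPU for weights alone")  # ~13.2 GiB
```

With activations, LoRA optimizer states, and NCCL/CUDA buffers on top, ~20 GiB on a 24 GiB card seems plausible, which may be why `per_device_train_batch_size` is kept at 1.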
Reminder

System Info

llamafactory version: 0.8.2.dev0

Expected behavior

Normal (the training run should proceed normally).

Others

No response
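As a side note on the training config in the report above: the global batch size the optimizer sees per update step is the product of the per-device batch size, the gradient-accumulation steps, and the world size (here two GPUs, from `CUDA_VISIBLE_DEVICES=0,1`). A quick sketch with the values from `llama3_lora_sft_ds3.yaml`:

```python
def effective_batch_size(per_device: int, grad_accum: int, world_size: int) -> int:
    """Global batch size per optimizer step in data-parallel training."""
    return per_device * grad_accum * world_size

# per_device_train_batch_size=1, gradient_accumulation_steps=2, 2 GPUs
print(effective_batch_size(1, 2, 2))  # 4
```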