hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs
Apache License 2.0

LoRA fine-tuning of qwen-14b-chat on A10s: 2-machine 2-GPU training is 10x slower than 1-machine 2-GPU #4620

Closed WangxuP closed 4 days ago

WangxuP commented 4 days ago

Reminder

System Info

Base model

Qwen-14b-chat

Training data

4,000 samples in total; no single sample is longer than 1024 tokens.

CUDA version

NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4

Python

3.10.14

Training environment

We are using Baidu Cloud A10 machines that communicate over the Baidu Cloud internal network; file-transfer throughput between them is roughly 150 MB/s.
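Since the 150 MB/s figure comes from file transfers, it may not reflect what NCCL actually achieves; a direct measurement between the two nodes is a quick sanity check. A minimal sketch using iperf3 (assuming it can be installed on both machines):

# On the master node (192.168.32.8): start an iperf3 server
iperf3 -s

# On the worker node: measure TCP throughput to the master for 10 seconds
iperf3 -c 192.168.32.8 -t 10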

Reproduction

# master node (rank 0)
CUDA_VISIBLE_DEVICES=0 FORCE_TORCHRUN=1 NNODES=2 RANK=0 MASTER_ADDR=192.168.32.8 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml

# worker node (rank 1)
FORCE_TORCHRUN=1 NNODES=2 RANK=1 MASTER_ADDR=192.168.32.8 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml
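When chasing multi-node slowdowns like this, it can help to relaunch with NCCL diagnostics enabled and the socket interface pinned to the internal NIC. A sketch for the master node (the worker node is analogous with RANK=1); the interface name eth0 is an assumption, check yours with ip addr:

# Same launch as above, with NCCL debug logging enabled
NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=eth0 \
CUDA_VISIBLE_DEVICES=0 FORCE_TORCHRUN=1 NNODES=2 RANK=0 \
MASTER_ADDR=192.168.32.8 MASTER_PORT=29500 \
llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml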

llama3_lora_sft_ds3.yaml

### model
model_name_or_path: /home/models/Qwen-14B-Chat/

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset: time_change5_llama
template: qwen
cutoff_len: 1024
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen/lora/sft
logging_steps: 10
save_steps: 50
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 5
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
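For reference, this YAML implies a global batch size of 4 on the 2-node, 1-GPU-per-node setup, which matches the train_batch_size that DeepSpeed reports in the log further down:

# per_device_train_batch_size (1) x gradient_accumulation_steps (2) x world_size (2)
echo $(( 1 * 2 * 2 ))   # -> 4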

ds_z3_offload_config.json

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
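Note that this config both shards the parameters across all ranks (stage 3), so each forward and backward pass must re-gather remote shards over the inter-node link, and offloads parameters and optimizer state to CPU, adding host-device copies on top. As a quick A/B test, the training YAML can be pointed at the non-offload ZeRO-3 config that ships alongside this one (a sketch; verify the file name in your checkout first):

ls examples/deepspeed/   # should list ds_z3_config.json next to ds_z3_offload_config.json
sed -i 's|ds_z3_offload_config.json|ds_z3_config.json|' \
    examples/train_lora/llama3_lora_sft_ds3.yaml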


requirements.txt

(llamafactory) [root@instance-67wbmebl LLaMA-Factory-0.8.2]# pip list
Package                       Version     Editable project location
----------------------------- ----------- -----------------------------
accelerate                    0.31.0
aiofiles                      23.2.1
aiohttp                       3.9.5
aiosignal                     1.3.1
altair                        5.3.0
annotated-types               0.7.0
anyio                         3.7.1
async-timeout                 4.0.3
attrs                         23.2.0
certifi                       2024.6.2
charset-normalizer            3.3.2
click                         8.1.7
contourpy                     1.2.1
cycler                        0.12.1
datasets                      2.20.0
deepspeed                     0.14.4
dill                          0.3.8
dnspython                     2.6.1
docstring_parser              0.16
einops                        0.8.0
email_validator               2.2.0
exceptiongroup                1.2.1
fastapi                       0.111.0
fastapi-cli                   0.0.4
ffmpy                         0.3.2
filelock                      3.15.4
fire                          0.6.0
flash-attn                    2.5.9.post1
fonttools                     4.53.0
frozenlist                    1.4.1
fsspec                        2024.5.0
gradio                        4.37.1
gradio_client                 1.0.2
h11                           0.12.0
hjson                         3.1.0
httpcore                      0.13.7
httptools                     0.6.1
httpx                         1.0.0b0
huggingface-hub               0.23.4
idna                          3.7
importlib_resources           6.4.0
Jinja2                        3.1.4
jsonschema                    4.22.0
jsonschema-specifications     2023.12.1
kiwisolver                    1.4.5
llamafactory                  0.8.2       /home/wxp/LLaMA-Factory-0.8.2
markdown-it-py                3.0.0
MarkupSafe                    2.1.5
matplotlib                    3.9.0
mdurl                         0.1.2
mpmath                        1.3.0
multidict                     6.0.5
multiprocess                  0.70.16
networkx                      3.3
ninja                         1.11.1
numpy                         1.26.4
nvidia-cublas-cu12            12.1.3.1
nvidia-cuda-cupti-cu12        12.1.105
nvidia-cuda-nvrtc-cu12        12.1.105
nvidia-cuda-runtime-cu12      12.1.105
nvidia-cudnn-cu12             8.9.2.26
nvidia-cufft-cu12             11.0.2.54
nvidia-curand-cu12            10.3.2.106
nvidia-cusolver-cu12          11.4.5.107
nvidia-cusparse-cu12          12.1.0.106
nvidia-ml-py                  12.555.43
nvidia-nccl-cu12              2.20.5
nvidia-nvjitlink-cu12         12.5.40
nvidia-nvtx-cu12              12.1.105
orjson                        3.10.5
packaging                     24.1
pandas                        2.2.2
peft                          0.11.1
pillow                        10.3.0
pip                           24.1.1
protobuf                      5.27.2
psutil                        6.0.0
py-cpuinfo                    9.0.0
pyarrow                       16.0.0
pyarrow-hotfix                0.6
pydantic                      2.8.0b1
pydantic_core                 2.20.0
pydub                         0.25.1
Pygments                      2.18.0
pyparsing                     3.1.2
python-dateutil               2.9.0.post0
python-dotenv                 1.0.1
python-multipart              0.0.9
pytz                          2024.1
PyYAML                        6.0.2rc1
referencing                   0.35.1
regex                         2024.5.15
requests                      2.32.3
rfc3986                       1.5.0
rich                          13.7.1
rpds-py                       0.18.0
ruff                          0.5.0
safetensors                   0.4.3
scipy                         1.14.0
semantic-version              2.10.0
sentencepiece                 0.2.0
setuptools                    69.5.1
shellingham                   1.5.4
shtab                         1.7.1
six                           1.16.0
sniffio                       1.3.1
sse-starlette                 2.1.2
starlette                     0.37.2
sympy                         1.12.1
termcolor                     2.4.0
tiktoken                      0.7.0
tokenizers                    0.19.1
tomlkit                       0.12.0
toolz                         0.12.1
torch                         2.3.1
tqdm                          4.66.4
transformers                  4.42.2
transformers-stream-generator 0.0.5
triton                        2.3.1
trl                           0.9.4
typer                         0.12.3
typing_extensions             4.12.2
tyro                          0.8.5
tzdata                        2024.1
ujson                         5.10.0
urllib3                       2.2.2
uvicorn                       0.30.1
uvloop                        0.19.0
watchfiles                    0.22.0
websockets                    11.0.3
wheel                         0.43.0
xxhash                        3.4.1
yarl                          1.9.4

Expected behavior

No response

Others

No response

WangxuP commented 4 days ago

The key logs from the fine-tuning run are as follows:

(llamafactory) [root@instance-67wbmebl LLaMA-Factory-0.8.2]# CUDA_VISIBLE_DEVICES=0 FORCE_TORCHRUN=1 NNODES=2 RANK=0 MASTER_ADDR=192.168.32.8 MASTER_PORT=29500 llamafactory-cli train examples/train_lora/llama3_lora_sft_ds3.yaml
[2024-06-29 15:19:05,408] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
06/29/2024 15:19:07 - INFO - llamafactory.cli - Initializing distributed tasks at: 192.168.32.8:29500
[2024-06-29 15:19:32,522] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-06-29 15:19:34,436] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-29 15:19:34,436] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
06/29/2024 15:19:34 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
06/29/2024 15:19:34 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.float16
[INFO|tokenization_utils_base.py:2159] 2024-06-29 15:19:34,507 >> loading file qwen.tiktoken
[INFO|tokenization_utils_base.py:2159] 2024-06-29 15:19:34,507 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2159] 2024-06-29 15:19:34,508 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2159] 2024-06-29 15:19:34,508 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2159] 2024-06-29 15:19:34,508 >> loading file tokenizer.json
06/29/2024 15:19:34 - INFO - llamafactory.data.template - Add eos token: <|im_end|>
06/29/2024 15:19:34 - INFO - llamafactory.data.template - Add pad token: <|im_end|>
06/29/2024 15:19:34 - INFO - llamafactory.data.loader - Loading dataset time_change5_llama.json...
Converting format of dataset (num_proc=16): 100%|█████████████████████████████████████████████████████████████████████████████| 4000/4000 [00:00<00:00, 29515.42 examples/s]
Running tokenizer on dataset (num_proc=16): 100%|███████████████████████████████████████████████████████████████████████████████| 4000/4000 [00:15<00:00, 258.17 examples/s]
input_ids:
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 105043, ...]
inputs:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
...
label_ids:
[-100, -100, -100, ..., 515, 1, 3328, 788, 330, 7319, 7689, 10700, 2129, 756, ..., 698, 92, 151645]

[INFO|configuration_utils.py:731] 2024-06-29 15:20:34,982 >> loading configuration file /home/models/Qwen-14B-Chat/config.json
[INFO|configuration_utils.py:731] 2024-06-29 15:20:34,983 >> loading configuration file /home/models/Qwen-14B-Chat/config.json
[INFO|configuration_utils.py:800] 2024-06-29 15:20:34,984 >> Model config QWenConfig {
  "_name_or_path": "/home/models/Qwen-14B-Chat/",
  "architectures": [
    "QWenLMHeadModel"
  ],
  "attn_dropout_prob": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_qwen.QWenConfig",
    "AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
  },
  "bf16": false,
  "emb_dropout_prob": 0.0,
  "fp16": false,
  "fp32": false,
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 27392,
  "kv_channels": 128,
  "layer_norm_epsilon": 1e-06,
  "max_position_embeddings": 8192,
  "model_type": "qwen",
  "no_bias": true,
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "onnx_safe": null,
  "rotary_emb_base": 10000,
  "rotary_pct": 1.0,
  "scale_attn_weights": true,
  "seq_length": 8192,
  "softmax_in_fp32": false,
  "tie_word_embeddings": false,
  "tokenizer_class": "QWenTokenizer",
  "transformers_version": "4.42.2",
  "use_cache": true,
  "use_cache_kernel": false,
  "use_cache_quantization": false,
  "use_dynamic_ntk": true,
  "use_flash_attn": "auto",
  "use_logn_attn": true,
  "vocab_size": 152064
}

[INFO|modeling_utils.py:3553] 2024-06-29 15:20:35,012 >> loading weights file /home/models/Qwen-14B-Chat/model.safetensors.index.json
[INFO|modeling_utils.py:3698] 2024-06-29 15:20:35,012 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[INFO|configuration_utils.py:1000] 2024-06-29 15:20:35,017 >> Generate config GenerationConfig {}

Warning: please make sure that you are using the latest codes and checkpoints, especially if you used Qwen-7B before 09.25.2023.请使用最新模型和代码,尤其如果你在9月25日已经开始使用Qwen-7B,千万注意不要使用错误代码和模型。
[2024-06-29 15:21:13,136] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 323, num_elems = 14.17B
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:58<00:00,  3.91s/it]
[INFO|modeling_utils.py:4364] 2024-06-29 15:22:11,839 >> All model checkpoint weights were used when initializing QWenLMHeadModel.

[INFO|modeling_utils.py:4372] 2024-06-29 15:22:11,839 >> All the weights of QWenLMHeadModel were initialized from the model checkpoint at /home/models/Qwen-14B-Chat/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use QWenLMHeadModel for predictions without further training.
[INFO|configuration_utils.py:953] 2024-06-29 15:22:11,842 >> loading configuration file /home/models/Qwen-14B-Chat/generation_config.json
[INFO|configuration_utils.py:1000] 2024-06-29 15:22:11,842 >> Generate config GenerationConfig {
  "chat_format": "chatml",
  "do_sample": true,
  "eos_token_id": 151643,
  "max_new_tokens": 512,
  "max_window_size": 6144,
  "pad_token_id": 151643,
  "repetition_penalty": 1.1,
  "top_k": 0,
  "top_p": 0.8
}

06/29/2024 15:22:11 - WARNING - llamafactory.model.model_utils.checkpointing - You are using the old GC format, some features (e.g. BAdam) will be invalid.
06/29/2024 15:22:11 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
06/29/2024 15:22:11 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
06/29/2024 15:22:11 - INFO - llamafactory.model.adapter - ZeRO3/FSDP/PureBF16/BAdam detected, remaining trainable params as their original precision.
06/29/2024 15:22:11 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
06/29/2024 15:22:11 - INFO - llamafactory.model.model_utils.misc - Found linear modules: c_attn,c_proj,w1,w2
06/29/2024 15:22:12 - INFO - llamafactory.model.loader - trainable params: 27893760 || all params: 14195184640 || trainable%: 0.1965
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:642] 2024-06-29 15:22:12,247 >> Using auto half precision backend
06/29/2024 15:22:12 - WARNING - llamafactory.extras.callbacks - Previous trainer log in this folder will be deleted.
[INFO|deepspeed.py:329] 2024-06-29 15:22:12,398 >> Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB)
Installed CUDA version 12.4 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.2251906394958496 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000100, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
[2024-06-29 15:22:12,738] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-06-29 15:22:12,768] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-06-29 15:22:12,771] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-06-29 15:22:12,771] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-06-29 15:22:12,799] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2024-06-29 15:22:12,799] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2024-06-29 15:22:12,799] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-06-29 15:22:12,800] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
[2024-06-29 15:22:12,947] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
[2024-06-29 15:22:12,948] [INFO] [utils.py:782:see_memory_usage] MA 0.05 GB         Max_MA 4.35 GB         CA 0.06 GB         Max_CA 4 GB
[2024-06-29 15:22:12,948] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.09 GB, percent = 22.3%
[2024-06-29 15:22:12,955] [INFO] [stage3.py:130:__init__] Reduce bucket size 26214400
[2024-06-29 15:22:12,955] [INFO] [stage3.py:131:__init__] Prefetch bucket size 23592960
[2024-06-29 15:22:13,094] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-06-29 15:22:13,094] [INFO] [utils.py:782:see_memory_usage] MA 0.05 GB         Max_MA 0.05 GB         CA 0.06 GB         Max_CA 0 GB
[2024-06-29 15:22:13,094] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.09 GB, percent = 22.3%
Parameter Offload: Total persistent parameters: 10859520 in 361 params
[2024-06-29 15:22:15,623] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-06-29 15:22:15,624] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.05 GB         CA 0.06 GB         Max_CA 0 GB
[2024-06-29 15:22:15,624] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.14 GB, percent = 22.3%
[2024-06-29 15:22:15,774] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
[2024-06-29 15:22:15,775] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.06 GB         Max_CA 0 GB
[2024-06-29 15:22:15,775] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.14 GB, percent = 22.3%
[2024-06-29 15:22:41,741] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 1
[2024-06-29 15:22:41,742] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.06 GB         Max_CA 0 GB
[2024-06-29 15:22:41,742] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.16 GB, percent = 22.4%
[2024-06-29 15:22:41,894] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
[2024-06-29 15:22:41,895] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.06 GB         Max_CA 0 GB
[2024-06-29 15:22:41,895] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.16 GB, percent = 22.4%
[2024-06-29 15:22:42,060] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
[2024-06-29 15:22:42,061] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.06 GB         Max_CA 0 GB
[2024-06-29 15:22:42,061] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.23 GB, percent = 22.4%
[2024-06-29 15:22:42,211] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-06-29 15:22:42,212] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.06 GB         Max_CA 0 GB
[2024-06-29 15:22:42,212] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.23 GB, percent = 22.4%
[2024-06-29 15:22:42,391] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-06-29 15:22:42,392] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.06 GB         Max_CA 0 GB
[2024-06-29 15:22:42,392] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.3 GB, percent = 22.5%
[2024-06-29 15:22:42,392] [INFO] [stage3.py:486:_setup_for_real_optimizer] optimizer state initialized
[2024-06-29 15:22:42,702] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-06-29 15:22:42,703] [INFO] [utils.py:782:see_memory_usage] MA 0.05 GB         Max_MA 0.05 GB         CA 0.11 GB         Max_CA 0 GB
[2024-06-29 15:22:42,703] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory:  used = 28.37 GB, percent = 22.5%
[2024-06-29 15:22:42,703] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3
[2024-06-29 15:22:42,703] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-06-29 15:22:42,703] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-06-29 15:22:42,703] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.999)]
[2024-06-29 15:22:42,707] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-06-29 15:22:42,707] [INFO] [config.py:1001:print]   activation_checkpointing_config  {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2024-06-29 15:22:42,707] [INFO] [config.py:1001:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-06-29 15:22:42,707] [INFO] [config.py:1001:print]   amp_enabled .................. False
[2024-06-29 15:22:42,707] [INFO] [config.py:1001:print]   amp_params ................... False
[2024-06-29 15:22:42,707] [INFO] [config.py:1001:print]   autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2024-06-29 15:22:42,707] [INFO] [config.py:1001:print]   bfloat16_enabled ............. False
[2024-06-29 15:22:42,707] [INFO] [config.py:1001:print]   bfloat16_immediate_grad_update  False
[2024-06-29 15:22:42,707] [INFO] [config.py:1001:print]   checkpoint_parallel_write_pipeline  False
[2024-06-29 15:22:42,707] [INFO] [config.py:1001:print]   checkpoint_tag_validation_enabled  True
[2024-06-29 15:22:42,707] [INFO] [config.py:1001:print]   checkpoint_tag_validation_fail  False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fdec1622980>
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   communication_data_type ...... None
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   curriculum_enabled_legacy .... False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   curriculum_params_legacy ..... False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   data_efficiency_enabled ...... False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   dataloader_drop_last ......... False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   disable_allgather ............ False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   dump_state ................... False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   eigenvalue_enabled ........... False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   eigenvalue_gas_boundary_resolution  1
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   eigenvalue_layer_num ......... 0
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   eigenvalue_max_iter .......... 100
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   eigenvalue_stability ......... 1e-06
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   eigenvalue_tol ............... 0.01
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   eigenvalue_verbose ........... False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   elasticity_enabled ........... False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   flops_profiler_config ........ {
    "enabled": false,
    "recompute_fwd_factor": 0.0,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   fp16_auto_cast ............... False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   fp16_enabled ................. True
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   fp16_master_weights_and_gradients  False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   global_rank .................. 0
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   grad_accum_dtype ............. None
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   gradient_accumulation_steps .. 2
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   gradient_clipping ............ 1.0
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   gradient_predivide_factor .... 1.0
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   graph_harvesting ............. False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   initial_dynamic_scale ........ 65536
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   load_universal_checkpoint .... False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   loss_scale ................... 0
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   memory_breakdown ............. False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   mics_hierarchial_params_gather  False
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   mics_shard_size .............. -1
[2024-06-29 15:22:42,708] [INFO] [config.py:1001:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   optimizer_legacy_fusion ...... False
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   optimizer_name ............... None
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   optimizer_params ............. None
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   pld_enabled .................. False
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   pld_params ................... False
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   prescale_gradients ........... False
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   scheduler_name ............... None
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   scheduler_params ............. None
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   seq_parallel_communication_data_type  torch.float32
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   sparse_attention ............. None
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   sparse_gradients_enabled ..... False
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   steps_per_print .............. inf
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   timers_config ................ enabled=True synchronized=True
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   train_batch_size ............. 4
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   train_micro_batch_size_per_gpu  1
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   use_data_before_expert_parallel_  False
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   use_node_local_storage ....... False
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   wall_clock_breakdown ......... False
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   weight_quantization_config ... None
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   world_size ................... 2
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   zero_allow_untested_optimizer  True
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=26214400 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=23592960 param_persistence_threshold=51200 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   zero_enabled ................. True
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   zero_force_ds_cpu_optimizer .. True
[2024-06-29 15:22:42,709] [INFO] [config.py:1001:print]   zero_optimization_stage ...... 3
[2024-06-29 15:22:42,709] [INFO] [config.py:987:print_user_config]   json = {
    "train_batch_size": 4,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 2,
    "gradient_clipping": 1.0,
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": false
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1.000000e+09,
        "reduce_bucket_size": 2.621440e+07,
        "stage3_prefetch_bucket_size": 2.359296e+07,
        "stage3_param_persistence_threshold": 5.120000e+04,
        "stage3_max_live_parameters": 1.000000e+09,
        "stage3_max_reuse_distance": 1.000000e+09,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "steps_per_print": inf
}
[INFO|trainer.py:2128] 2024-06-29 15:22:42,709 >> ***** Running training *****
[INFO|trainer.py:2129] 2024-06-29 15:22:42,709 >>   Num examples = 3,600
[INFO|trainer.py:2130] 2024-06-29 15:22:42,709 >>   Num Epochs = 5
[INFO|trainer.py:2131] 2024-06-29 15:22:42,709 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2134] 2024-06-29 15:22:42,709 >>   Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|trainer.py:2135] 2024-06-29 15:22:42,709 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:2136] 2024-06-29 15:22:42,709 >>   Total optimization steps = 4,500
[INFO|trainer.py:2137] 2024-06-29 15:22:42,714 >>   Number of trainable parameters = 27,893,760
  0%|                                                                                                                                              | 0/4500 [00:00<?, ?it/s]
/root/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  warnings.warn(

  0%|                                                                                                                                  | 2/4500 [02:44<100:19:17, 80.29s/it]
  0%|▏                                                                                                                                  | 7/4500 [08:46<91:20:26, 73.19s/it]
{'loss': 1.2934, 'grad_norm': 1.2723101440923572, 'learning_rate': 2.2222222222222225e-06, 'epoch': 0.01}
{'loss': 1.19, 'grad_norm': 1.3058574250868817, 'learning_rate': 4.444444444444445e-06, 'epoch': 0.02}
  1%|▋                                                                                                                                 | 24/4500 [29:18<90:05:55, 72.47s/it]
{'loss': 1.1011, 'grad_norm': 1.4012755254348368, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.03}
  1%|▉                                                                                                                                 | 31/4500 [37:45<89:56:10, 72.45s/it]
  1%|█                                                                                                                                 | 35/4500 [42:35<89:53:07, 72.47s/it]
  1%|█▏                                                                                                                                | 39/4500 [47:25<89:48:27, 72.47s/it]
{'loss': 0.898, 'grad_norm': 1.3420188309333259, 'learning_rate': 8.88888888888889e-06, 'epoch': 0.04}
  1%|█▎                                                                                                                                | 45/4500 [54:40<89:39:22, 72.45s/it]
  1%|█▍                                                                                                                                | 46/4500 [55:52<89:37:32, 72.44s/it]
{'loss': 0.684, 'grad_norm': 1.3847420371646724, 'learning_rate': 1.1111111111111112e-05, 'epoch': 0.06}
[INFO|trainer.py:3478] 2024-06-29 16:24:24,295 >> Saving model checkpoint to saves/qwen/lora/sft/checkpoint-50
hiyouga commented 4 days ago

DeepSpeed ZeRO-3 needs NVLink to be fast.
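For intuition: under ZeRO-3 the 14.17B parameters (the num_elems reported at zero.init() above) are sharded across the two ranks, and the fp16 shards held on the other node must be gathered over the network for every forward and backward pass. A back-of-envelope upper bound at the reported ~150 MB/s (a sketch, not a measurement):

awk 'BEGIN {
  params = 14.17e9                  # num_elems from the zero.init() log
  bytes_per_pass = params * 2 / 2   # fp16 (2 B/param), half held on the remote node
  per_step = bytes_per_pass * 2     # gathered again for the backward pass
  printf "~%.0f s of parameter traffic per step\n", per_step / 150e6   # ~189 s
}'

Prefetching, overlap_comm, and parameter reuse cut this down in practice, but it lands in the same ballpark as the observed ~72 s/it, consistent with a run bound by the interconnect; NVLink within a node or RDMA between nodes is orders of magnitude faster than a ~150 MB/s Ethernet path.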

WangxuP commented 3 days ago

DeepSpeed ZeRO-3 needs NVLink to be fast.

In other words, on an ordinary network, is a result like this expected?