ghosthamlet / gpt2-ml-torch

Pytorch model for https://github.com/imcaspar/gpt2-ml
Apache License 2.0

DeepSpeed finetune stage 2 reports OVERFLOW, then exits automatically #18

Closed youngshall closed 3 years ago

youngshall commented 3 years ago

Update: I tried loading finetune_large_stage1_epoch_3 for inference and the output was a pile of meaningless text. I don't know whether the fine-tune is broken or whether I simply failed to load it: python gpt2_ml_torch/generate.py --prompt 宇宙的意义是 --max_len 300 --n_seq 3 --model_path ./models/finetune_large_stage1_epoch_3/
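(For reference, a minimal load sanity check I could run, assuming the saved folder is a standard Hugging Face GPT-2 checkpoint (config.json + pytorch_model.bin) and reusing the models/mega-clue-tok vocab.txt shown in the training log; the repo's generate.py is the authoritative path, this is only to separate "weights failed to load" from "fine-tune went wrong":)

```python
# Hypothetical load check -- not from the repo; assumptions noted above.
import torch
from transformers import BertTokenizer, GPT2LMHeadModel

ckpt = "./models/finetune_large_stage1_epoch_3/"
tok = BertTokenizer.from_pretrained("models/mega-clue-tok")   # vocab path from the training log
model = GPT2LMHeadModel.from_pretrained(ckpt).eval()          # fails loudly if the weights don't load

ids = tok("宇宙的意义是", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_length=50, do_sample=True, top_k=40)
print(tok.decode(out[0], skip_special_tokens=True))
```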

First of all, thank you for releasing the DeepSpeed fine-tuning code! I used the two-stage finetune command lines from finetune_lm.py directly, as follows:

Fine-tuning stage 1:

deepspeed --num_nodes 1 --num_gpus 1 finetune_lm.py --log_name finetune_large_stage1 --seq_len 1024 --epochs 3 --batch_size 1 --lr 5e-8 --device_ids 0 --train_data datasets/a_train.txt --valid_data datasets/a_val.txt --pretrained_path models/mega-clue-tok --freeze_body

Fine-tuning stage 2:

deepspeed --num_nodes 1 --num_gpus 1 finetune_lm.py --log_name finetune_large_stage2 --seq_len 1024 --epochs 10 --batch_size 1 --lr 5e-8 --device_ids 0 --train_data datasets/a_train.txt --valid_data datasets/a_val.txt --pretrained_path models/finetune_large_stage1_epoch_3

a_train.txt and a_val.txt are 18.5M and 0.9M respectively. Stage 1 completes without problems, and models/finetune_large_stage1_epoch_3 is saved successfully. But shortly after stage 2 starts, it reports errors and exits. The errors look like this:
....(many repeated lines)
[2021-02-25 01:12:33,394] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[2021-02-25 01:12:34,032] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0

My hardware is a single RTX 3090 24G plus 64G of RAM. I watched GPU memory and host memory carefully: nvidia-smi shows 19G/24G of GPU memory used, so it is not exceeded; host memory usage is about 40G/64G, also not exceeded.

I tried setting seq_len back to 300, halving a_train.txt, and upgrading deepspeed==0.3.7 to deepspeed==0.3.11; the behavior is the same.

I'm out of ideas at this point and can only turn to the author and other experts for help. Many thanks!

Below is the full stage 2 output:

root@4a2fe6fa9bc8:/transformer/gpt2-ml-torch# bash ./run_fine_tune_a.sh [2021-02-25 01:11:41,137] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2021-02-25 01:11:41,149] [INFO] [runner.py:358:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 finetune_lm.py --log_name finetune_large_stage2 --seq_len 300 --epochs 10 --batch_size 1 --lr 5e-8 --device_ids 0 --train_data datasets/a_train.txt --valid_data datasets/a_val.txt --pretrained_path models/finetune_large_stage1_epoch_3 [2021-02-25 01:11:41,576] [INFO] [launch.py:71:main] 0 NCCL_VERSION 2.8.3 [2021-02-25 01:11:41,576] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0]} [2021-02-25 01:11:41,576] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=1, node_rank=0 [2021-02-25 01:11:41,576] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]}) [2021-02-25 01:11:41,576] [INFO] [launch.py:100:main] dist_world_size=1 [2021-02-25 01:11:41,576] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0 2021-02-25 01:11:42,320 - INFO - {"NPP_VERSION": "11.1.2.301", "NVIDIA_VISIBLE_DEVICES": "all", "DALI_BUILD": "1758882", "CUSOLVER_VERSION": "11.0.1.105", "NVM_INC": "/usr/local/nvm/versions/node/v15.2.1/include/node", "CUBLAS_VERSION": "11.3.0.106", "HOSTNAME": "4a2fe6fa9bc8", "NVIDIA_REQUIRE_CUDA": "cuda>=9.0", "CUFFT_VERSION": "10.3.0.105", "CUDA_CACHE_DISABLE": "1", "TENSORBOARD_PORT": "6006", "_CUDA_COMPAT_STATUS": "CUDA Driver UNAVAILABLE (cuInit(0) returned 803)", "TORCH_CUDA_ARCH_LIST": "5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX", "NCCL_VERSION": "2.8.3", "CUSPARSE_VERSION": "11.3.0.10", "ENV": "/etc/shinit_v2", "PWD": "/transformer/gpt2-ml-torch", "OPENUCX_VERSION": "1.9.0", "NSIGHT_SYSTEMS_VERSION": "2020.3.4.32", "NVIDIA_DRIVER_CAPABILITIES": "compute,utility,video", "OMPI_MCA_pml": "^ucx", "TRT_VERSION": "7.2.2.1", "HOME": "/root", "COCOAPI_VERSION": "2.0+nv0.4.0", "CUDA_VERSION": "11.1.1.002", "PYTORCH_VERSION": "1.8.0a0+1606899", "CURAND_VERSION": "10.2.2.105", "PYTORCH_BUILD_NUMBER": "0", "DLPROF_VERSION": "20.12", "NVM_DIR": "/usr/local/nvm", "LESSCLOSE": "/usr/bin/lesspipe %s %s", "PYTHONPATH": ":/transformer/gpt2-ml-torch/", "TERM": "xterm", "LESSOPEN": "| /usr/bin/lesspipe %s", "OPENMPI_VERSION": "4.0.5", "NVJPEG_VERSION": "11.3.0.105", "LIBRARY_PATH": "/usr/local/cuda/lib64/stubs:", "PYTHONIOENCODING": "utf-8", "SHLVL": "2", "NVM_CD_FLAGS": "", "BASH_ENV": "/etc/bash.bashrc", "CUDNN_VERSION": "8.0.5.43", "NSIGHT_COMPUTE_VERSION": "2020.2.1.8", "DALI_VERSION": "0.28.0", "JUPYTER_PORT": "8888", "LD_LIBRARY_PATH": "/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64", "NVIDIA_BUILD_ID": "17950526", "CUDA_DRIVER_VERSION": "455.32.00", "LC_ALL": "C.UTF-8", "PYTORCH_BUILD_VERSION": "1.8.0a0+1606899", "_CUDA_COMPAT_PATH": "/usr/local/cuda/compat", "PATH": "/usr/local/nvm/versions/node/v15.2.1/bin:/opt/conda/bin:/opt/cmake-3.14.6-Linux-x86_64/bin/:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin", "MOFED_VERSION": "5.1-2.3.7", "NVM_BIN": "/usr/local/nvm/versions/node/v15.2.1/bin", "NVIDIA_PYTORCH_VERSION": "20.12", "TRTOSSVERSION": "20.12", "OLDPWD": "/workspace", "": "/opt/conda/bin/deepspeed", "CRC32C_SW_MODE": "auto", "CUDA_VISIBLE_DEVICES": "0", "MASTER_ADDR": "127.0.0.1", "MASTER_PORT": 
"29500", "WORLD_SIZE": "1", "RANK": "0", "LOCAL_RANK": "0", "KMP_DUPLICATE_LIB_OK": "True", "KMP_INIT_AT_FORK": "FALSE"} 2021-02-25 01:11:42,320 - INFO - { "lr": 5e-08, "warmup_steps": 200, "gradient_accumulation_steps": 1, "model_config": "configs/small.json", "vocab": "models/mega-clue-tok/vocab.txt", "pretrained_path": "models/finetune_large_stage1_epoch_3", "train_data": "datasets/a_train.txt", "valid_data": "datasets/a_val.txt", "freeze_body": false, "max_data_len": null, "log_name": "finetune_large_stage2", "no_cache": false, "device_ids": "0", "no_cuda": false, "seq_len": 300, "epochs": 10, "batch_size": 1, "seed": 62, "local_rank": 0, "deepspeed": true, "deepspeed_config": null, "deepscale": false, "deepscale_config": null, "deepspeed_mpi": false, "cpu_optimizer": true, "rank": 0, "world_size": 1, "cuda": true, "device": "cuda" } 2021-02-25 01:11:42,320 - INFO - { "activation_function": "gelu", "attn_pdrop": 0.1, "bos_token_id": 50256, "embd_pdrop": 0.1, "eos_token_id": 50256, "initializer_range": 0.014142135623731, "layer_norm_epsilon": 1e-05, "model_type": "gpt2", "n_ctx": 1024, "n_embd": 1536, "n_head": 24, "n_layer": 48, "n_positions": 1024, "resid_pdrop": 0.1, "summary_activation": null, "summary_first_dropout": 0.1, "summary_proj_to_labels": true, "summary_type": "cls_index", "summary_use_proj": true, "vocab_size": 8021 } 2021-02-25 01:11:42,320 - INFO - { "zero_optimization": { "stage": 2, "cpu_offload": true, "contiguous_gradients": true, "overlap_comm": false, "reduce_bucket_size": 3000000, "allgather_bucket_size": 3000000 }, "train_batch_size": 1, "gradient_accumulation_steps": 1, "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "activation_checkpointing": { "partition_activations": true, "contiguous_memory_optimization": true, "cpu_checkpointing": true }, "wall_clock_breakdown": false } Using /root/.cache/torch_extensions as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/cpu_adam/build.ninja... Building extension module cpu_adam... Allowing ninja to set a default number of workers... 
(overridable by setting the environment variable MAX_JOBS=N) [1/2] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -DCUDA_NO_HALF_OPERATORS -DCUDA_NO_HALF_CONVERSIONS -DCUDA_NO_BFLOAT16_CONVERSIONS -DCUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -UCUDA_NO_HALF_OPERATORS -UCUDA_NO_HALF_CONVERSIONS -UCUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /opt/conda/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o [2/2] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -L/opt/conda/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so Loading extension module cpu_adam... Time to load cpu_adam op: 11.430184841156006 seconds Adam Optimizer #0 is created with scalar arithmetic capability. Config: alpha=0.000000, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1 [2021-02-25 01:12:15,240] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.11, git-hash=unknown, git-branch=unknown [2021-02-25 01:12:15,251] [INFO] [engine.py:73:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1 [2021-02-25 01:12:15,272] [INFO] [engine.py:547:_configure_optimizer] Using client Optimizer as basic optimizer [2021-02-25 01:12:15,272] [INFO] [engine.py:556:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam ( Parameter Group 0 amsgrad: False betas: (0.9, 0.999) bias_correction: True eps: 1e-08 initial_lr: 5e-08 lr: 0.0 weight_decay: 0.01 ) Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'> [2021-02-25 01:12:15,272] [INFO] [engine.py:672:_configure_zero_optimizer] Creating fp16 ZeRO stage 2 optimizer Using /root/.cache/torch_extensions as PyTorch extensions root... Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja... Building extension module utils... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module utils... 
Time to load utils op: 0.24039649963378906 seconds [2021-02-25 01:12:15,513] [INFO] [stage2.py:130:init] Reduce bucket size 3000000 [2021-02-25 01:12:15,513] [INFO] [stage2.py:131:init] Allgather bucket size 3000000 [2021-02-25 01:12:15,513] [INFO] [stage2.py:132:init] CPU Offload: True group 0 param 0 = 1373809152 [2021-02-25 01:12:22,227] [INFO] [stage2.py:399:init__] optimizer state initialized [2021-02-25 01:12:22,228] [INFO] [engine.py:586:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage2.FP16_DeepSpeedZeroOptimizer object at 0x7f9534788610> [2021-02-25 01:12:22,228] [INFO] [engine.py:410:_configure_lr_scheduler] DeepSpeed using client LR scheduler [2021-02-25 01:12:22,228] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f958e1ba550> [2021-02-25 01:12:22,228] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.999)] [2021-02-25 01:12:22,228] [INFO] [config.py:733:print] DeepSpeedEngine configuration: [2021-02-25 01:12:22,228] [INFO] [config.py:737:print] activation_checkpointing_config <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f95346f1250> [2021-02-25 01:12:22,228] [INFO] [config.py:737:print] allreduce_always_fp32 ........ False [2021-02-25 01:12:22,228] [INFO] [config.py:737:print] amp_enabled .................. False [2021-02-25 01:12:22,228] [INFO] [config.py:737:print] amp_params ................... False [2021-02-25 01:12:22,228] [INFO] [config.py:737:print] checkpoint_tag_validation_enabled True [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] checkpoint_tag_validation_fail False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] disable_allgather ............ False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] dump_state ................... False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1} [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] elasticity_enabled ........... False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] flops_profiler_config ........ <deepspeed.profiling.config.DeepSpeedFlopsProfilerConfig object at 0x7f95346f1670> [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] fp16_enabled ................. True [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] global_rank .................. 0 [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] gradient_accumulation_steps .. 1 [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] gradient_clipping ............ 0.0 [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] gradient_predivide_factor .... 1.0 [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] initial_dynamic_scale ........ 4294967296 [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] loss_scale ................... 0 [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] memory_breakdown ............. False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] optimizer_legacy_fusion ...... False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] optimizer_name ............... None [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] optimizer_params ............. None [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] pld_enabled .................. False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] pld_params ................... False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] prescale_gradients ........... False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] scheduler_name ............... None [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] scheduler_params ............. None [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] sparse_attention ............. None [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] sparse_gradients_enabled ..... False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] steps_per_print .............. 10 [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] tensorboard_enabled .......... False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] tensorboard_job_name ......... DeepSpeedJobName [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] tensorboard_output_path ...... [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] train_batch_size ............. 1 [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] train_micro_batch_size_per_gpu 1 [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] wall_clock_breakdown ......... False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] world_size ................... 1 [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] zero_allow_untested_optimizer False [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] zero_config .................. { "allgather_bucket_size": 3000000, "allgather_partitions": true, "contiguous_gradients": true, "cpu_offload": true, "elastic_checkpoint": true, "load_from_fp32_weights": true, "overlap_comm": false, "reduce_bucket_size": 3000000, "reduce_scatter": true, "stage": 2 } [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] zero_enabled ................. True [2021-02-25 01:12:22,229] [INFO] [config.py:737:print] zero_optimization_stage ...... 2 [2021-02-25 01:12:22,230] [INFO] [config.py:739:print] json = { "activation_checkpointing":{ "contiguous_memory_optimization":true, "cpu_checkpointing":true, "partition_activations":true }, "fp16":{ "enabled":true, "hysteresis":2, "loss_scale":0, "loss_scale_window":1000, "min_loss_scale":1 }, "gradient_accumulation_steps":1, "train_batch_size":1, "wall_clock_breakdown":false, "zero_optimization":{ "allgather_bucket_size":3000000, "contiguous_gradients":true, "cpu_offload":true, "overlap_comm":false, "reduce_bucket_size":3000000, "stage":2 } } Using /root/.cache/torch_extensions as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Time to load utils op: 0.00031566619873046875 seconds 2021-02-25 01:12:22,230 - INFO - Epoch 1 2021-02-25 01:12:22,230 - INFO - Epoch 1 [2021-02-25 01:12:23,191] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 4294967296 2021-02-25 01:12:23,192 - INFO - Train Epoch: 1 [1/8141 (0%)] Loss: 7.765625 2021-02-25 01:12:23,192 - INFO - Train Epoch: 1 [1/8141 (0%)] Loss: 7.765625 [2021-02-25 01:12:23,831] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0 [2021-02-25 01:12:24,462] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 2147483648.0, reducing to 1073741824.0 [2021-02-25 01:12:25,101] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0 [2021-02-25 01:12:25,734] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0 [2021-02-25 01:12:26,381] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0 [2021-02-25 01:12:27,017] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0 [2021-02-25 01:12:27,650] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0 [2021-02-25 01:12:28,288] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0 [2021-02-25 01:12:28,929] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0 [2021-02-25 01:12:28,930] [INFO] [timer.py:163:stop] 0/10, SamplesPerSec=1.5696152153323526 [2021-02-25 01:12:29,563] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0 2021-02-25 01:12:29,564 - INFO - Train Epoch: 1 [11/8141 (0%)] Loss: 7.531250 2021-02-25 01:12:29,564 - INFO - Train Epoch: 1 [11/8141 (0%)] Loss: 7.531250 [2021-02-25 01:12:30,200] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0 [2021-02-25 01:12:30,841] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0 [2021-02-25 01:12:31,480] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0 [2021-02-25 01:12:32,116] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0 [2021-02-25 01:12:32,758] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0 [2021-02-25 01:12:33,394] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0 [2021-02-25 01:12:34,032] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0 root@4a2fe6fa9bc8:/transformer/gpt2-ml-torch#

ghosthamlet commented 3 years ago
  1. If inference with finetune_large_stage1_epoch_3 gives bad results, stage 1 fine-tuning may have gone wrong. What was the loss at the end of fine-tuning? If it is above 3, something is wrong; check your input data carefully, it is very likely a data problem.

  2. The messages below do not affect training. The OVERFLOW refers to the weights overflowing (fp16), not to GPU or host memory; "Attempted loss scale" means DeepSpeed automatically lowered the loss scale, and the weights will quickly return to normal as training continues (see the sketch below): [2021-02-25 01:12:34,032] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
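For intuition, here is a toy sketch of how fp16 dynamic loss scaling behaves. It is not DeepSpeed's actual implementation, but the constants mirror the config in the log above (initial scale 4294967296 = 2**32, loss_scale_window 1000, min_loss_scale 1): on an overflow the optimizer step is skipped and the scale is halved; after a window of clean steps it is raised again.

```python
# Toy illustration of fp16 dynamic loss scaling -- NOT DeepSpeed's real code.
import torch

class ToyLossScaler:
    def __init__(self, init_scale=2 ** 32, scale_window=1000, min_scale=1.0):
        self.scale = float(init_scale)
        self.scale_window = scale_window
        self.min_scale = min_scale
        self.good_steps = 0

    def step(self, optimizer, scaled_grads):
        # An "OVERFLOW" is simply an inf/NaN in the scaled fp16 gradients.
        if any(not torch.isfinite(g).all() for g in scaled_grads):
            new_scale = max(self.scale / 2, self.min_scale)
            print(f"OVERFLOW! Skipping step. Attempted loss scale: {self.scale}, "
                  f"reducing to {new_scale}")
            self.scale = new_scale      # halve the scale ...
            self.good_steps = 0
            return False                # ... and skip this optimizer step
        for g in scaled_grads:          # unscale the gradients, then step as usual
            g.div_(self.scale)
        optimizer.step()
        self.good_steps += 1
        if self.good_steps % self.scale_window == 0:
            self.scale *= 2             # grow the scale again after a clean window
        return True
```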

Update: Before the automatic exit, was there really no output other than the above? It could be unstable hardware; try rerunning the training a few times.

youngshall commented 3 years ago

Thanks for the reply! I looked through the log carefully, and the stage 1 loss is indeed wrong: the longer it fine-tunes, the higher it goes, climbing all the way to 7~9. To rule out a data problem, I used test_train.txt and test_val.txt directly; the log shows the same behavior, with the loss shooting up right in the first epoch:
2021-02-26 04:48:52,005 - INFO - Train Epoch: 1 [1/74 (1%)] Loss: 2.882812
2021-02-26 04:48:52,005 - INFO - Train Epoch: 1 [1/74 (1%)] Loss: 2.882812
2021-02-26 04:48:52,695 - INFO - Train Epoch: 1 [11/74 (15%)] Loss: 3.199219
2021-02-26 04:48:52,695 - INFO - Train Epoch: 1 [11/74 (15%)] Loss: 3.199219
2021-02-26 04:48:53,417 - INFO - Train Epoch: 1 [21/74 (28%)] Loss: 7.734375
2021-02-26 04:48:53,417 - INFO - Train Epoch: 1 [21/74 (28%)] Loss: 7.734375
2021-02-26 04:48:54,355 - INFO - Train Epoch: 1 [31/74 (42%)] Loss: 8.117188
2021-02-26 04:48:54,355 - INFO - Train Epoch: 1 [31/74 (42%)] Loss: 8.117188
2021-02-26 04:48:55,266 - INFO - Train Epoch: 1 [41/74 (55%)] Loss: 7.769531

As for the automatic exit in stage 2: I repeated the run 10+ times and rebooted the machine, with the same behavior every time. If it is a hardware or library issue, the only thing I can think of is an incompatibility between the PyTorch and transformers versions, because the GPU is a new 3090 and setting up an environment is troublesome, so I used the PyTorch docker container downloaded from NVIDIA NGC; pip list shows torch 1.8.0a0+1606899. For now I can only assume the problem comes from the failed stage 1 finetune.

ghosthamlet commented 3 years ago

Sorry, my point 1 was not accurate. The stage 1 loss will be relatively high, and it is normal for it to rise at the very start of training; as long as it gradually comes down afterwards, it is fine. The exact value depends on the size and quality of your dataset. The dataset I trained on is fairly large, which is why my stage 1 loss was below 3. If the loss stays above 3, inference quality will indeed be poor, and you need stage 2 training to bring it below 3 before the output looks normal. Also, the test_train.txt bundled with the gpt2-ml-torch repo contains far too little data; it only verifies that the code runs and cannot be used to judge training quality, so you need to watch how the loss evolves on your own data. As for stage 2 exiting by itself: since no error related to the exit is reported, it may be docker-related; check the docker status and its logs. Both your stage 1 and stage 2 finetune runs start and run without problems, so the exit is not a version incompatibility; a failed stage 1 finetune could only keep stage 2 from converging, it would not make it exit.

youngshall commented 3 years ago

Whether with test_train.txt or with my own 18M train.txt, the log shows the loss staying at 7~9 and never coming down, so it feels like the fine-tuning failed. How large is your train.txt? I have also picked up a 48G RTX 8000, and I'll see whether I can drop DeepSpeed and fine-tune directly in single precision with plain PyTorch.

ghosthamlet commented 3 years ago

My data is over 200M. Yours is only 18M, so a loss around 7 may well be normal, since stage 1 only fine-tunes the weights of the last layer. If the data itself is well formatted and of decent quality, stage 2, which fine-tunes all the weights, can still bring it below 3. My GPUs are a 1080ti and a 2080ti; I have not tested a 3090 or an RTX 8000, and fp16 fine-tuning may behave somewhat differently on those two cards. In the end you will only be able to tell what the problem is once stage 2 runs normally and you can see how it converges.

NLPIG commented 3 years ago

Hello author, with your help I have now fine-tuned successfully (3080ti 12G + 94G RAM). The stage 1 loss == 2.7 and generate produces fairly good results. If I continue with stage 2 training and the loss drops below 2, could the generated output turn into the training corpus itself (i.e. the model overfits to the point of reciting the text)? Also, I find stage 2 training very slow: stage 1 finished 100 epochs in apparently under an hour (corpus around 250MB), whereas stage 2 splits the data into 530k parts, and by my calculation it would take 29 days of non-stop training to finish a single epoch. Why is stage 2 so much slower?

ghosthamlet commented 3 years ago

Stage 1 only fine-tunes the parameters of the model's last layer and freezes all the other layers, so training is fast. Stage 2 fine-tunes all 1.5 billion parameters across every layer, so being this much slower is normal. Given that your corpus is not small, getting the loss from 2.7 down to below 2 would be very hard even in stage 2, and there is little need to: around 2.3 already gives very good results. You do not have to finish a whole epoch; it is best to add Tensorboard to monitor the loss so you can stop training early.
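For illustration only, a sketch of what separates the two stages. The repo's --freeze_body flag is the real switch; the layer prefixes below ("transformer.h.47", "transformer.ln_f") are an assumption based on the 48-layer config shown earlier in the thread, and may not match what --freeze_body actually keeps trainable.

```python
# Illustrative sketch of stage 1 vs stage 2 -- NOT the repo's freeze_body code.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("models/mega-clue-tok")

def freeze_body(model, trainable_prefixes=("transformer.h.47", "transformer.ln_f")):
    # Stage 1 (assumed): train only the last block and the final layer norm.
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith(trainable_prefixes)

def unfreeze_all(model):
    # Stage 2: every one of the ~1.5B parameters is trainable again.
    for p in model.parameters():
        p.requires_grad = True

freeze_body(model)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"stage 1 trains {trainable:,} of {total:,} parameters")
```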

NLPIG commented 3 years ago

OK. Two more questions. 1. It currently shows 【Epoch 1: [1821/522842] [0%]】 — does that mean the data was split into 522,842 parts and only 1,821 have been trained so far, or does it mean something else? 2. "You don't have to finish a whole epoch, you can stop training early" — if I stop early, is the model saved into the folder produced at the end of stage 1 (i.e. automatically overwritten)? I noticed that stage 1 creates a folder per epoch to save the model, and the first stage 2 epoch has not finished yet.

ghosthamlet commented 3 years ago

1. You are right. 2. The folder name is prefixed with the --log_name argument from the command line, so as long as stage 2's --log_name differs from stage 1's, nothing gets overwritten. The model is saved once after each completed epoch as pytorch_model.bin, but it is also saved every 60,000 batches within an epoch, which is why you can stop early once 60,000 batches have been trained. That checkpoint goes into the same folder as pytorch_model_60000.bin, where the 60000 in the name is the batch count at save time; before running generate you need to rename pytorch_model_60000.bin to pytorch_model.bin. The 1821 from question 1 divided by your --batch_size argument gives the batch count. As for your earlier worry about overfitting to the point of reciting the corpus: if you pass a sufficiently large and random validation set via --valid_data and complete at least one epoch, you will see an Eval (or Valid, they mean the same thing) loss for the validation set. If it is no more than about 0.2 above the Train loss, overfitting is not much of a concern; if it is below the Train loss, the model may be underfitting and need more training, but as long as the gap is small that is also fine. If you monitor with Tensorboard, comparing the loss curves is more accurate.
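A small sketch of that rename step and the Train/Eval gap check. The output folder name and the loss values below are hypothetical; the file names and the 60,000-batch save interval come from the reply above.

```python
# Rename a mid-epoch checkpoint so generate.py picks it up, as described above.
# "finetune_large_stage2_epoch_1" is a hypothetical output folder name.
import shutil
from pathlib import Path

ckpt_dir = Path("models/finetune_large_stage2_epoch_1")
mid_ckpt = ckpt_dir / "pytorch_model_60000.bin"    # saved every 60,000 batches
final_ckpt = ckpt_dir / "pytorch_model.bin"        # the name generate expects

if mid_ckpt.exists() and not final_ckpt.exists():
    shutil.copy(mid_ckpt, final_ckpt)              # copy, keeping the original file
    print(f"copied {mid_ckpt.name} -> {final_ckpt.name}")

# Rough overfitting check from the same reply (hypothetical loss values):
train_loss, eval_loss = 2.25, 2.40
if eval_loss - train_loss <= 0.2:
    print("Eval loss is within ~0.2 of Train loss: overfitting is probably not a concern")
```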

NLPIG commented 3 years ago

OK, thank you! It is around 2.4 now, and I no longer need to worry about overfitting either, haha. Really appreciate it!

ghosthamlet commented 3 years ago

You're welcome. I'll close this one; feel free to open a new issue if you run into anything else.