hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: ColossalAI/examples/language/gpt/gemini$ bash run_gemini.sh failed! #3858

Open SeekPoint opened 1 year ago

SeekPoint commented 1 year ago

🐛 Describe the bug

I followed every step in the example, but it still failed.

```
(gh_ColossalAI_examples_language_gpt_gemini) r730ub20@r730ub20-M0:~/llm_dev/ColossalAI/examples/language/gpt/gemini$ bash run_gemini.sh
```

During handling of the above exception, another exception occurred:

```
Traceback (most recent call last):
  File "/home/r730ub20/llm_dev/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 352, in <module>
    main()
  File "/home/r730ub20/llm_dev/ColossalAI/examples/language/gpt/gemini/./train_gpt_demo.py", line 254, in main
    optimizer = HybridAdam(model.parameters(), lr=1e-3)
  File "/home/r730ub20/.local/lib/python3.8/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 82, in __init__
    cpu_optim = CPUAdamBuilder().load()
  File "/home/r730ub20/.local/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 167, in load
    self.check_runtime_build_environment()
  File "/home/r730ub20/.local/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 139, in check_runtime_build_environment
    check_system_pytorch_cuda_match(CUDA_HOME)
  File "/home/r730ub20/.local/lib/python3.8/site-packages/colossalai/kernel/op_builder/utils.py", line 87, in check_system_pytorch_cuda_match
    raise Exception(
Exception: [extension] Failed to build PyTorch extension because the detected CUDA version (11.7) mismatches the version that was used to compile PyTorch (10.2). Please make sure you have set the CUDA_HOME correctly and installed the correct PyTorch in https://pytorch.org/get-started/locally/ .

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 27053) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/r730ub20/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/r730ub20/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/r730ub20/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/r730ub20/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/r730ub20/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/r730ub20/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./train_gpt_demo.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-05-26_16:07:08
  host       : r730ub20-M0
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 27053)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

### Environment

_No response_
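
The exception is the extension builder comparing two CUDA versions: the toolkit it detects on the system (11.7 here) and the one the installed PyTorch wheel was compiled with (10.2 here). Both values can be checked outside the training script; a minimal sketch, assuming `nvcc` is on the PATH and `python` is the same interpreter torchrun launches:

```bash
# CUDA toolkit that the extension builder will compile against
nvcc --version | grep "release"

# CUDA version the installed PyTorch wheel was built with (prints None for CPU-only wheels)
python -c "import torch; print(torch.version.cuda)"
```

If the two major versions differ, the runtime build of the cpu_adam kernel fails exactly as shown above.
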
flybird11111 commented 1 year ago

🐛 Describe the bug

I followed every step in the example, but it still failed.

```
(gh_ColossalAI_examples_language_gpt_gemini) r730ub20@r730ub20-M0:~/llm_dev/ColossalAI/examples/language/gpt/gemini$ bash run_gemini.sh
+ export DISTPLAN=CAI_Gemini
+ DISTPLAN=CAI_Gemini
+ export GPUNUM=1
+ GPUNUM=1
+ export TPDEGREE=1
+ TPDEGREE=1
+ export PLACEMENT=cpu
+ PLACEMENT=cpu
+ export USE_SHARD_INIT=False
+ USE_SHARD_INIT=False
+ export BATCH_SIZE=16
+ BATCH_SIZE=16
+ export MODEL_TYPE=gpt2_medium
+ MODEL_TYPE=gpt2_medium
+ export TRAIN_STEP=10
+ TRAIN_STEP=10
+ '[' False = True ']'
+ USE_SHARD_INIT=
+ mkdir -p gemini_logs
+ torchrun --standalone --nproc_per_node=1 ./train_gpt_demo.py --tp_degree=1 --model_type=gpt2_medium --batch_size=16 --placement=cpu --distplan=CAI_Gemini --train_step=10
+ tee ./gemini_logs/gpt2_medium_CAI_Gemini_gpu_1_bs_16_tp_1_cpu.log
environmental variable OMP_NUM_THREADS is set to 56.
[05/26/23 16:06:57] INFO colossalai - colossalai - INFO: /home/r730ub20/.local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:522 set_device
                    INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[05/26/23 16:07:00] INFO colossalai - colossalai - INFO: /home/r730ub20/.local/lib/python3.8/site-packages/colossalai/context/parallel_context.py:558 set_seed
                    INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
                    INFO colossalai - colossalai - INFO: /home/r730ub20/.local/lib/python3.8/site-packages/colossalai/initialize.py:115 launch
                    INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
                    INFO colossalai - colossalai - INFO: ./train_gpt_demo.py:210 main
                    INFO colossalai - colossalai - INFO: gpt2_medium, CAI_Gemini, batch size 16
Traceback (most recent call last):
  File "/home/r730ub20/.local/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 161, in load
    op_module = self.import_op()
  File "/home/r730ub20/.local/lib/python3.8/site-packages/colossalai/kernel/op_builder/builder.py", line 110, in import_op
    return importlib.import_module(self.prebuilt_import_path)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  in _gcd_import:1014
  in _find_and_load:991
  in _find_and_load_unlocked:973
ModuleNotFoundError: No module named 'colossalai._C.cpu_adam'
```


CUDA is not available on your system.
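
Per the traceback, the runtime build has two preconditions: `torch.cuda.is_available()` must be true, and the toolkit found through CUDA_HOME must match the PyTorch build. Both can be inspected directly; a sketch using PyTorch's own extension utilities (it is an assumption here that ColossalAI resolves CUDA_HOME the same way when the environment variable is unset):

```bash
# is a GPU visible to this PyTorch build?
python -c "import torch; print('cuda available:', torch.cuda.is_available())"

# which CUDA toolkit will C++/CUDA extension builds pick up?
python -c "from torch.utils.cpp_extension import CUDA_HOME; print('CUDA_HOME:', CUDA_HOME)"
```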

tiansiyuan commented 1 year ago

Got the same issue while running: bash run_gemini.sh.

Error messages include:

Exception: [extension] Failed to build PyTorch extension because the detected CUDA version (12.1) mismatches the version that was used to compile PyTorch (11.3).Please make sure you have set the CUDA_HOME correctly and installed the correct PyTorch in https://pytorch.org/get-started/locally/ .

The ResNet example has the same issue after running:

pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113

Otherwise, it fails with a different error.

Torch-related module versions are:

torch 1.12.0+cu113
torchaudio 0.12.0+cu113
torchvision 0.13.0+cu113

No CUDA installation with version 12.1 is found on the system.
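
If no toolkit matching the wheel is installed, the fix can also go the other way: install a PyTorch wheel built against the CUDA that is actually on the machine. A sketch for a system with a CUDA 11.7 toolkit (the version pins are illustrative, not taken from this thread):

```bash
# illustrative: wheels built for CUDA 11.7, matching a local /usr/local/cuda-11.7 toolkit
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 \
    --extra-index-url https://download.pytorch.org/whl/cu117
```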

tiansiyuan commented 1 year ago

Setting CUDA_HOME properly will solve this problem, provided you have the needed CUDA installed.

export CUDA_HOME="/usr/local/cuda-11.7"

```
......
  warnings.warn(
/home/jovyan/work/workspace/tiansiyuan/siyuan2/lib/python3.10/site-packages/colossalai/kernel/op_builder/utils.py:94: UserWarning: [extension] The CUDA version on the system (11.7) does not match with the version (11.3) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions
......
```
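
The minor-version mismatch above is only a warning, and it can be avoided entirely by pointing CUDA_HOME at a toolkit whose major.minor matches `torch.version.cuda` (11.3 in this case). A sketch of the full sequence; the cuda-11.3 path is an example, use whichever /usr/local/cuda-* is actually installed:

```bash
# point the builder at a matching toolkit, verify, then rerun the example
export CUDA_HOME=/usr/local/cuda-11.3
"$CUDA_HOME/bin/nvcc" --version | grep "release"      # system toolkit version
python -c "import torch; print(torch.version.cuda)"   # version torch was built with
bash run_gemini.sh
```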