hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI

failed to run gpt example #36

Closed · feifeibear closed this issue 2 years ago

feifeibear commented 2 years ago

πŸ› Describe the bug

cd ColossalAI/examples/language/gpt
torchrun --standalone --nproc_per_node=1 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch

bash: /opt/lcsoftware/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/miniconda3-4.10.3-u6p3tgreee7aigtnvuhr44yqo7vcg6r6/lib/libtinfo.so.6: no version information available (required by bash)
Colossalai should be built with cuda extension to use the FP16 optimizer
/home/lcfjr/.local/lib/python3.9/site-packages/torch/cuda/__init__.py:143: UserWarning: NVIDIA A100-PCIE-80GB with CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70. If you want to use the NVIDIA A100-PCIE-80GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
colossalai - colossalai - 2022-02-24 15:04:02,751 INFO: process rank 0 is bound to device 0
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: initialized seed on rank 0, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024, the default parallel seed is ParallelMode.DATA.
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 1
colossalai - colossalai - 2022-02-24 15:04:02,772 INFO: Build data loader
colossalai - colossalai - 2022-02-24 15:04:02,864 INFO: Build model
Traceback (most recent call last):
  File "/home/lcfjr/codes/ColossalAI/examples/language/gpt/train_gpt.py", line 118, in <module>
    main()
  File "/home/lcfjr/codes/ColossalAI/examples/language/gpt/train_gpt.py", line 49, in main
    model = gpc.config.model.pop('type')(**gpc.config.model)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 402, in gpt2_small
    return _create_gpt_model(**model_kwargs)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 368, in _create_gpt_model
    model = GPT(**model_kwargs)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
    f(module, *args, **kwargs)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 261, in __init__
    self.embed = GPTEmbedding(embedding_dim=dim,
  File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
    f(module, *args, **kwargs)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/model_zoo/gpt/gpt.py", line 33, in __init__
    self.word_embeddings = col_nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx, dtype=dtype)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 254, in wrapper
    f(module, *args, **kwargs)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/colossalai/nn/layer/colossalai_layer/embedding.py", line 69, in __init__
    weight_initializer(self.embed.weight, fan_in=num_embeddings, fan_out=embedding_dim)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/colossalai/nn/init.py", line 31, in initializer
    return nn.init.normal_(tensor, mean, std)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/nn/init.py", line 151, in normal_
    return _no_grad_normal_(tensor, mean, std)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/nn/init.py", line 19, in _no_grad_normal_
    return tensor.normal_(mean, std)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'HPC-AI_1150681_0' has failed to send a keep-alive heartbeat to the rendezvous 'a5650b64-ab96-467e-861a-b345eaa8ab3b' due to an error of type RendezvousConnectionError.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1150747) of binary: /opt/lcsoftware/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-9.3.0/miniconda3-4.10.3-u6p3tgreee7aigtnvuhr44yqo7vcg6r6/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier.
Elapsed: 0.00041747093200683594 seconds
Traceback (most recent call last):
  File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 899, in _exit_barrier
    store_util.barrier(
  File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 52, in synchronize
    store.set(f"{key_prefix}{rank}", data)
RuntimeError: Broken pipe
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node 'HPC-AI_1150681_0' has failed to shutdown the rendezvous 'a5650b64-ab96-467e-861a-b345eaa8ab3b' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/home/lcfjr/.local/bin/torchrun", line 10, in <module>
    sys.exit(main())
  File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lcfjr/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_gpt.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-02-24_15:04:10
  host      : HPC-AI
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1150747)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Environment

No response
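To confirm the mismatch that the UserWarning above describes, the architectures the installed wheel was compiled for can be compared against the GPU's compute capability. These diagnostic one-liners are illustrative additions, not part of the original report:

python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import torch; print(torch.cuda.get_arch_list())"
python -c "import torch; print(torch.cuda.get_device_capability(0))"

On this setup the arch list should show only sm_37/sm_50/sm_60/sm_70, while the A100 reports capability (8, 0), which is exactly the incompatibility the warning points at.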
feifeibear commented 2 years ago

torch version 1.10.2

└─(16:25:02)──> nvcc --version ──(Thu,Feb24)β”€β”˜
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
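Note that nvcc reports the system CUDA toolkit (11.1), which is separate from the CUDA version the installed PyTorch wheel was built against; the latter can be checked directly (illustrative command, not from the original thread):

python -c "import torch; print(torch.version.cuda)"

For the '+cu102' wheel identified below, this prints 10.2, not 11.1.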

feifeibear commented 2 years ago

The issue comes from the torch version. With '1.10.2+cu102' (a CUDA 10.2 build), torch.nn.init.normal_ cannot initialize a GPU tensor here, because that build ships no kernels for the A100's sm_80 architecture.
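A minimal reproduction independent of ColossalAI (illustrative, assuming the same mismatched cu102 build on an A100) should hit the same failure, since normal_ is simply the first CUDA kernel the example launches:

python -c "import torch; torch.empty(8, device='cuda').normal_()"

Expected error: RuntimeError: CUDA error: no kernel image is available for execution on the device.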

feifeibear commented 2 years ago

Fixed the issue after correctly installing a PyTorch version built for the right CUDA toolkit.
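For readers hitting the same problem: on this machine (CUDA 11.1 toolkit, A100) a matching install at the time would have looked something like the command below; the exact version tags and index URL are illustrative, so check https://pytorch.org/get-started/locally/ for the current command:

pip install torch==1.10.2+cu111 -f https://download.pytorch.org/whl/cu111/torch_stable.html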