hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI
Apache License 2.0

[Compatibility] Running OPT using PyTorch 1.12 and Gemini placement_policy = 'cuda' failed #166

Open feifeibear opened 1 year ago

feifeibear commented 1 year ago

🐛 Describe the bug

Just running examples/language/opt/run_clm.py reproduces the error: the program crashes with no error information. After I replaced placement_policy with 'cuda', it runs fine:

    # placement_policy decides where Gemini keeps model data; with torch 1.12
    # only 'cuda' works here, 'cpu' also crashes (see the stack trace below)
    placement_policy = 'cuda'
    chunk_manager = ChunkManager(chunk_size, process_group=pg,
                                 enable_distributed_storage=True,
                                 init_device=GeminiManager.get_default_device(placement_policy))
    gemini_manager = GeminiManager(placement_policy, chunk_manager)
    # wrap the model with ZeRO DDP driven by the Gemini memory manager
    model = ZeroDDP(model, gemini_manager)
    logger.info(f'{model.__class__.__name__} has been created', ranks=[0])

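For reference, a self-contained version of that change might look like the sketch below. The construction mirrors the snippet above; the import paths (colossalai.gemini, colossalai.nn.parallel, colossalai.tensor) and the wrap_with_gemini helper are assumptions for colossalai 0.1.8 inferred from the file paths in the stack trace further down, so check them against your installation. It also assumes the distributed environment has already been initialized (e.g. via colossalai.launch_from_torch).

    # Sketch only: import locations are assumed for colossalai 0.1.8
    from colossalai.gemini import ChunkManager, GeminiManager
    from colossalai.nn.parallel import ZeroDDP
    from colossalai.tensor import ProcessGroup

    def wrap_with_gemini(model, chunk_size, placement_policy='cuda'):
        # 'cuda' keeps parameters resident on the GPU and is the only policy
        # that runs here with torch 1.12; 'cpu' crashes (stack trace below).
        pg = ProcessGroup()
        chunk_manager = ChunkManager(chunk_size,
                                     process_group=pg,
                                     enable_distributed_storage=True,
                                     init_device=GeminiManager.get_default_device(placement_policy))
        gemini_manager = GeminiManager(placement_policy, chunk_manager)
        return ZeroDDP(model, gemini_manager)
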
Environment

colossalai 0.1.8+torch1.12cu11.3
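
The version line above can be collected directly from the interpreter; the sketch below assumes colossalai exposes a __version__ attribute, which releases around 0.1.8 do.

    import torch
    import colossalai

    # Report the versions mentioned in this issue (colossalai 0.1.8, torch 1.12, CUDA 11.3)
    print('colossalai', colossalai.__version__)
    print('torch     ', torch.__version__)
    print('cuda      ', torch.version.cuda)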

feifeibear commented 1 year ago

I also tried placement_policy = 'cpu'; it also crashed. The error stack is as follows:

    0%|          | 0/444 [00:00<?, ?it/s]
    use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
    Traceback (most recent call last):
      File "run_clm.py", line 575, in <module>
        main()
      File "run_clm.py", line 528, in main
        optimizer.backward(loss)
      File "/home/lcfjr/codes/ColossalAI/colossalai/zero/zero_optimizer.py", line 151, in backward
        self.module.backward(loss)
      File "/home/lcfjr/codes/ColossalAI/colossalai/nn/parallel/data_parallel.py", line 246, in backward
        loss.backward()
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/_tensor.py", line 388, in backward
        return handle_torch_function(
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/overrides.py", line 1498, in handle_torch_function
        result = torch_func_method(public_api, types, args, kwargs)
      File "/home/lcfjr/codes/ColossalAI/colossalai/tensor/colo_tensor.py", line 171, in __torch_function__
        ret = func(*args, **kwargs)
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
        return user_fn(self, *args)
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 130, in backward
        outputs = ctx.run_function(*detached_inputs)
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 674, in custom_forward
        return module(*inputs, output_attentions, None)
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/transformers/models/opt/modeling_opt.py", line 315, in forward
        hidden_states = self.self_attn_layer_norm(hidden_states)
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 189, in forward
        return F.layer_norm(
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/nn/functional.py", line 2503, in layer_norm
        return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
    RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.

    0%|          | 0/444 [00:06<?, ?it/s]
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2895986) of binary: /home/lcfjr/miniconda3/envs/dev/bin/python3
    Traceback (most recent call last):
      File "/home/lcfjr/miniconda3/envs/dev/bin/torchrun", line 33, in <module>
        sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
        return f(*args, **kwargs)
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
        run(args)
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
        elastic_launch(
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/lcfjr/miniconda3/envs/dev/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
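
Reading the first trace: because gradient checkpointing is enabled, the OPT layer's forward (ending in self_attn_layer_norm / F.layer_norm) is re-executed inside loss.backward(), and the RuntimeError fires there because the parameter data has apparently not been materialized on the device at that point. The snippet below is only a standalone plain-PyTorch illustration of that re-execution path; it does not reproduce the crash itself.

    import torch
    from torch.utils.checkpoint import checkpoint

    ln = torch.nn.LayerNorm(16)
    x = torch.randn(4, 16, requires_grad=True)

    # With checkpointing, ln's forward is re-run during backward
    # (torch/utils/checkpoint.py -> ctx.run_function -> F.layer_norm),
    # which is the frame where the RuntimeError above is raised.
    y = checkpoint(ln, x)
    y.sum().backward()
    print(x.grad.shape)  # torch.Size([4, 16])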

wgimperial commented 1 year ago

Encountered the same problem; is there a solution?

virgulvirgul commented 1 year ago

"After I replaced placement_policy with 'cuda', it is OK."

Got the same error; it was fixed after this change.