Running features/zero/train_v2.py fails when the model uses gradient_checkpointing

wjizhong commented 1 year ago

🐛 Describe the bug

Traceback (most recent call last):
  File "/data1/users/jizhong1/ColossalAI-Examples/features/zero/train_v2.py", line 133, in <module>
    main()
  File "/dirname/ColossalAI-Examples/features/zero/train_v2.py", line 123, in main
    optimizer.backward(loss)
  File "/python_path/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py", line 154, in backward
    self.module.backward(loss)
  File "/python_path/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 266, in backward
    loss.backward()
  File "/python_path/lib/python3.9/site-packages/torch/_tensor.py", line 388, in backward
    return handle_torch_function(
  File "/python_path/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/python_path/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 171, in __torch_function__
    ret = func(*args, **kwargs)
  File "/python_path/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/python_path/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/python_path/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/python_path/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 130, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/python_path/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 887, in custom_forward
    return module(*inputs, use_cache, output_attentions)
  File "/python_path/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/python_path/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 400, in forward
    hidden_states = self.ln_1(hidden_states)
  File "/python_path/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/python_path/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    return F.layer_norm(
  File "/python_path/lib/python3.9/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40088) of binary: /python_path/bin/python
Traceback (most recent call last):
  File "/python_path/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/python_path/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/python_path/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/python_path/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/python_path/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/python_path/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: