hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI
Apache License 2.0
333 stars 102 forks

Error running features/zero/train_v2.py when the model uses gradient_checkpointing #169

Closed wjizhong closed 1 year ago

wjizhong commented 1 year ago

🐛 Describe the bug

Traceback (most recent call last):
  File "/data1/users/jizhong1/ColossalAI-Examples/features/zero/train_v2.py", line 133, in <module>
    main()
  File "/dirname/ColossalAI-Examples/features/zero/train_v2.py", line 123, in main
    optimizer.backward(loss)
  File "/python_path/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py", line 154, in backward
    self.module.backward(loss)
  File "/python_path/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 266, in backward
    loss.backward()
  File "/python_path/lib/python3.9/site-packages/torch/_tensor.py", line 388, in backward
    return handle_torch_function(
  File "/python_path/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/python_path/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 171, in __torch_function__
    ret = func(*args, **kwargs)
  File "/python_path/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/python_path/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/python_path/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/python_path/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 130, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/python_path/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 887, in custom_forward
    return module(*inputs, use_cache, output_attentions)
  File "/python_path/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/python_path/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 400, in forward
    hidden_states = self.ln_1(hidden_states)
  File "/python_path/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/python_path/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    return F.layer_norm(
  File "/python_path/lib/python3.9/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40088) of binary: /python_path/bin/python
Traceback (most recent call last):
  File "/python_path/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/python_path/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/python_path/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/python_path/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/python_path/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/python_path/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Environment

No response

ver217 commented 1 year ago

Which part did you change, and which torch version are you using?

wjizhong commented 1 year ago

OS: CentOS 7, GPU: Tesla P100, driver version 470.82, CUDA version 11.3, torch version 1.12.1

It fails whenever checkout is set to True.

[image]

ver217 commented 1 year ago

> OS: CentOS 7, GPU: Tesla P100, driver version 470.82, CUDA version 11.3, torch version 1.12.1
>
> It fails whenever checkout is set to True. [image]

We do not support torch 1.12 yet.

wjizhong commented 1 year ago
[image]

ofey404 commented 1 year ago
[image]

@wjizhong Sorry for the inconsistency, but ColossalAI with pytorch 1.12 is still under development. (My bad!)

The download page would be the most reliable reference: Supported Pytorch Version.
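
For anyone hitting the same traceback, a quick guard at the top of the training script can flag the version mismatch before the backward pass fails. The following is only a minimal sketch: the helper names `major_minor` and `is_known_unsupported` and the `KNOWN_UNSUPPORTED` set are hypothetical, not part of ColossalAI, and the set lists only the 1.12 series because that is the one version this thread calls unsupported. The download page remains the authoritative list.

```python
import re

# Assumption: only the 1.12 series is listed here, because it is the one
# version described as unsupported in this thread. Adjust the set to match
# the official download page for the ColossalAI release you install.
KNOWN_UNSUPPORTED = {(1, 12)}


def major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '1.12.1+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_known_unsupported(version: str) -> bool:
    """Return True if the version's major.minor series is in KNOWN_UNSUPPORTED."""
    return major_minor(version) in KNOWN_UNSUPPORTED
```

In a training script this could be called as `is_known_unsupported(torch.__version__)` right after `import torch`, raising or warning early if it returns `True`.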