Closed wjizhong closed 1 year ago
What part did you change, and which torch version are you using?
System: CentOS 7; GPU: Tesla P100; driver version 470.82; CUDA version 11.3; torch version 1.12.1.
The run fails whenever checkpoint is set to True.
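For context, the checkpoint flag in the ColossalAI GPT-2 example controls activation checkpointing, which routes each transformer block through `torch.utils.checkpoint` (the code path visible in the traceback). With a Hugging Face GPT-2 model it is typically enabled as sketched below; this is a minimal illustration, not the exact code from `train_v2.py`:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny config so the model builds instantly, with no weight download.
config = GPT2Config(n_layer=2, n_head=2, n_embd=64)
model = GPT2LMHeadModel(config)

# Enable activation checkpointing: forward activations are discarded and
# recomputed during backward via torch.utils.checkpoint, which is exactly
# where the reported failure surfaces.
model.gradient_checkpointing_enable()
```

Disabling this flag (`gradient_checkpointing_disable()`) trades higher activation memory for skipping the failing recompute path, which can serve as a temporary workaround.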
We do not support torch 1.12 yet.
@wjizhong Sorry for the inconsistency, but ColossalAI with pytorch 1.12 is still under development. (My bad!)
The download page would be the most reliable reference: Supported PyTorch Versions.
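Until 1.12 support lands, a simple guard at the top of the training script can fail fast with a clear message instead of the opaque allocator error. `parse_major_minor` below is a hypothetical helper for illustration, not part of the ColossalAI API:

```python
def parse_major_minor(version_string):
    """Extract (major, minor) ints from a version string like '1.12.1+cu113'."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)

# Example usage at the top of the training script:
# import torch
# if parse_major_minor(torch.__version__) >= (1, 12):
#     raise RuntimeError("This ColossalAI release does not support torch>=1.12; "
#                        "please install a version listed on the download page.")
```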
🐛 Describe the bug
Traceback (most recent call last):
  File "/data1/users/jizhong1/ColossalAI-Examples/features/zero/train_v2.py", line 133, in <module>
    main()
  File "/dirname/ColossalAI-Examples/features/zero/train_v2.py", line 123, in main
    optimizer.backward(loss)
  File "/python_path/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py", line 154, in backward
    self.module.backward(loss)
  File "/python_path/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 266, in backward
    loss.backward()
  File "/python_path/lib/python3.9/site-packages/torch/_tensor.py", line 388, in backward
    return handle_torch_function(
  File "/python_path/lib/python3.9/site-packages/torch/overrides.py", line 1498, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/python_path/lib/python3.9/site-packages/colossalai/tensor/colo_tensor.py", line 171, in __torch_function__
    ret = func(*args, **kwargs)
  File "/python_path/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/python_path/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/python_path/lib/python3.9/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/python_path/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 130, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/python_path/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 887, in custom_forward
    return module(*inputs, use_cache, output_attentions)
  File "/python_path/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/python_path/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 400, in forward
    hidden_states = self.ln_1(hidden_states)
  File "/python_path/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/python_path/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 189, in forward
    return F.layer_norm(
  File "/python_path/lib/python3.9/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 40088) of binary: /python_path/bin/python
Traceback (most recent call last):
  File "/python_path/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/python_path/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/python_path/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/python_path/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/python_path/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/python_path/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Environment
No response