zhywanna opened this issue 3 years ago
The error happened right after the line "Finished Training" was printed: CUDA out of memory, with "7.55 GiB reserved in total by PyTorch" (the full message is in the traceback below). I'm not sure what happened here, but it seems like PyTorch is eating up a lot of memory.
Regarding the suggestion to call pt.cuda.empty_cache() before the getMPI call: thanks for your reply! I added pt.cuda.empty_cache() there, but the error still happened...
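For readers hitting the same issue, the suggestion quoted above amounts to releasing cached but currently unused allocator blocks right before the memory-heavy call. Below is a minimal sketch of that idea, assuming `pt` is just an alias for torch (as in the snippet above); `run_with_cache_cleared` and `step_fn` are hypothetical helpers for illustration, not functions from the repo:

```python
import torch as pt  # assuming `pt` aliases torch, as in the snippet above

def run_with_cache_cleared(step_fn, *args, **kwargs):
    """Release cached, currently unreferenced CUDA blocks, then run a heavy step.

    `step_fn` is a stand-in for something like the getMPI call. Note that
    empty_cache() cannot free tensors that are still referenced, so this alone
    may not be enough to avoid an out-of-memory error (as reported above).
    """
    if pt.cuda.is_available():
        pt.cuda.empty_cache()
    return step_fn(*args, **kwargs)

# Example, matching the call shown in the traceback below:
# info = run_with_cache_cleared(getMPI, model, dataset.sfm, dataloader=dataloader_train)
```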
In addition, the error only happens when I use my own dataset; I can run your crest_demo without any problem. My GPUs: NVIDIA GTX 1080 Ti × 4. The full traceback is below.
Loading Model @ Epoch 4000
  File "train.py", line 751, in <module>
    train()
  File "train.py", line 584, in train
    generateAlpha(model, dataset, dataloader_val, None, runpath, dataloader_train = dataloader_train)
  File "train.py", line 494, in generateAlpha
    info = getMPI(model, dataset.sfm, dataloader = dataloader_train)
  File "train.py", line 455, in getMPI
    out = model.seq1(bigcoords)
  File "/root/anaconda3/envs/nex/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/nex/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/anaconda3/envs/nex/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/anaconda3/envs/nex/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/root/anaconda3/envs/nex/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 1.
Original Traceback (most recent call last):
File "/root/anaconda3/envs/nex/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, *kwargs)
File "/root/anaconda3/envs/nex/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(input, kwargs)
File "/nex/utils/mlp.py", line 29, in forward
return self.seq1(x)
File "/root/anaconda3/envs/nex/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, *kwargs)
File "/root/anaconda3/envs/nex/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/root/anaconda3/envs/nex/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(input, **kwargs)
File "/root/anaconda3/envs/nex/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 714, in forward
return F.leaky_relu(input, self.negative_slope, self.inplace)
File "/root/anaconda3/envs/nex/lib/python3.8/site-packages/torch/nn/functional.py", line 1378, in leaky_relu
result = torch._C._nn.leaky_relu(input, negative_slope)
RuntimeError: CUDA out of memory. Tried to allocate 3.51 GiB (GPU 1; 10.92 GiB total capacity; 4.26 GiB already allocated; 2.61 GiB free; 7.55 GiB reserved in total by PyTorch)
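The allocation fails while the entire coordinate tensor (`bigcoords`) is pushed through `model.seq1` in one shot inside getMPI (train.py line 455). A generic way to bound peak memory for that kind of evaluation is to process the coordinates in chunks under `torch.no_grad()`. The sketch below is not the repo's code, only the general technique; `eval_in_chunks`, the toy MLP, and the tensor shapes are placeholders standing in for `model.seq1` and `bigcoords`:

```python
import torch

@torch.no_grad()
def eval_in_chunks(module, coords, chunk_size=65536):
    """Evaluate `module` over `coords` in slices to bound peak GPU memory.

    Placeholder example: `module` stands in for something like model.seq1 and
    `coords` for the big coordinate tensor; all shapes here are made up.
    """
    outputs = []
    for start in range(0, coords.shape[0], chunk_size):
        outputs.append(module(coords[start:start + chunk_size]))
    return torch.cat(outputs, dim=0)

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    mlp = torch.nn.Sequential(          # toy stand-in for model.seq1
        torch.nn.Linear(3, 256),
        torch.nn.LeakyReLU(),
        torch.nn.Linear(256, 4),
    ).to(device)
    coords = torch.rand(1_000_000, 3, device=device)  # stand-in for bigcoords
    out = eval_in_chunks(mlp, coords)
    print(out.shape)  # torch.Size([1000000, 4])
```

Chunking keeps only one slice's activations alive at a time, so peak allocation scales with chunk_size rather than the full tensor; whether this fits how getMPI consumes the output is something the authors would need to confirm.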