Closed — amiltonwong closed this issue 5 years ago
When I set BS=1, I also ran into this problem, but BS=2 works fine...
@LeiyuanMa, thanks for the suggestion.
After switching to BS=2, the training process starts, but after iter=3 it runs out of memory (even though my GPU has 12 GB of memory).
2975 images are loaded!
iter = 0 of 400 completed, loss = 4.142366409301758
taking snapshot ...
iter = 1 of 400 completed, loss = 3.235548496246338
iter = 2 of 400 completed, loss = 2.8805861473083496
iter = 3 of 400 completed, loss = 1.8505399227142334
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
File "train.py", line 251, in <module>
main()
File "train.py", line 218, in main
loss.backward()
File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/root/anaconda3/envs/tf1.3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 89, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
I didn't run into this problem, but BS=2 can't reproduce satisfactory results; we should try a larger batch size.
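One common workaround (not mentioned in this thread, so treat it as a suggestion) when GPU memory limits the batch size is gradient accumulation: run several small micro-batches, accumulate their gradients, and apply a single optimizer step, which matches the gradient of one larger batch. The toy loss below is purely illustrative:

```python
# Sketch of why gradient accumulation works, using a toy quadratic loss
# L(w) = mean_i (w - x_i)^2, so dL/dw = mean_i 2 * (w - x_i).

def grad(w, batch):
    """Mean gradient of the toy loss over a (micro-)batch."""
    return sum(2.0 * (w - x) for x in batch) / len(batch)

def accumulated_grad(w, data, micro_bs):
    """Average the micro-batch gradients, as PyTorch's loss.backward()
    accumulates them across micro-batches before one optimizer.step()."""
    micros = [data[i:i + micro_bs] for i in range(0, len(data), micro_bs)]
    return sum(grad(w, mb) for mb in micros) / len(micros)

data = [1.0, 2.0, 3.0, 4.0]
w = 0.5
full = grad(w, data)                  # gradient of one "large" batch of 4
accum = accumulated_grad(w, data, 1)  # four BS=1 micro-batches, averaged
assert abs(full - accum) < 1e-12      # identical up to float rounding
```

Note that this emulates a larger batch for the loss gradient only; batch-norm statistics are still computed per micro-batch, so it does not fully substitute for a genuinely larger BS with synchronized BN.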
Hi, @speedinghzl
I got
RuntimeError: invalid argument 3: divide by zero
for running_var.mul_((1 - ctx.momentum)).add_(ctx.momentum * var * n / (n - 1))
in functions.py, line 209, in forward
Any suggestion on how to fix it?
Thanks!
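For what it's worth, the error above suggests the Bessel correction `n / (n - 1)` in the running-variance update is being evaluated with `n == 1` (i.e. the BN layer sees a single sample per statistic, as happens with batch size 1). A minimal sketch of one possible guard, with a hypothetical helper name and default momentum:

```python
# Hypothetical sketch: guard the unbiased-variance factor n / (n - 1) in a
# batch-norm running_var update so n == 1 no longer divides by zero.

def update_running_var(running_var, var, n, momentum=0.1):
    """EMA update of running variance; falls back to the biased variance
    when n < 2, since the unbiased correction is undefined there."""
    unbiased = var * n / (n - 1) if n > 1 else var  # avoid divide by zero
    return running_var * (1 - momentum) + momentum * unbiased

# With n = 1 the update now succeeds instead of raising an error:
update_running_var(1.0, 0.0, 1)
```

Whether skipping the correction is acceptable statistically is a separate question; with n = 1 the batch variance carries little information, so a larger effective batch per GPU is still the more robust fix.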