Closed: xiuliren closed this issue 4 years ago.
I got the same error using conda and following the documentation. I also tried to use dist, but got an out-of-memory error. I have 4 Titan X GPUs, each with 12 GB of memory. I checked the memory usage: it starts at ~2 GB and suddenly jumps to about 12 GB.
result = self.forward(*input, **kwargs)
File "/usr/people/jingpeng/workspace/PointFlow/models/cnf.py", line 93, in forward
options=self.solver_options,
File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/adjoint.py", line 129, in odeint_adjoint
ys = OdeintAdjointMethod.apply(*y0, func, t, flat_params, rtol, atol, method, options)
File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/adjoint.py", line 18, in forward
ans = odeint(func, y0, t, rtol=rtol, atol=atol, method=method, options=options)
File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/odeint.py", line 72, in odeint
solution = solver.integrate(t)
File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/solvers.py", line 31, in integrate
y = self.advance(t[i])
File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/dopri5.py", line 90, in advance
self.rk_state = self._adaptive_dopri5_step(self.rk_state)
File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/dopri5.py", line 103, in _adaptive_dopri
5_step
y1, f1, y1_error, k = _runge_kutta_step(self.func, y0, f0, t0, dt, tableau=_DORMAND_PRINCE_SHAMPINE_TABLEAU)
File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/rk_common.py", line 52, in _runge_kutta_
step
tuple(k_.append(f_) for k_, f_ in zip(k, func(ti, yi)))
File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/usr/people/jingpeng/workspace/PointFlow/models/odefunc.py", line 129, in forward
dy = self.diffeq(tc, y)
File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/usr/people/jingpeng/workspace/PointFlow/models/odefunc.py", line 96, in forward
dx = layer(context, dx)
File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/usr/people/jingpeng/workspace/PointFlow/models/diffeq_layers.py", line 85, in forward
ret = self._layer(x) * gate + bias
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 11.91 GiB total capacity; 9.54 GiB already allocated; 250.75 MiB free; 233.48 MiB cached)
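I checked the memory usage with a small helper along these lines (standard PyTorch 1.0 memory API; the call sites are only suggestions):

```python
import torch

def log_cuda_memory(tag, device=0):
    # memory_allocated / memory_cached are the PyTorch 1.0.x names;
    # newer releases rename memory_cached to memory_reserved.
    mib = 1024 ** 2
    print('[%s] allocated: %.0f MiB, cached: %.0f MiB' % (
        tag,
        torch.cuda.memory_allocated(device) / mib,
        torch.cuda.memory_cached(device) / mib))

# Call it e.g. before and after the CNF forward pass to see where the
# usage jumps from ~2 GiB to ~12 GiB.
```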
For the first error, I would say it could be caused by a different environment. Please set up the environment in the suggested way before running the code.
For the second error, which script did you run? For a single 12 GB GPU, I would suggest using this script: /scripts/shapenet_airplane_ae.sh
I have tried /scripts/shapenet_airplane_ae.sh, but got the first error as reported.
What's your PyTorch version? Did you install PyTorch using conda install pytorch=1.0.1 torchvision cudatoolkit=10.0 -c pytorch -y?
Yes, I did install PyTorch using that command; the version is 1.0.1:

>>> import torch
>>> torch.__version__
'1.0.1.post2'
For AE, could you try to comment out this line of code and see if it helps with debugging? https://github.com/stevenygd/PointFlow/blob/372215d42797ca062127f4bc4348d882136287b5/models/networks.py#L156
Thanks for the suggestion. I just tried it, but it does not work; I still have the same problem.
Maybe I should try to turn off DataParallel?
@jingpengw Just to double-check that we are on the same page: when you say commenting out L156 didn't work, what error did you get?
It seems that I got some output, just not the desired one. This is the error I get:
2501, solver='dopri5', sync_bn=False, te_max_sample_points=2048, time_length=0.5, tr_max_sample_points=2048, train_T=True, use_adjoint=True, use_deterministic_encoder=True, use_latent_flow=False, val_freq=10, viz_freq=1, weight_decay=0.0, world_size=1, zdim=128)
Number of trainable parameters of Point CNF: 927513
Total number of data:2832
Min number of points: (train)2048 (test)2048
Total number of data:405
Min number of points: (train)2048 (test)2048
Start epoch: 0 End epoch: 4000
Traceback (most recent call last):
File "train.py", line 272, in <module>
main()
File "train.py", line 268, in main
main_worker(args.gpu, save_dir, ngpus_per_node, args)
File "train.py", line 176, in main_worker
out = model(inputs, optimizer, step, writer)
File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/usr/people/jingpeng/workspace/PointFlow/models/networks.py", line 133, in forward
z_mu, z_sigma = self.encoder(x)
File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 144, in forward
return self.gather(outputs, self.output_device)
File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in gather
return gather(outputs, output_device, dim=self.dim)
File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration
Seems like the problem could be in the data parallel wrapper. If you are training in a single-GPU setting, you could remove the wrapper by setting an index for the GPU. You can see it in these lines: https://github.com/stevenygd/PointFlow/blob/4e6795a4d9b45a433068fd57b99ea909a9c58a96/train.py#L69
Basically, I think adding the flag --gpu 0 should give you that.
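For context, the branch those lines implement looks roughly like this (a simplified sketch, not the verbatim repo code):

```python
import torch

def wrap_model(model, gpu=None):
    # Simplified sketch of the branch linked above in train.py.
    if gpu is not None:
        # --gpu 0 takes this path: the model is pinned to one device and
        # nn.DataParallel is never involved.
        torch.cuda.set_device(gpu)
        return model.cuda(gpu)
    # Default multi-GPU path. DataParallel's gather() can only merge
    # tensors (or nested tuples/dicts of tensors); a forward() that
    # returns a plain float or None is what trips
    # "TypeError: zip argument #1 must support iteration".
    return torch.nn.DataParallel(model).cuda()
```

With --gpu 0 the gather step never runs, which is why it sidesteps the TypeError.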
> I got the same error using conda and following the documentation. [full comment and traceback quoted above; omitted here]
I have the same problem. Have you solved it?
I have solved this problem (RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 11.91 GiB total capacity; 9.54 GiB already allocated; 250.75 MiB free; 233.48 MiB cached)) by reducing the batch_size in the dist script.
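For reference, the batch size enters through the data loader, so halving it roughly halves the per-step activation memory the ODE solver needs. A minimal sketch, with sizes taken from the training log above (the real script wires in its own dataset, and its batch-size flag name may differ):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Sizes from the log above: 2832 point clouds of 2048 points each.
points = torch.randn(2832, 2048, 3)

# Lowering batch_size here (or via the script's batch-size flag) is what
# brings peak usage back under the 11.91 GiB reported in the OOM error.
loader = DataLoader(TensorDataset(points), batch_size=8, shuffle=True)
```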
Thanks for sharing this nice code. I tried to run the autoencoder script without dist, but got an error. It seems that I got some output which is not iterable?
BTW, I am using Python 3.7 with virtualenv, not conda and Python 3.6. The PyTorch version is 1.0.1.