stevenygd / PointFlow

PointFlow : 3D Point Cloud Generation with Continuous Normalizing Flows
https://www.guandaoyang.com/PointFlow/
MIT License

TypeError: zip argument #1 must support iteration #3

Closed: xiuliren closed this issue 4 years ago

xiuliren commented 5 years ago

Thanks for sharing this nice code. I tried to run the autoencoder script without dist, but got an error. It seems that I get some output which is not iterable?

BTW, I am using Python 3.7 with virtualenv, not conda with Python 3.6. The PyTorch version is 1.0.1.

Traceback (most recent call last):
  File "train.py", line 272, in <module>
    main()
  File "train.py", line 268, in main
    main_worker(args.gpu, save_dir, ngpus_per_node, args)
  File "train.py", line 176, in main_worker
    out = model(inputs, optimizer, step, writer)
  File "/usr/people/jingpeng/workspace/PointFlow/jwu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/people/jingpeng/workspace/PointFlow/models/networks.py", line 133, in forward
    z_mu, z_sigma = self.encoder(x)
  File "/usr/people/jingpeng/workspace/PointFlow/jwu/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/people/jingpeng/workspace/PointFlow/jwu/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 144, in forward
    return self.gather(outputs, self.output_device)
  File "/usr/people/jingpeng/workspace/PointFlow/jwu/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 156, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/usr/people/jingpeng/workspace/PointFlow/jwu/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/usr/people/jingpeng/workspace/PointFlow/jwu/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/people/jingpeng/workspace/PointFlow/jwu/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration
xiuliren commented 5 years ago

I got the same error using conda and following the documentation. I then tried to use dist, but got an out-of-memory error. I have 4 TitanX GPUs, each with 12 GB of memory. Watching the memory usage, it starts at ~2 GB and then suddenly jumps to about 12 GB.

    result = self.forward(*input, **kwargs)
  File "/usr/people/jingpeng/workspace/PointFlow/models/cnf.py", line 93, in forward
    options=self.solver_options,
  File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/adjoint.py", line 129, in odeint_adjoint
    ys = OdeintAdjointMethod.apply(*y0, func, t, flat_params, rtol, atol, method, options)
  File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/adjoint.py", line 18, in forward
    ans = odeint(func, y0, t, rtol=rtol, atol=atol, method=method, options=options)
  File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/odeint.py", line 72, in odeint
    solution = solver.integrate(t)
  File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/solvers.py", line 31, in integrate
    y = self.advance(t[i])
  File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/dopri5.py", line 90, in advance
    self.rk_state = self._adaptive_dopri5_step(self.rk_state)
  File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/dopri5.py", line 103, in _adaptive_dopri5_step
    y1, f1, y1_error, k = _runge_kutta_step(self.func, y0, f0, t0, dt, tableau=_DORMAND_PRINCE_SHAMPINE_TABLEAU)
  File "/usr/people/jingpeng/workspace/PointFlow/torchdiffeq/torchdiffeq/_impl/rk_common.py", line 52, in _runge_kutta_step
    tuple(k_.append(f_) for k_, f_ in zip(k, func(ti, yi)))
  File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/people/jingpeng/workspace/PointFlow/models/odefunc.py", line 129, in forward
    dy = self.diffeq(tc, y)
  File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/people/jingpeng/workspace/PointFlow/models/odefunc.py", line 96, in forward
    dx = layer(context, dx)
  File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/people/jingpeng/workspace/PointFlow/models/diffeq_layers.py", line 85, in forward
    ret = self._layer(x) * gate + bias
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 11.91 GiB total capacity; 9.54 GiB already allocated; 250.75 MiB free; 233.48 MiB cached)
stevenygd commented 5 years ago

For the first error, I would say it could be caused by a different environment. Please use the suggested way to set up the environment before running the code.

For the second error, I wonder which script you ran? For a single 12GB GPU, I would suggest using this script: /scripts/shapenet_airplane_ae.sh

xiuliren commented 5 years ago

I have tried /scripts/shapenet_airplane_ae.sh, but got the first error as reported.

stevenygd commented 5 years ago

What's your PyTorch version? Did you install it using conda install pytorch=1.0.1 torchvision cudatoolkit=10.0 -c pytorch -y?

xiuliren commented 5 years ago

Yes, I did install PyTorch using that command. The version is 1.0.1:

import torch
torch.__version__
'1.0.1.post2'


stevenygd commented 5 years ago

For the AE, could you try commenting out this line of code and see if it helps with debugging? https://github.com/stevenygd/PointFlow/blob/372215d42797ca062127f4bc4348d882136287b5/models/networks.py#L156

xiuliren commented 5 years ago

Thanks for the suggestion. I just tried it; it does not work. I still have the same problem.

xiuliren commented 5 years ago

Maybe I should try to turn off DataParallel?

stevenygd commented 5 years ago

@jingpengw Just to double-check whether we are on the same page: when you say commenting out L156 didn't work, what error did you get?

xiuliren commented 5 years ago

It seems that I do get some output, just not the desired one? This is the error I get:

2501, solver='dopri5', sync_bn=False, te_max_sample_points=2048, time_length=0.5, tr_max_sample_points=2048, train_T=True, use_adjoint=True, use_deterministic_encoder=True, use_latent_flow=False, val_freq=10, viz_freq=1, weight_decay=0.0, world_size=1, zdim=128)
Number of trainable parameters of Point CNF: 927513
Total number of data:2832
Min number of points: (train)2048 (test)2048
Total number of data:405
Min number of points: (train)2048 (test)2048
Start epoch: 0 End epoch: 4000
Traceback (most recent call last):
  File "train.py", line 272, in <module>
    main()
  File "train.py", line 268, in main
    main_worker(args.gpu, save_dir, ngpus_per_node, args)
  File "train.py", line 176, in main_worker
    out = model(inputs, optimizer, step, writer)
  File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/people/jingpeng/workspace/PointFlow/models/networks.py", line 133, in forward
    z_mu, z_sigma = self.encoder(x)
  File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 144, in forward
    return self.gather(outputs, self.output_device)
  File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/opt/anaconda3/envs/PointFlow/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration
Done
stevenygd commented 4 years ago

It seems like the problem could be in the data parallel wrapper. If you are training in a single-GPU setting, you could potentially remove the wrapper by setting an index for the GPU. You can see it in these lines: https://github.com/stevenygd/PointFlow/blob/4e6795a4d9b45a433068fd57b99ea909a9c58a96/train.py#L69

Basically, I think adding the flag --gpu 0 should give you that.
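
To make that concrete, here is a minimal sketch of the single-GPU vs. DataParallel split and why the gather step can raise this TypeError. It is illustrative only: the function wrap_encoder and its arguments are made up for this example, not the actual train.py code.

import torch
import torch.nn as nn

def wrap_encoder(encoder, gpu=None):
    if gpu is not None:
        # Single-GPU path (what --gpu 0 selects): no DataParallel wrapper,
        # so nn.parallel.gather() is never called and the TypeError above
        # cannot be triggered.
        torch.cuda.set_device(gpu)
        return encoder.cuda(gpu)
    # Multi-GPU path: DataParallel replicates the encoder and gathers the
    # per-replica outputs. gather() recurses through tuples/lists of tensors;
    # if a replica's forward returns a non-iterable that is not a tensor
    # (for example None, or a plain Python number), gather() ends up calling
    # zip() on it, which raises "zip argument #1 must support iteration".
    return nn.DataParallel(encoder).cuda()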

doudoulaile commented 4 years ago

(Quoting the out-of-memory report and CUDA out-of-memory traceback from the earlier comment above.)

I have the same problem. Have you solved it?

doudoulaile commented 4 years ago

I have solved this problem (RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 11.91 GiB total capacity; 9.54 GiB already allocated; 250.75 MiB free; 233.48 MiB cached)) by reducing the batch_size in the dist script.