nv-tlabs / LION

Latent Point Diffusion Models for 3D Shape Generation

assert(context.shape[1] == self.num_points*self.context_dim) shapes don't match #51

Open kg571852741 opened 10 months ago

kg571852741 commented 10 months ago

Hi @ZENGXH, thanks for your hard work. I am testing a custom dataset of shape (1076, 200000, 3), i.e. point clouds with 200k points each. I've adjusted a few lines in pointflow_datasets.py, but the final shapes don't match in models/latent_points_ada.py. Any way to solve this, or suggestions?

    context.shape[1] 40000
    context.shape torch.Size([1, 40000])
    self.num_points*self.context_dim 400000
    self.num_points 100000
    self.context_dim 4
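For context, the assertion compares the flattened latent width against `num_points * context_dim`; plugging in the printed values shows the mismatch. Below is a minimal arithmetic sketch using the numbers from the debug output (variable names are illustrative, not the repo's actual attributes):

```python
# Minimal arithmetic sketch of the failing check (values taken from the
# printed debug output; names here are illustrative, not the repo's code).

num_points = 100_000   # shapelatent.decoder_num_points
context_dim = 4        # the printed self.context_dim

# The context tensor only carries 10_000 encoded points per shape,
# so its flattened width is 10_000 * 4 = 40_000:
encoded_points = 10_000
context_width = encoded_points * context_dim

print(context_width)             # 40000, matches context.shape[1]
print(num_points * context_dim)  # 400000, what the decoder expects
# assert context_width == num_points * context_dim  -> would fail, as reported
```

So the decoder was built for 100k points per shape, but the latent it receives comes from only 10k points.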
        # TODO: why do we need this??
        # self.train_points = self.all_points[:, :min(
        #     10000, self.all_points.shape[1])]  # subsample 15k points to 10k points per shape
        self.train_points = self.all_points[:, :min(
            200000, self.all_points.shape[1])]  # keep up to 200k points per shape
        self.tr_sample_size = min(10000, tr_sample_size)  # note: caps the sample size at 10k
        self.te_sample_size = min(5000, te_sample_size)

and my train_vae.sh settings:

        shapelatent.decoder_num_points 100000 \
        data.tr_max_sample_points 100000 data.te_max_sample_points 100000 \
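One thing worth checking: the hard-coded clamp `min(10000, tr_sample_size)` in the snippet above silently caps the per-shape sample count at 10k, so even with `data.tr_max_sample_points 100000` on the command line, the decoder (built for `decoder_num_points = 100000`) would receive 10k-point latents. A small sketch of that interaction (an assumption about the cause based on the pasted snippet, not a verified fix):

```python
# Sketch of how the hard-coded clamp can override the config value
# (assumed from the pasted snippet; the helper name is illustrative).

def effective_sample_size(tr_sample_size: int, cap: int = 10_000) -> int:
    # mirrors: self.tr_sample_size = min(10000, tr_sample_size)
    return min(cap, tr_sample_size)

requested = 100_000  # data.tr_max_sample_points from the training script
print(effective_sample_size(requested))            # 10000 -> mismatch with decoder
print(effective_sample_size(requested, 200_000))   # 100000 -> cap raised to match
```

If this is the cause, raising (or removing) the cap so it agrees with `shapelatent.decoder_num_points` should make the shapes line up.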


2023-08-24 22:37:03.789 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-24 22:37:03.790 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-24 22:37:03.793 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-24 22:37:03.801 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-24 22:37:03.802 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-24 22:37:03.803 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-24 22:37:03.871 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:03.872 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-24 22:37:03.872 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-24 22:37:03.923 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:05.245 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-24 22:37:05.245 | INFO     | trainers.base_trainer:build_other_module:722 - no other module to build
2023-08-24 22:37:05.245 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-24 22:37:05.691 | INFO     | datasets.pointflow_datasets:get_datasets:393 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: data/data_t_npy/
2023-08-24 22:37:05.691 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/data__npy/; norm global=True, norm-box=False
2023-08-24 22:37:05.692 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1076] under: data/data__npy/house/train 
2023-08-24 22:37:06.622 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.9s | dir: ['house'] | sample_with_replacement: 1; num points: 1076
2023-08-24 22:37:10.636 | INFO     | datasets.pointflow_datasets:__init__:270 - [DATA] normalize_global: mean=[-0.00717235 -0.04303095 -0.00708372], std=[0.20540998]
2023-08-24 22:37:14.391 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(1076, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.644, min=-2.400; num-pts=100000
searching: pointflow, get: data/data__npy/
2023-08-24 22:37:14.441 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/data__npy/; norm global=True, norm-box=False
2023-08-24 22:37:14.443 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/data__npy/house/val 
2023-08-24 22:37:14.560 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-24 22:37:14.905 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.469, min=-2.400; num-pts=100000
2023-08-24 22:37:14.918 | INFO     | datasets.pointflow_datasets:get_data_loaders:462 - [Batch Size] train=1, test=10; drop-last=1
2023-08-24 22:37:14.920 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-24 22:37:15.186 | INFO     | trainers.base_trainer:prepare_vis_data:682 - [prepare_vis_data] len of train_loader: 1076
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f2b5e36ae80>
tr_x[-1].shape:  torch.Size([1, 100000, 3])
2023-08-24 22:37:15.456 | INFO     | trainers.base_trainer:prepare_vis_data:701 - tr_x: torch.Size([16, 100000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 100000, 3])
2023-08-24 22:37:15.482 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-24 22:37:15.483 | INFO     | trainers.base_trainer:set_writer:57 - 
----------

----------
2023-08-24 22:37:15.487 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-24 22:37:15.488 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1076 | log freq=1076, viz freq 430400, val freq 200 
> /home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py(370)vis_recont()
-> x_list.append(v[b])
(Pdb) ^C--KeyboardInterrupt--
(Pdb) q
2023-08-24 22:37:40.372 | ERROR    | utils.utils:init_processes:1158 - An error has been caught in function 'init_processes', process 'MainProcess' (2820942), thread 'MainThread' (139833154426688):
Traceback (most recent call last):

  File "train_dist.py", line 251, in <module>
    utils.init_processes(0, size, main, args, config)
    │     │                 │     │     │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │     │                 │     │     └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
    │     │                 │     └ <function main at 0x7f2d64c749d0>
    │     │                 └ 1
    │     └ <function init_processes at 0x7f2d64c6bc10>
    └ <module 'utils.utils' from '/home/bim-group/Documents/GitHub/LION/utils/utils.py'>

> File "/home/bim-group/Documents/GitHub/LION/utils/utils.py", line 1158, in init_processes
    fn(args, config)
    │  │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │  └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
    └ <function main at 0x7f2d64c749d0>

  File "train_dist.py", line 86, in main
    trainer.train_epochs()
    │       └ <function BaseTrainer.train_epochs at 0x7f2baf6ba670>
    └ <trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 242, in train_epochs
    self.vis_recont(logs_info, writer, step)
    │    │          │          │       └ 0
    │    │          │          └ <utils.utils.Writer object at 0x7f2bacb39be0>
    │    │          └ {'hist/global_var': tensor([[4.1580e-02, 5.3833e-01, 7.4051e-01, 1.5042e+00, 8.3240e+00, 1.5077e-01,
    │    │                     3.8869e-02, 3.8...
    │    └ <function BaseTrainer.vis_recont at 0x7f2baf6ba8b0>
    └ <trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>

  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
           │     │       └ {}
           │     └ (<trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>, {'hist/global_var': tensor([[4.1580e-02, 5.3833e-01, 7.4051e-01, 1...
           └ <function BaseTrainer.vis_recont at 0x7f2baf6ba820>

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...

  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
           │    │             └ <frame at 0x5561ae20aaa0, file '/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py', line 370, code vis_recont>
           │    └ <function Bdb.dispatch_line at 0x7f2d69d9e550>
           └ <pdb.Pdb object at 0x7f2b5e30d370>
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
       │    │               └ <class 'bdb.BdbQuit'>
       │    └ True
       └ <pdb.Pdb object at 0x7f2b5e30d370>

bdb.BdbQuit
COMET INFO: Uploading metrics, params, and assets to Comet before program termination (may take several seconds)
COMET INFO: The Python SDK has 3600 seconds to finish before aborting...
COMET INFO: Uploading 1 metrics, params and output messages
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/batch_utils.py", line 347, in accept
    return self._accept(callback)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/batch_utils.py", line 384, in _accept
    callback(list_to_sent)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/comet.py", line 511, in _send_stdout_messages_batch
    self._process_rest_api_send(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/comet.py", line 591, in _process_rest_api_send
    sender(**kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 3231, in send_stdout_batch
    self.post_from_endpoint(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2031, in post_from_endpoint
    return self._result_from_http_method(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2053, in _result_from_http_method
    return method(url, payload, **kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2134, in post
    return super(RestApiClient, self).post(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 1988, in post
    response = self.low_level_api_client.post(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 536, in post
    return self.do(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 639, in do
    response = session.request(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt
(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ ^C
(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ bash script/train_vae_bnet.sh 1
+ DATA=' ddpm.input_dim 3 data.cates house '
+ NGPU=1
+ num_node=1
+ BS=1
++ echo 'scale=2; 1/10'
++ bc
+ OPT_GRAD_CLIP=.10
+ total_bs=1
+ ((  1 > 128  ))
+ ENT='python train_dist.py --num_process_per_node 1 '
+ kl=0.5
+ lr=1e-3
+ latent=1
+ skip_weight=0.01
+ sigma_offset=6.0
+ loss=l1_sum
+ python train_dist.py --num_process_per_node 1 ddpm.num_steps 1 ddpm.ema 0 trainer.opt.vae_lr_warmup_epochs 0 trainer.opt.grad_clip .10 latent_pts.ada_mlp_init_scale 0.1 sde.kl_const_coeff_vada 1e-7 trainer.anneal_kl 1 sde.kl_max_coeff_vada 0.5 sde.kl_anneal_portion_vada 0.5 shapelatent.log_sigma_offset 6.0 latent_pts.skip_weight 0.01 trainer.opt.beta2 0.99 data.num_workers 4 ddpm.loss_weight_emd 1.0 trainer.epochs 8000 data.random_subsample 1 viz.viz_freq -400 viz.log_freq -1 viz.val_freq 200 data.batch_size 1 viz.save_freq 2000 trainer.type trainers.hvae_trainer model_config default shapelatent.model models.vae_adain shapelatent.decoder_type models.latent_points_ada.LatentPointDecPVC shapelatent.encoder_type models.latent_points_ada.PointTransPVC latent_pts.style_encoder models.shapelatent_modules.PointNetPlusEncoder shapelatent.prior_type normal shapelatent.latent_dim 1 trainer.opt.lr 1e-3 shapelatent.kl_weight 0.5 shapelatent.decoder_num_points 100000 data.tr_max_sample_points 100000 data.te_max_sample_points 100000 ddpm.loss_type l1_sum cmt lion ddpm.input_dim 3 data.cates house viz.viz_order '[2,0,1]' data.recenter_per_shape False data.normalize_global True
utils/utils.py: USE_COMET=1, USE_WB=0
2023-08-24 22:37:47.706 | INFO     | __main__:get_args:209 - EXP_ROOT: ../exp + exp name: 0824/house/21dd03h_hvae_lion_B1N100000, save dir: ../exp/0824/house/21dd03h_hvae_lion_B1N100000
2023-08-24 22:37:47.713 | INFO     | __main__:get_args:214 - save config at ../exp/0824/house/21dd03h_hvae_lion_B1N100000/cfg.yml
2023-08-24 22:37:47.713 | INFO     | __main__:get_args:217 - log dir: ../exp/0824/house/21dd03h_hvae_lion_B1N100000
2023-08-24 22:37:47.713 | INFO     | utils.utils:init_processes:1133 - set MASTER_PORT: 127.0.0.1, MASTER_PORT: 6020
2023-08-24 22:37:47.713 | INFO     | utils.utils:init_processes:1154 - init_process: rank=0, world_size=1
2023-08-24 22:37:47.737 | INFO     | __main__:main:29 - use trainer: trainers.hvae_trainer
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/emd_ext/build.ninja...
Building extension module emd_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module emd_ext...
load emd_ext time: 0.118s
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/_pvcnn_backend/build.ninja...
Building extension module _pvcnn_backend...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module _pvcnn_backend...
2023-08-24 22:37:49.185 | INFO     | utils.utils:common_init:467 - [common-init] at rank=0, seed=1

2023-08-24 22:37:55.498 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-24 22:37:55.498 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-24 22:37:55.501 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-24 22:37:55.505 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-24 22:37:55.505 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-24 22:37:55.506 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-24 22:37:55.557 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:55.557 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-24 22:37:55.558 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-24 22:37:55.609 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:56.937 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-24 22:37:56.937 | INFO     | trainers.base_trainer:build_other_module:722 - no other module to build
2023-08-24 22:37:56.937 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-24 22:37:57.507 | INFO     | datasets.pointflow_datasets:get_datasets:393 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: data/transform_buildingnet_npy/
2023-08-24 22:37:57.507 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/transform_buildingnet_npy/; norm global=True, norm-box=False
2023-08-24 22:37:57.509 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1076] under: data/transform_buildingnet_npy/house/train 
2023-08-24 22:37:58.454 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.9s | dir: ['house'] | sample_with_replacement: 1; num points: 1076
2023-08-24 22:38:02.066 | INFO     | datasets.pointflow_datasets:__init__:270 - [DATA] normalize_global: mean=[-0.00717235 -0.04303095 -0.00708372], std=[0.20540998]
2023-08-24 22:38:04.353 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(1076, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.644, min=-2.400; num-pts=100000
searching: pointflow, get: data/transform_buildingnet_npy/
2023-08-24 22:38:04.396 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/transform_buildingnet_npy/; norm global=True, norm-box=False
2023-08-24 22:38:04.398 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/transform_buildingnet_npy/house/val 
2023-08-24 22:38:04.514 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-24 22:38:04.855 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.469, min=-2.400; num-pts=100000
2023-08-24 22:38:04.863 | INFO     | datasets.pointflow_datasets:get_data_loaders:462 - [Batch Size] train=1, test=10; drop-last=1
2023-08-24 22:38:04.865 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-24 22:38:05.123 | INFO     | trainers.base_trainer:prepare_vis_data:682 - [prepare_vis_data] len of train_loader: 1076
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f5f3a86d880>
tr_x[-1].shape:  torch.Size([1, 10000, 3])
2023-08-24 22:38:05.383 | INFO     | trainers.base_trainer:prepare_vis_data:701 - tr_x: torch.Size([16, 10000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 10000, 3])
2023-08-24 22:38:05.396 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-24 22:38:05.397 | INFO     | trainers.base_trainer:set_writer:57 - 
----------
[url]: https://www.comet.com/kg571852741/general/53e826d2f0544ecca7b21d35cc10c1f0
../exp/0824/house/21dd03h_hvae_lion_B1N100000
----------
2023-08-24 22:38:05.398 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-24 22:38:05.399 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1076 | log freq=1076, viz freq 430400, val freq 200 
context.shape[1] 40000
context.shape torch.Size([1, 40000])
self.num_points*self.context_dim 400000
self.num_points 100000
self.context_dim 4
> /home/bim-group/Documents/GitHub/LION/models/latent_points_ada.py(279)forward()
-> assert(context.shape[1] == self.num_points*self.context_dim)
(Pdb) 
```2023-08-24 22:37:03.789 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-24 22:37:03.790 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-24 22:37:03.793 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-24 22:37:03.801 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-24 22:37:03.802 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-24 22:37:03.803 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-24 22:37:03.871 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:03.872 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-24 22:37:03.872 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-24 22:37:03.923 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:05.245 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-24 22:37:05.245 | INFO     | trainers.base_trainer:build_other_module:722 - no other module to build
2023-08-24 22:37:05.245 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-24 22:37:05.691 | INFO     | datasets.pointflow_datasets:get_datasets:393 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: data
2023-08-24 22:37:05.691 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/_npy/; norm global=True, norm-box=False
2023-08-24 22:37:05.692 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1076] under: data/_npy/house/train 
2023-08-24 22:37:06.622 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.9s | dir: ['house'] | sample_with_replacement: 1; num points: 1076
2023-08-24 22:37:10.636 | INFO     | datasets.pointflow_datasets:__init__:270 - [DATA] normalize_global: mean=[-0.00717235 -0.04303095 -0.00708372], std=[0.20540998]
2023-08-24 22:37:14.391 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(1076, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.644, min=-2.400; num-pts=100000
searching: pointflow, get: data/npy/
2023-08-24 22:37:14.441 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/t_npy/; norm global=True, norm-box=False
2023-08-24 22:37:14.443 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/_npy/house/val 
2023-08-24 22:37:14.560 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-24 22:37:14.905 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.469, min=-2.400; num-pts=100000
2023-08-24 22:37:14.918 | INFO     | datasets.pointflow_datasets:get_data_loaders:462 - [Batch Size] train=1, test=10; drop-last=1
2023-08-24 22:37:14.920 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-24 22:37:15.186 | INFO     | trainers.base_trainer:prepare_vis_data:682 - [prepare_vis_data] len of train_loader: 1076
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f2b5e36ae80>
tr_x[-1].shape:  torch.Size([1, 100000, 3])
2023-08-24 22:37:15.456 | INFO     | trainers.base_trainer:prepare_vis_data:701 - tr_x: torch.Size([16, 100000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 100000, 3])
2023-08-24 22:37:15.482 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-24 22:37:15.483 | INFO     | trainers.base_trainer:set_writer:57 - 
----------
[url]: https://www.comet.com/kg571852741/general/75ce6d1e28c3496c9b264a8567167fcc
../exp/0824/house/21dd03h_hvae_lion_B1N100000
----------
2023-08-24 22:37:15.487 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-24 22:37:15.488 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1076 | log freq=1076, viz freq 430400, val freq 200 
> /home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py(370)vis_recont()
-> x_list.append(v[b])
(Pdb) ^C--KeyboardInterrupt--
(Pdb) q
2023-08-24 22:37:40.372 | ERROR    | utils.utils:init_processes:1158 - An error has been caught in function 'init_processes', process 'MainProcess' (2820942), thread 'MainThread' (139833154426688):
Traceback (most recent call last):

  File "train_dist.py", line 251, in <module>
    utils.init_processes(0, size, main, args, config)
    │     │                 │     │     │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │     │                 │     │     └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
    │     │                 │     └ <function main at 0x7f2d64c749d0>
    │     │                 └ 1
    │     └ <function init_processes at 0x7f2d64c6bc10>
    └ <module 'utils.utils' from '/home/bim-group/Documents/GitHub/LION/utils/utils.py'>

> File "/home/bim-group/Documents/GitHub/LION/utils/utils.py", line 1158, in init_processes
    fn(args, config)
    │  │     └ CfgNode({'dpm_ckpt': '', 'clipforge': CfgNode({'clip_model': 'ViT-B/32', 'enable': 0, 'feat_dim': 512}), 'eval_trainnll': 0, ...
    │  └ Namespace(autocast_eval=True, autocast_train=True, config='none', data='/tmp/nvae-diff/data', dataset='cifar10', distributed=...
    └ <function main at 0x7f2d64c749d0>

  File "train_dist.py", line 86, in main
    trainer.train_epochs()
    │       └ <function BaseTrainer.train_epochs at 0x7f2baf6ba670>
    └ <trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 242, in train_epochs
    self.vis_recont(logs_info, writer, step)
    │    │          │          │       └ 0
    │    │          │          └ <utils.utils.Writer object at 0x7f2bacb39be0>
    │    │          └ {'hist/global_var': tensor([[4.1580e-02, 5.3833e-01, 7.4051e-01, 1.5042e+00, 8.3240e+00, 1.5077e-01,
    │    │                     3.8869e-02, 3.8...
    │    └ <function BaseTrainer.vis_recont at 0x7f2baf6ba8b0>
    └ <trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>

  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
           │     │       └ {}
           │     └ (<trainers.hvae_trainer.Trainer object at 0x7f2bacafe310>, {'hist/global_var': tensor([[4.1580e-02, 5.3833e-01, 7.4051e-01, 1...
           └ <function BaseTrainer.vis_recont at 0x7f2baf6ba820>

  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...


  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
           │    │             └ <frame at 0x5561ae20aaa0, file '/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py', line 370, code vis_recont>
           │    └ <function Bdb.dispatch_line at 0x7f2d69d9e550>
           └ <pdb.Pdb object at 0x7f2b5e30d370>
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
       │    │               └ <class 'bdb.BdbQuit'>
       │    └ True
       └ <pdb.Pdb object at 0x7f2b5e30d370>

bdb.BdbQuit
COMET INFO: Uploading metrics, params, and assets to Comet before program termination (may take several seconds)
COMET INFO: The Python SDK has 3600 seconds to finish before aborting...
COMET INFO: Uploading 1 metrics, params and output messages
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/batch_utils.py", line 347, in accept
    return self._accept(callback)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/batch_utils.py", line 384, in _accept
    callback(list_to_sent)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/comet.py", line 511, in _send_stdout_messages_batch
    self._process_rest_api_send(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/comet.py", line 591, in _process_rest_api_send
    sender(**kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 3231, in send_stdout_batch
    self.post_from_endpoint(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2031, in post_from_endpoint
    return self._result_from_http_method(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2053, in _result_from_http_method
    return method(url, payload, **kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 2134, in post
    return super(RestApiClient, self).post(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 1988, in post
    response = self.low_level_api_client.post(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 536, in post
    return self.do(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/comet_ml/connection.py", line 639, in do
    response = session.request(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 1348, in getresponse
    response.begin()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/home/bim-group/anaconda3/envs/lion_env/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt
(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ ^C
(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ bash script/train_vae_bnet.sh 1
+ DATA=' ddpm.input_dim 3 data.cates house '
+ NGPU=1
+ num_node=1
+ BS=1
++ echo 'scale=2; 1/10'
++ bc
+ OPT_GRAD_CLIP=.10
+ total_bs=1
+ ((  1 > 128  ))
+ ENT='python train_dist.py --num_process_per_node 1 '
+ kl=0.5
+ lr=1e-3
+ latent=1
+ skip_weight=0.01
+ sigma_offset=6.0
+ loss=l1_sum
+ python train_dist.py --num_process_per_node 1 ddpm.num_steps 1 ddpm.ema 0 trainer.opt.vae_lr_warmup_epochs 0 trainer.opt.grad_clip .10 latent_pts.ada_mlp_init_scale 0.1 sde.kl_const_coeff_vada 1e-7 trainer.anneal_kl 1 sde.kl_max_coeff_vada 0.5 sde.kl_anneal_portion_vada 0.5 shapelatent.log_sigma_offset 6.0 latent_pts.skip_weight 0.01 trainer.opt.beta2 0.99 data.num_workers 4 ddpm.loss_weight_emd 1.0 trainer.epochs 8000 data.random_subsample 1 viz.viz_freq -400 viz.log_freq -1 viz.val_freq 200 data.batch_size 1 viz.save_freq 2000 trainer.type trainers.hvae_trainer model_config default shapelatent.model models.vae_adain shapelatent.decoder_type models.latent_points_ada.LatentPointDecPVC shapelatent.encoder_type models.latent_points_ada.PointTransPVC latent_pts.style_encoder models.shapelatent_modules.PointNetPlusEncoder shapelatent.prior_type normal shapelatent.latent_dim 1 trainer.opt.lr 1e-3 shapelatent.kl_weight 0.5 shapelatent.decoder_num_points 100000 data.tr_max_sample_points 100000 data.te_max_sample_points 100000 ddpm.loss_type l1_sum cmt lion ddpm.input_dim 3 data.cates house viz.viz_order '[2,0,1]' data.recenter_per_shape False data.normalize_global True
utils/utils.py: USE_COMET=1, USE_WB=0
2023-08-24 22:37:47.706 | INFO     | __main__:get_args:209 - EXP_ROOT: ../exp + exp name: 0824/house/21dd03h_hvae_lion_B1N100000, save dir: ../exp/0824/house/21dd03h_hvae_lion_B1N100000
2023-08-24 22:37:47.713 | INFO     | __main__:get_args:214 - save config at ../exp/0824/house/21dd03h_hvae_lion_B1N100000/cfg.yml
2023-08-24 22:37:47.713 | INFO     | __main__:get_args:217 - log dir: ../exp/0824/house/21dd03h_hvae_lion_B1N100000
2023-08-24 22:37:47.713 | INFO     | utils.utils:init_processes:1133 - set MASTER_PORT: 127.0.0.1, MASTER_PORT: 6020
2023-08-24 22:37:47.713 | INFO     | utils.utils:init_processes:1154 - init_process: rank=0, world_size=1
2023-08-24 22:37:47.737 | INFO     | __main__:main:29 - use trainer: trainers.hvae_trainer
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/emd_ext/build.ninja...
Building extension module emd_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module emd_ext...
load emd_ext time: 0.118s
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/_pvcnn_backend/build.ninja...
Building extension module _pvcnn_backend...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module _pvcnn_backend...
2023-08-24 22:37:49.185 | INFO     | utils.utils:common_init:467 - [common-init] at rank=0, seed=1
COMET INFO: Experiment is live on comet.com https://www.comet.com/kg571852741/general/53e826d2f0544ecca7b21d35cc10c1f0

2023-08-24 22:37:55.498 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-24 22:37:55.498 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-24 22:37:55.501 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-24 22:37:55.505 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-24 22:37:55.505 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-24 22:37:55.506 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-24 22:37:55.557 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:55.557 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-24 22:37:55.558 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-24 22:37:55.609 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-24 22:37:56.937 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-24 22:37:56.937 | INFO     | trainers.base_trainer:build_other_module:722 - no other module to build
2023-08-24 22:37:56.937 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-24 22:37:57.507 | INFO     | datasets.pointflow_datasets:get_datasets:393 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: datanpy/
2023-08-24 22:37:57.507 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/_npy/; norm global=True, norm-box=False
2023-08-24 22:37:57.509 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1076] under: data/py/house/train 
2023-08-24 22:37:58.454 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.9s | dir: ['house'] | sample_with_replacement: 1; num points: 1076
2023-08-24 22:38:02.066 | INFO     | datasets.pointflow_datasets:__init__:270 - [DATA] normalize_global: mean=[-0.00717235 -0.04303095 -0.00708372], std=[0.20540998]
2023-08-24 22:38:04.353 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(1076, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.644, min=-2.400; num-pts=100000
searching: pointflow, get: data/npy/
2023-08-24 22:38:04.396 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/npy/; norm global=True, norm-box=False
2023-08-24 22:38:04.398 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/_npy/house/val 
2023-08-24 22:38:04.514 | INFO     | datasets.pointflow_datasets:__init__:204 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-24 22:38:04.855 | INFO     | datasets.pointflow_datasets:__init__:277 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.469, min=-2.400; num-pts=100000
2023-08-24 22:38:04.863 | INFO     | datasets.pointflow_datasets:get_data_loaders:462 - [Batch Size] train=1, test=10; drop-last=1
2023-08-24 22:38:04.865 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-24 22:38:05.123 | INFO     | trainers.base_trainer:prepare_vis_data:682 - [prepare_vis_data] len of train_loader: 1076
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f5f3a86d880>
tr_x[-1].shape:  torch.Size([1, 10000, 3])
2023-08-24 22:38:05.383 | INFO     | trainers.base_trainer:prepare_vis_data:701 - tr_x: torch.Size([16, 10000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 10000, 3])
2023-08-24 22:38:05.396 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-24 22:38:05.397 | INFO     | trainers.base_trainer:set_writer:57 - 
----------
[url]: https://www.comet.com/kg571852741/general/53e826d2f0544ecca7b21d35cc10c1f0
../exp/0824/house/21dd03h_hvae_lion_B1N100000
----------
2023-08-24 22:38:05.398 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0824/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-24 22:38:05.399 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1076 | log freq=1076, viz freq 430400, val freq 200 
context.shape[1] 40000
context.shape torch.Size([1, 40000])
self.num_points*self.context_dim 400000
self.num_points 100000
self.context_dim 4
> /home/bim-group/Documents/GitHub/LION/models/latent_points_ada.py(279)forward()
-> assert(context.shape[1] == self.num_points*self.context_dim)
(Pdb) 
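The mismatch in the log is just arithmetic: the dataloader caps sampling at 10k points, so the decoder context carries 10,000 × 4 = 40,000 values, while the decoder was built for 100,000 × 4 = 400,000. A standalone sketch of the numbers (not LION code):

```python
# Standalone reproduction of the failing shape check (not LION code):
# the decoder expects num_points * context_dim features per shape, but the
# capped dataloader only delivers `sampled` points.
num_points = 100_000    # shapelatent.decoder_num_points
context_dim = 4         # 3 xyz coords + 1 latent feature per point
sampled = 10_000        # points returned after the min(10000, ...) cap

context_features = sampled * context_dim    # 40_000, as in the log
expected = num_points * context_dim         # 400_000, as in the log
print(context_features == expected)         # False -> the AssertionError
```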
ZENGXH commented 10 months ago

Could you try changing these two lines

self.tr_sample_size = min(10000, tr_sample_size) # 100k points per shape
self.te_sample_size = min(5000, te_sample_size) 

to

self.tr_sample_size = tr_sample_size
self.te_sample_size = te_sample_size

I guess it's because the original code caps the maximum number of points at 10k (which is unnecessary when using a non-PointFlow dataset); as a result, the input to the model is 10k points instead of 100k points.

btw, I have never tried generating 100k points before (we usually use 2048 points per shape). Just curious: are you able to fit this into GPU memory?

kg571852741 commented 10 months ago

@ZENGXH Thanks for the prompt reply! :+1:

These are the training screenshots from generating 2048 points (the default setting, but with a changed data root path). Since the decoder and latent points are set to 2048, the final results fail to capture the model's pattern.

step:0

recont-train (Step: 0)

step:134400

recont-train (Step: 134400)

Q: Does it fit into GPU memory?

A: I ran all my tests (2048 points at batch size 20, and 100k points at batch size 1) on a 2x 4090Ti setup (48GB memory), and no out-of-memory issues occurred. :D

After changing to

self.tr_sample_size = tr_sample_size
self.te_sample_size = te_sample_size

x_list returns [ ] at the breakpoint() set in trainers/base_trainer.py:

            for k, v in output.items():
                if 'vis/' in k:
                    if b < x_0_pred.size(0):
                        x_list.append(x_0_pred[b])
                        name_list.append('pred')
                        print("vis_recont: ", k, v.shape)
                        print("x_0_pred: ", x_0_pred.shape)
                        print("x_0: ", x_0.shape)
                        print("x_t: ", x_t.shape)
                        breakpoint()
                    x_list.append(v[b])
                    name_list.append(k)

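For reference, a defensive variant of the visualization loop quoted above (a sketch only, not the repo's actual fix; `collect_vis` is a hypothetical helper): guard every batch index before appending, so the loop degrades gracefully instead of raising when a tensor in `output` has a smaller batch dimension than expected.

```python
# Sketch: guard batch index b before indexing each tensor, so the
# visualization loop skips undersized tensors instead of crashing.
def collect_vis(output, x_0_pred, b):
    x_list, name_list = [], []
    for k, v in output.items():
        if 'vis/' not in k:
            continue
        if b < x_0_pred.size(0):
            x_list.append(x_0_pred[b])
            name_list.append('pred')
        if b < v.size(0):  # only index v when b is in range
            x_list.append(v[b])
            name_list.append(k)
    return x_list, name_list
```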
Before, with:

self.tr_sample_size = min(10000, tr_sample_size) # 100k points per shape
self.te_sample_size = min(5000, te_sample_size) 
  File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │      │      │ └ 0
    │      │      └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │      │                 [ 0.9940,  1.1579, -1.6293],
    │      │                 [ 0.7494, -0.5751, -1.3528],
    │      │                 .....
    │      └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
      ...

Log file after the change:

![first-10- (Step: 5599)](https://github.com/nv-tlabs/LION/assets/39424493/bdddcdaa-9de9-471d-b74b-d2cd22c54034)

(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ bash script/train_vae_bnet.sh 1
+ DATA=' ddpm.input_dim 3 data.cates house '
+ NGPU=1
+ num_node=1
+ BS=1
++ echo 'scale=2; 1/10'
++ bc
+ OPT_GRAD_CLIP=.10
+ total_bs=1
+ ((  1 > 128  ))
+ ENT='python train_dist.py --num_process_per_node 1 '
+ kl=0.5
+ lr=1e-3
+ latent=1
+ skip_weight=0.01
+ sigma_offset=6.0
+ loss=l1_sum
+ python train_dist.py --num_process_per_node 1 ddpm.num_steps 1 ddpm.ema 0 trainer.opt.vae_lr_warmup_epochs 0 trainer.opt.grad_clip .10 latent_pts.ada_mlp_init_scale 0.1 sde.kl_const_coeff_vada 1e-7 trainer.anneal_kl 1 sde.kl_max_coeff_vada 0.5 sde.kl_anneal_portion_vada 0.5 shapelatent.log_sigma_offset 6.0 latent_pts.skip_weight 0.01 trainer.opt.beta2 0.99 data.num_workers 4 ddpm.loss_weight_emd 1.0 trainer.epochs 8000 data.random_subsample 1 viz.viz_freq -400 viz.log_freq -1 viz.val_freq 200 data.batch_size 1 viz.save_freq 2000 trainer.type trainers.hvae_trainer model_config default shapelatent.model models.vae_adain shapelatent.decoder_type models.latent_points_ada.LatentPointDecPVC shapelatent.encoder_type models.latent_points_ada.PointTransPVC latent_pts.style_encoder models.shapelatent_modules.PointNetPlusEncoder shapelatent.prior_type normal shapelatent.latent_dim 1 trainer.opt.lr 1e-3 shapelatent.kl_weight 0.5 shapelatent.decoder_num_points 100000 data.tr_max_sample_points 100000 data.te_max_sample_points 100000 ddpm.loss_type l1_sum cmt lion ddpm.input_dim 3 data.cates house viz.viz_order '[2,0,1]' data.recenter_per_shape False data.normalize_global True
utils/utils.py: USE_COMET=1, USE_WB=0
2023-08-25 05:10:49.602 | INFO     | __main__:get_args:209 - EXP_ROOT: ../exp + exp name: 0825/house/21dd03h_hvae_lion_B1N100000, save dir: ../exp/0825/house/21dd03h_hvae_lion_B1N100000
2023-08-25 05:10:49.609 | INFO     | __main__:get_args:214 - save config at ../exp/0825/house/21dd03h_hvae_lion_B1N100000/cfg.yml
2023-08-25 05:10:49.609 | INFO     | __main__:get_args:217 - log dir: ../exp/0825/house/21dd03h_hvae_lion_B1N100000
2023-08-25 05:10:49.609 | INFO     | utils.utils:init_processes:1133 - set MASTER_PORT: 127.0.0.1, MASTER_PORT: 6020
2023-08-25 05:10:49.609 | INFO     | utils.utils:init_processes:1154 - init_process: rank=0, world_size=1
2023-08-25 05:10:49.632 | INFO     | __main__:main:29 - use trainer: trainers.hvae_trainer
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/emd_ext/build.ninja...
Building extension module emd_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module emd_ext...
load emd_ext time: 0.111s
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/_pvcnn_backend/build.ninja...
Building extension module _pvcnn_backend...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module _pvcnn_backend...
2023-08-25 05:10:50.861 | INFO     | utils.utils:common_init:467 - [common-init] at rank=0, seed=1

2023-08-25 05:10:56.498 | INFO     | utils.utils:__init__:332 - Not init TFB
2023-08-25 05:10:56.498 | INFO     | utils.utils:common_init:511 - [common-init] DONE
2023-08-25 05:10:56.500 | INFO     | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-25 05:10:56.505 | INFO     | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-25 05:10:56.505 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-25 05:10:56.506 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-25 05:10:56.558 | INFO     | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-25 05:10:56.558 | INFO     | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-25 05:10:56.559 | INFO     | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-25 05:10:56.611 | INFO     | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-25 05:10:57.821 | INFO     | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-25 05:10:57.821 | INFO     | trainers.base_trainer:build_other_module:725 - no other module to build
2023-08-25 05:10:57.821 | INFO     | trainers.base_trainer:build_data:152 - start build_data
2023-08-25 05:10:58.309 | INFO     | datasets.pointflow_datasets:get_datasets:400 - get_datasets: tr_sample_size=100000,  te_sample_size=100000;  random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: data/transform_data/
2023-08-25 05:10:58.309 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/transform_data/; norm global=True, norm-box=False
2023-08-25 05:10:58.311 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1146] under: data/transform_data/house/train 
2023-08-25 05:10:59.056 | INFO     | datasets.pointflow_datasets:__init__:206 - [DATA] Load data time: 0.7s | dir: ['house'] | sample_with_replacement: 1; num points: 1146
2023-08-25 05:11:01.133 | INFO     | datasets.pointflow_datasets:__init__:272 - [DATA] normalize_global: mean=[-0.00376302 -0.07752005 -0.00340251], std=[0.2103262]
2023-08-25 05:11:02.244 | INFO     | datasets.pointflow_datasets:__init__:279 - [DATA] shape=(1146, 100000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.746, min=-2.361; num-pts=100000
searching: pointflow, get: data/transform_data/
2023-08-25 05:11:02.306 | INFO     | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/transform_data/; norm global=True, norm-box=False
2023-08-25 05:11:02.307 | INFO     | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/transform_data/house/val 
2023-08-25 05:11:02.423 | INFO     | datasets.pointflow_datasets:__init__:206 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-25 05:11:02.763 | INFO     | datasets.pointflow_datasets:__init__:279 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.454, min=-2.361; num-pts=100000
2023-08-25 05:11:02.772 | INFO     | datasets.pointflow_datasets:get_data_loaders:469 - [Batch Size] train=1, test=10; drop-last=1
2023-08-25 05:11:02.775 | INFO     | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-25 05:11:03.076 | INFO     | trainers.base_trainer:prepare_vis_data:685 - [prepare_vis_data] len of train_loader: 1146
train_loader:  <torch.utils.data.dataloader.DataLoader object at 0x7f6ea9aff430>
tr_x[-1].shape:  torch.Size([1, 100000, 3])
2023-08-25 05:11:03.385 | INFO     | trainers.base_trainer:prepare_vis_data:704 - tr_x: torch.Size([16, 100000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 100000, 3])
2023-08-25 05:11:03.399 | INFO     | __main__:main:47 - param size = 22.402731M 
2023-08-25 05:11:03.400 | INFO     | trainers.base_trainer:set_writer:57 - 
----------

----------
2023-08-25 05:11:03.403 | INFO     | __main__:main:70 - not find any checkpoint: ../exp/0825/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0825/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-25 05:11:03.403 | INFO     | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1146 | log freq=1146, viz freq 458400, val freq 200 
context.shape forward( torch.Size([1, 400000])
context.shape[1] forward( 400000
vis_recont:  vis/latent_pts torch.Size([1, 100000, 3])
x_0_pred:  torch.Size([1, 100000, 3])
x_0:  torch.Size([1, 100000, 3])
> /home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py(373)vis_recont()
-> x_list.append(v[b])
(Pdb) p b.shape
*** AttributeError: 'int' object has no attribute 'shape'
(Pdb) p b 
0
aldinorizaldy commented 2 weeks ago

Hi @kg571852741, I see your dataset is some sort of outdoor scene. Can you elaborate on how you prepare your custom dataset?

Thanks in advance.

kg571852741 commented 1 week ago

Hi @aldinorizaldy, sorry for the late reply. This work was done a very long time ago and I really cannot remember the exact settings, but I believe I followed the same organized data-folder structure as the 'cifar10' dataset?
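For what it's worth, here is a minimal sketch of how per-shape `.npy` files could be laid out to match what the loader logs in this thread report (one `(N, 3)` array per shape under `root/category/split/`); the `save_split` helper and the exact filenames are my assumption, not code from the repo:

```python
import os
import numpy as np

def save_split(root, category, split, clouds):
    """Save one (N, 3) float32 array per shape under root/category/split/.
    `clouds` is any iterable of per-shape point clouds (hypothetical helper)."""
    out_dir = os.path.join(root, category, split)
    os.makedirs(out_dir, exist_ok=True)
    for i, pts in enumerate(clouds):
        np.save(os.path.join(out_dir, f"{i:06d}.npy"),
                np.asarray(pts, dtype=np.float32))

# e.g.: save_split("data/transform_data", "house", "train", my_clouds)
```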