kg571852741 opened this issue 10 months ago
Could you try changing these two lines

```python
self.tr_sample_size = min(10000, tr_sample_size)  # 100k points per shape
self.te_sample_size = min(5000, te_sample_size)
```

to

```python
self.tr_sample_size = tr_sample_size
self.te_sample_size = te_sample_size
```

I guess it's because the original code caps the maximum number of points at 10k (which is not necessary when using a non-PointFlow dataset): as a result, the input to the model is 10k points instead of 100k points.
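A minimal sketch of the cap's effect (the values here are illustrative, not the repo's code):

```python
# Illustrative only: the min(...) cap silently truncates large sample sizes.
tr_requested = 100_000              # what the config asks for
capped = min(10_000, tr_requested)  # original behavior: model sees 10k points
uncapped = tr_requested             # after the change: full 100k points
print(capped, uncapped)  # 10000 100000
```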
By the way, I have never tried generating 100k points before (usually we use 2048 points per shape). Just curious: are you able to fit that into GPU memory?
@ZENGXH Thanks for the prompt reply! :+1:

These are the training screenshots for generating 2048 points (the default setting, but with a changed data root path). Since the decoder and latent points are set to 2048, the final results fail to capture the model's pattern.

A: I ran all my tests (2048 points at batch size 20, and 100k points at batch size 1) on a setup with 2 4090Ti GPUs (48GB memory), and no out-of-memory issues were found. :D
After the change:

```python
self.tr_sample_size = tr_sample_size
self.te_sample_size = te_sample_size
```

`x_list` returns `[]` at the `breakpoint()` I set in `trainers/base_trainer.py`:
```python
for k, v in output.items():
    if 'vis/' in k:
        if b < x_0_pred.size(0):
            x_list.append(x_0_pred[b])
            name_list.append('pred')
        print("vis_recont: ", k, v.shape)
        print("x_0_pred: ", x_0_pred.shape)
        print("x_0: ", x_0.shape)
        print("x_t: ", x_t.shape)
        breakpoint()
        x_list.append(v[b])
        name_list.append(k)
```
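A reduced, hypothetical version of that collection loop (plain sequences instead of tensors; `collect_vis_items` is my name, not the repo's) shows why `x_list` can end up holding only the prediction when no key contains 'vis/':

```python
def collect_vis_items(output, x_0_pred, b=0):
    """Gather per-sample entries for visualization, mirroring the loop above."""
    x_list, name_list = [], []
    for k, v in output.items():
        if 'vis/' in k and b < len(v):  # only keys tagged for visualization
            x_list.append(v[b])
            name_list.append(k)
    if b < len(x_0_pred):               # the model prediction is always added
        x_list.append(x_0_pred[b])
        name_list.append('pred')
    return x_list, name_list

# With no 'vis/' keys in output, only the prediction is collected:
xs, names = collect_vis_items({'loss': [0.1]}, x_0_pred=[[0.7, -0.67, -1.93]])
print(names)  # ['pred']
```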
Before, with:

```python
self.tr_sample_size = min(10000, tr_sample_size)  # 100k points per shape
self.te_sample_size = min(5000, te_sample_size)
```
```
File "/home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py", line 370, in vis_recont
    x_list.append(v[b])
    │ │ │ └ 0
    │ │ └ tensor([[[ 0.7022, -0.6714, -1.9273],
    │ │           [ 0.9940,  1.1579, -1.6293],
    │ │           [ 0.7494, -0.5751, -1.3528],
    │ │           .....
    │ └ <method 'append' of 'list' objects>
    └ [tensor([[ 0.7024, -0.6675, -1.9238],
              [ 0.9833,  1.1651, -1.6223],
              [ 0.7482, -0.5665, -1.3496],
              ...,
    ...
```
Log file after the change:

![first-10- (Step: 5599)](https://github.com/nv-tlabs/LION/assets/39424493/bdddcdaa-9de9-471d-b74b-d2cd22c54034)

(lion_env) bim-group@bimgroup-MS-7D70:~/Documents/GitHub/LION$ bash script/train_vae_bnet.sh 1
+ DATA=' ddpm.input_dim 3 data.cates house '
+ NGPU=1
+ num_node=1
+ BS=1
++ echo 'scale=2; 1/10'
++ bc
+ OPT_GRAD_CLIP=.10
+ total_bs=1
+ (( 1 > 128 ))
+ ENT='python train_dist.py --num_process_per_node 1 '
+ kl=0.5
+ lr=1e-3
+ latent=1
+ skip_weight=0.01
+ sigma_offset=6.0
+ loss=l1_sum
+ python train_dist.py --num_process_per_node 1 ddpm.num_steps 1 ddpm.ema 0 trainer.opt.vae_lr_warmup_epochs 0 trainer.opt.grad_clip .10 latent_pts.ada_mlp_init_scale 0.1 sde.kl_const_coeff_vada 1e-7 trainer.anneal_kl 1 sde.kl_max_coeff_vada 0.5 sde.kl_anneal_portion_vada 0.5 shapelatent.log_sigma_offset 6.0 latent_pts.skip_weight 0.01 trainer.opt.beta2 0.99 data.num_workers 4 ddpm.loss_weight_emd 1.0 trainer.epochs 8000 data.random_subsample 1 viz.viz_freq -400 viz.log_freq -1 viz.val_freq 200 data.batch_size 1 viz.save_freq 2000 trainer.type trainers.hvae_trainer model_config default shapelatent.model models.vae_adain shapelatent.decoder_type models.latent_points_ada.LatentPointDecPVC shapelatent.encoder_type models.latent_points_ada.PointTransPVC latent_pts.style_encoder models.shapelatent_modules.PointNetPlusEncoder shapelatent.prior_type normal shapelatent.latent_dim 1 trainer.opt.lr 1e-3 shapelatent.kl_weight 0.5 shapelatent.decoder_num_points 100000 data.tr_max_sample_points 100000 data.te_max_sample_points 100000 ddpm.loss_type l1_sum cmt lion ddpm.input_dim 3 data.cates house viz.viz_order '[2,0,1]' data.recenter_per_shape False data.normalize_global True
utils/utils.py: USE_COMET=1, USE_WB=0
2023-08-25 05:10:49.602 | INFO | __main__:get_args:209 - EXP_ROOT: ../exp + exp name: 0825/house/21dd03h_hvae_lion_B1N100000, save dir: ../exp/0825/house/21dd03h_hvae_lion_B1N100000
2023-08-25 05:10:49.609 | INFO | __main__:get_args:214 - save config at ../exp/0825/house/21dd03h_hvae_lion_B1N100000/cfg.yml
2023-08-25 05:10:49.609 | INFO | __main__:get_args:217 - log dir: ../exp/0825/house/21dd03h_hvae_lion_B1N100000
2023-08-25 05:10:49.609 | INFO | utils.utils:init_processes:1133 - set MASTER_PORT: 127.0.0.1, MASTER_PORT: 6020
2023-08-25 05:10:49.609 | INFO | utils.utils:init_processes:1154 - init_process: rank=0, world_size=1
2023-08-25 05:10:49.632 | INFO | __main__:main:29 - use trainer: trainers.hvae_trainer
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/emd_ext/build.ninja...
Building extension module emd_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module emd_ext...
load emd_ext time: 0.111s
Using /home/bim-group/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/bim-group/.cache/torch_extensions/py38_cu111/_pvcnn_backend/build.ninja...
Building extension module _pvcnn_backend...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module _pvcnn_backend...
2023-08-25 05:10:50.861 | INFO | utils.utils:common_init:467 - [common-init] at rank=0, seed=1
2023-08-25 05:10:56.498 | INFO | utils.utils:__init__:332 - Not init TFB
2023-08-25 05:10:56.498 | INFO | utils.utils:common_init:511 - [common-init] DONE
2023-08-25 05:10:56.500 | INFO | utils.model_helper:import_model:106 - import: models.shapelatent_modules.PointNetPlusEncoder
2023-08-25 05:10:56.505 | INFO | models.shapelatent_modules:__init__:29 - [Encoder] zdim=128, out_sigma=True; force_att: 0
2023-08-25 05:10:56.505 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.PointTransPVC
2023-08-25 05:10:56.506 | INFO | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=0, input_dim=3
2023-08-25 05:10:56.558 | INFO | utils.model_helper:import_model:106 - import: models.latent_points_ada.LatentPointDecPVC
2023-08-25 05:10:56.558 | INFO | models.latent_points_ada:__init__:241 - [Build Dec] point_dim=3, context_dim=1
2023-08-25 05:10:56.559 | INFO | models.latent_points_ada:__init__:38 - [Build Unet] extra_feature_channels=1, input_dim=3
2023-08-25 05:10:56.611 | INFO | models.vae_adain:__init__:54 - [Build Model] style_encoder: models.shapelatent_modules.PointNetPlusEncoder, encoder: models.latent_points_ada.PointTransPVC, decoder: models.latent_points_ada.LatentPointDecPVC
2023-08-25 05:10:57.821 | INFO | trainers.hvae_trainer:__init__:53 - broadcast_params: device=cuda:0
2023-08-25 05:10:57.821 | INFO | trainers.base_trainer:build_other_module:725 - no other module to build
2023-08-25 05:10:57.821 | INFO | trainers.base_trainer:build_data:152 - start build_data
2023-08-25 05:10:58.309 | INFO | datasets.pointflow_datasets:get_datasets:400 - get_datasets: tr_sample_size=100000, te_sample_size=100000; random_subsample=1 normalize_global=True normalize_std_per_axix=False normalize_per_shape=False recenter_per_shape=False
searching: pointflow, get: data/transform_data/
2023-08-25 05:10:58.309 | INFO | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: train, full path: data/transform_data/; norm global=True, norm-box=False
2023-08-25 05:10:58.311 | INFO | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [1146] under: data/transform_data/house/train
2023-08-25 05:10:59.056 | INFO | datasets.pointflow_datasets:__init__:206 - [DATA] Load data time: 0.7s | dir: ['house'] | sample_with_replacement: 1; num points: 1146
2023-08-25 05:11:01.133 | INFO | datasets.pointflow_datasets:__init__:272 - [DATA] normalize_global: mean=[-0.00376302 -0.07752005 -0.00340251], std=[0.2103262]
2023-08-25 05:11:02.244 | INFO | datasets.pointflow_datasets:__init__:279 - [DATA] shape=(1146, 100000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.746, min=-2.361; num-pts=100000
searching: pointflow, get: data/transform_data/
2023-08-25 05:11:02.306 | INFO | datasets.pointflow_datasets:__init__:132 - [DATA] cat: house, split: val, full path: data/transform_data/; norm global=True, norm-box=False
2023-08-25 05:11:02.307 | INFO | datasets.pointflow_datasets:__init__:182 - [DATA] number of file [135] under: data/transform_data/house/val
2023-08-25 05:11:02.423 | INFO | datasets.pointflow_datasets:__init__:206 - [DATA] Load data time: 0.1s | dir: ['house'] | sample_with_replacement: 1; num points: 135
2023-08-25 05:11:02.763 | INFO | datasets.pointflow_datasets:__init__:279 - [DATA] shape=(135, 200000, 3), all_points_mean:=(1, 1, 3), std=(1, 1, 1), max=2.454, min=-2.361; num-pts=100000
2023-08-25 05:11:02.772 | INFO | datasets.pointflow_datasets:get_data_loaders:469 - [Batch Size] train=1, test=10; drop-last=1
2023-08-25 05:11:02.775 | INFO | trainers.hvae_trainer:__init__:75 - done init trainer @cuda:0
2023-08-25 05:11:03.076 | INFO | trainers.base_trainer:prepare_vis_data:685 - [prepare_vis_data] len of train_loader: 1146
train_loader: <torch.utils.data.dataloader.DataLoader object at 0x7f6ea9aff430>
tr_x[-1].shape: torch.Size([1, 100000, 3])
2023-08-25 05:11:03.385 | INFO | trainers.base_trainer:prepare_vis_data:704 - tr_x: torch.Size([16, 100000, 3]), m_pcs: torch.Size([16, 1, 3]), s_pcs: torch.Size([16, 1, 1]), val_x: torch.Size([16, 100000, 3])
2023-08-25 05:11:03.399 | INFO | __main__:main:47 - param size = 22.402731M
2023-08-25 05:11:03.400 | INFO | trainers.base_trainer:set_writer:57 -
----------
----------
2023-08-25 05:11:03.403 | INFO | __main__:main:70 - not find any checkpoint: ../exp/0825/house/21dd03h_hvae_lion_B1N100000/checkpoints, (exist=False), or snapshot ../exp/0825/house/21dd03h_hvae_lion_B1N100000/checkpoints/snapshot, (exist=False)
2023-08-25 05:11:03.403 | INFO | trainers.base_trainer:train_epochs:173 - [rank=0] Start epoch: 0 End epoch: 8000, batch-size=1 | Niter/epo=1146 | log freq=1146, viz freq 458400, val freq 200
context.shape forward( torch.Size([1, 400000])
context.shape[1] forward( 400000
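The 400000-dim context printed above is consistent with the run settings from the command line: 100000 latent points, each carrying input_dim=3 coordinates plus latent_dim=1 extra feature. A quick sanity check:

```python
num_points = 100_000  # shapelatent.decoder_num_points / data.tr_max_sample_points
input_dim = 3         # ddpm.input_dim
latent_dim = 1        # shapelatent.latent_dim
context_dim = num_points * (input_dim + latent_dim)
print(context_dim)  # 400000
```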
vis_recont: vis/latent_pts torch.Size([1, 100000, 3])
x_0_pred: torch.Size([1, 100000, 3])
x_0: torch.Size([1, 100000, 3])
> /home/bim-group/Documents/GitHub/LION/trainers/base_trainer.py(373)vis_recont()
-> x_list.append(v[b])
(Pdb) p b.shape
*** AttributeError: 'int' object has no attribute 'shape'
(Pdb) p b
0
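The AttributeError in the pdb session is expected: `b` is a plain Python int (a batch index), and ints have no `.shape`. Reproduced standalone:

```python
b = 0  # batch index, as in the pdb session above
try:
    b.shape
except AttributeError as e:
    print(e)  # 'int' object has no attribute 'shape'
```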
Hi @kg571852741, I see your dataset is some sort of outdoor scene. Can you elaborate on how you prepare your custom dataset?
Thanks in advance.
Hi @aldinorizaldy. Sorry for the late reply. The work was done a very long time ago and I really cannot remember the settings, but I think I followed the organized data folder structure used for 'cifar10'?
Hi @ZENGXH, thanks for your hard work. I am testing a custom dataset of shape (1076, 200000, 3), i.e. point clouds with 200k points each. I've adjusted a few lines in pointflow_datasets.py; however, the final shape doesn't match in models/latent_points_ada.py. Any way to solve it, or suggestions?

Revised a few lines of code:

```python
self.te_sample_size = min(5000, te_sample_size)
```

and the train_vae_sh settings.
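For what it's worth, random subsampling from 200k points down to the configured sample size can be sketched as below (my own minimal version, assuming an [N, 3] NumPy array; the repo's data.random_subsample path may differ):

```python
import numpy as np

def random_subsample(points, n_sample, seed=0):
    """Pick n_sample points from an [N, 3] cloud without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(points.shape[0], size=n_sample, replace=False)
    return points[idx]

cloud = np.zeros((200_000, 3))          # one 200k-point shape
sub = random_subsample(cloud, 100_000)  # match tr_max_sample_points
print(sub.shape)  # (100000, 3)
```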