jw9730 / setvae

[CVPR'21] SetVAE: Learning Hierarchical Composition for Generative Modeling of Set-Structured Data, in PyTorch
MIT License

Unable to load checkpoint #5

Closed JohanYe closed 2 years ago

JohanYe commented 2 years ago

Initially, I received the error "maximum recursion depth exceeded while calling a Python object". I then raised the recursion limit to a very large value:

    import sys, threading
    sys.setrecursionlimit(10**7)

After that, training exits without any informative error:

[2022-08-26 06:40:38,321] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 1122823
[2022-08-26 06:40:38,321] [ERROR] [launch.py:292:sigkill_handler] ['/nfshome/USERNAME/.conda/envs/setvae/bin/python', '-u', 'train.py', '--local_rank=0', '--cates', 'tooth', '--input_dim', '3', '--max_outputs', '2500', '--init_dim', '32', '--n_mixtures', '4', '--z_dim', '16', '--z_scales', '1', '1', '2', '4', '8', '16', '32', '--hidden_dim', '64', '--num_heads', '4', '--num_workers', '0', '--kl_warmup_epochs', '250', '--fixed_gmm', '--train_gmm', '--lr', '1e-3', '--beta', '1.0', '--epochs', '1000', '--dataset_type', 'shapenet15k', '--log_name', 'gen/shapenet15k/camera-ready', '--shapenet_data_dir', '/train/SetVae/ShapeNetCore.v2.PC15k', '--save_freq', '25', '--viz_freq', '1000', '--log_freq', '10', '--val_freq', '10000', '--scheduler', 'linear', '--slot_att', '--ln', '--eval', '--seed', '42', '--distributed', '--deepspeed_config', 'batch_size.json'] exits with return code = -11
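For anyone reading the log above: a negative return code usually means the child process was killed by a signal rather than exiting on its own. Assuming the launcher reports codes the way Python's subprocess module does on POSIX (an assumption, not something confirmed in this thread), -11 would correspond to signal 11. A minimal sketch for decoding such a code, where `describe_return_code` is a hypothetical helper and not part of DeepSpeed or SetVAE:

```python
import signal


def describe_return_code(code: int) -> str:
    """Hypothetical helper: explain a subprocess-style return code."""
    if code < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = f"unknown signal {-code}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


print(describe_return_code(-11))
```

On Linux this should report SIGSEGV, i.e. a segmentation fault. That would be consistent with a crash in native code, or with the C stack overflowing after the recursion limit was raised very high, but the log alone does not say which.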

JohanYe commented 2 years ago

Turns out I had accidentally updated the deepspeed package; downgrading back to 0.3.13 fixed the issue.
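To catch this kind of accidental upgrade early, a quick check of the installed version against the one reported to work here (0.3.13) might help. This is only a sketch: `installed_version` and `matches_pin` are hypothetical helper names, and it assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib import metadata

EXPECTED_DEEPSPEED = "0.3.13"  # version reported to work in this thread


def installed_version(package: str):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(package: str, expected: str) -> bool:
    """True only if the package is installed at exactly the expected version."""
    return installed_version(package) == expected


found = installed_version("deepspeed")
if found != EXPECTED_DEEPSPEED:
    print(f"deepspeed is {found!r}, expected {EXPECTED_DEEPSPEED!r}")
```

If the versions differ, `pip install deepspeed==0.3.13` inside the setvae environment should restore the pinned version.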