google-research / xmcgan_image_generation

error: train.sh: line 24: 45523 Segmentation fault (core dumped) #17

Open zzy994491827 opened 2 years ago

zzy994491827 commented 2 years ago

error:

I1216 05:03:33.303731 140638140207296 utils.py:31] Checkpoint.restore_or_initialize() ...
I1216 05:03:33.304307 140638140207296 checkpoint.py:301] No checkpoint specified. Restore the latest checkpoint.
I1216 05:03:33.304460 140638140207296 utils.py:31] MultihostCheckpoint.get_latest_checkpoint_to_restore_from() ...
I1216 05:03:33.312287 140638140207296 checkpoint.py:430] Checked checkpoint base_directories: ['path/to/exp/exp_name/checkpoints-0'] - common_numbers={1} - exclusive_numbers=set()
I1216 05:03:33.312516 140638140207296 utils.py:41] MultihostCheckpoint.get_latest_checkpoint_to_restore_from() finished after 0.01s.
I1216 05:03:33.312650 140638140207296 checkpoint.py:307] Restoring checkpoint: path/to/exp/exp_name/checkpoints-0/ckpt-1
2021-12-16 05:03:33.316385: W ./tensorflow/core/framework/dataset.h:550] Failed precondition: StatelessRandomGetKeyCounter is stateful.
I1216 05:03:45.659061 140638140207296 checkpoint.py:312] Restored save_counter=1 restored_checkpoint=path/to/exp/exp_name/checkpoints-0/ckpt-1
I1216 05:03:45.659443 140638140207296 utils.py:41] Checkpoint.restore_or_initialize() finished after 12.36s.
I1216 05:03:47.525738 140590360545024 logging_writer.py:56] Hyperparameters: {'architecture': 'xmc_net', 'batch_norm_group_size': -1, 'batch_size': 8, 'beta1': 0.5, 'beta2': 0.999, 'checkpoint_every_steps': 5000, 'coco_version': '2014', 'cond_size': 16, 'd_lr': 0.0004, 'd_spectral_norm': True, 'd_step_per_g_step': 14, 'data_dir': 'data/', 'dataset': 'mscoco', 'df_dim': 96, 'dtype': 'bfloat16', 'eval_avg_num': 3, 'eval_batch_size': 4, 'eval_every_steps': 1000, 'eval_num': 30000, 'g_lr': 0.0001, 'g_spectral_norm': False, 'gamma_for_g': 15, 'gf_dim': 96, 'image_contrastive': True, 'image_size': 128, 'log_loss_every_steps': 1000, 'model_name': 'xmc', 'num_epochs': 500, 'num_train_steps': -1, 'polyak_decay': 0.999, 'pretrained_image_contrastive': True, 'return_filename': False, 'return_text': False, 'seed': 42, 'sentence_contrastive': True, 'show_num': 64, 'shuffle_buffer_size': 1000, 'train_shuffle': True, 'trial': 0, 'word_contrastive': True, 'z_dim': 128}
I1216 05:03:47.528530 140638140207296 train_utils.py:404] Starting training loop at step 1.
/root/yes/envs/py39/lib/python3.9/site-packages/jax/_src/profiler.py:166: UserWarning: StepTraceContext has been renamed to StepTraceAnnotation. This alias will eventually be removed; please update your code.
  warnings.warn(
Fatal Python error: Segmentation fault

Thread 0x00007fddbdffb700 (most recent call first):
  File "/root/yes/envs/py39/lib/python3.9/concurrent/futures/thread.py", line 75 in _worker
  File "/root/yes/envs/py39/lib/python3.9/threading.py", line 910 in run
  File "/root/yes/envs/py39/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/root/yes/envs/py39/lib/python3.9/threading.py", line 930 in _bootstrap

Thread 0x00007fddbe7fc700 (most recent call first):
  File "/root/yes/envs/py39/lib/python3.9/concurrent/futures/thread.py", line 75 in _worker
  File "/root/yes/envs/py39/lib/python3.9/threading.py", line 910 in run
  File "/root/yes/envs/py39/lib/python3.9/threading.py", line 973 in _bootstrap_inner
  File "/root/yes/envs/py39/lib/python3.9/threading.py", line 930 in _bootstrap

Current thread 0x00007fe8de6390c0 (most recent call first):
  File "/root/yes/envs/py39/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 1955 in shape
  File "<__array_function__ internals>", line 5 in shape
  File "/root/yes/envs/py39/lib/python3.9/site-packages/jax/_src/api.py", line 1307 in
  File "/root/yes/envs/py39/lib/python3.9/site-packages/jax/_src/api.py", line 1307 in _mapped_axis_size
  File "/root/yes/envs/py39/lib/python3.9/site-packages/jax/_src/api.py", line 1633 in f_pmapped
  File "/root/yes/envs/py39/lib/python3.9/site-packages/jax/_src/api.py", line 1725 in f_pmapped
  File "/root/yes/envs/py39/lib/python3.9/site-packages/jax/_src/traceback_util.py", line 162 in reraise_with_filtered_traceback
  File "/xmc_gan/xmcgan/train_utils.py", line 424 in train
  File "/xmc_gan/xmcgan/main.py", line 62 in main
  File "/root/yes/envs/py39/lib/python3.9/site-packages/absl/app.py", line 251 in _run_main
  File "/root/yes/envs/py39/lib/python3.9/site-packages/absl/app.py", line 303 in run
  File "/xmc_gan/xmcgan/main.py", line 70 in
  File "/root/yes/envs/py39/lib/python3.9/runpy.py", line 87 in _run_code
  File "/root/yes/envs/py39/lib/python3.9/runpy.py", line 197 in _run_module_as_main
train.sh: line 24: 45523 Segmentation fault (core dumped) CUDA_VISIBLE_DEVICES="0,1,2,3" python -m xmcgan.main --config="$CONFIG" --mode="train" --workdir="$WORKDIR"

details:
config.batch_size = 8
config.d_step_per_g_step = 14
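Since the crash surfaces inside a `pmap` call, one thing that may be worth ruling out is how the batch divides across the GPUs listed in `CUDA_VISIBLE_DEVICES`. The sketch below only does that arithmetic. It assumes the global batch is split evenly across visible devices, which the log does not confirm, and `per_device_batch` is a hypothetical helper, not part of the xmcgan code.

```python
def per_device_batch(global_batch_size: int, visible_devices: str) -> int:
    """Return the per-device batch size for an even split.

    Hypothetical helper: assumes one shard per device listed in a
    CUDA_VISIBLE_DEVICES-style string such as "0,1,2,3".
    """
    # Count the non-empty device ids in the comma-separated string.
    device_ids = [d for d in visible_devices.split(",") if d.strip()]
    num_devices = len(device_ids)
    if num_devices == 0:
        raise ValueError("no visible devices listed")
    if global_batch_size % num_devices != 0:
        raise ValueError(
            f"batch size {global_batch_size} is not divisible "
            f"by {num_devices} devices"
        )
    return global_batch_size // num_devices


if __name__ == "__main__":
    # Values taken from the report: batch_size = 8, four visible GPUs.
    print(per_device_batch(8, "0,1,2,3"))
```

With the values from this report the split is even (2 per device), so under that assumption the batch size alone would not explain the crash.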

Has anyone else run into this error?

hyeonjinXZ commented 2 years ago

I also ran into this segmentation fault. Changing the TensorFlow version resolved it for me. Did you install the dependencies from the requirement.txt that the authors uploaded?
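To compare an environment against the pinned versions, it may help to print what is actually installed. This is a minimal sketch using only the standard library. The package list is an assumption based on the libraries that appear in the traceback (`jaxlib` is added as a guess, since it ships the compiled part of JAX), and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Assumed list of packages worth checking; adjust to match the pinned file.
PACKAGES = ("tensorflow", "jax", "jaxlib", "numpy")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

The printed versions can then be checked by hand against the pinned file.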