google-research / vision_transformer

Apache License 2.0

Segmentation fault when fine-tuning #174

Open yjqiu opened 2 years ago

yjqiu commented 2 years ago

Hi, I am trying to fine-tune the b16 model on CIFAR-10 with the following command on a Google Cloud VM with one V100 GPU. The run starts training normally and then crashes with a segmentation fault after a few steps.

Command

python -m vit_jax.main --workdir=${work_dir} --config=${src_dir}/vit_jax/configs/vit.py:b16,cifar10 --config.pretrained_dir='gs://vit_models/imagenet21k' --config.accum_steps=64 --config.warmup_steps=50 --config.total_steps=500

Error with recent logs

W0414 16:33:32.212276 139680381712192 dispatch.py:184] Finished XLA compilation of update_fn in 45.82162594795227 sec
I0414 16:33:49.613650 139680381712192 train.py:185] First step took 79.3 seconds.
I0414 16:34:55.306408 139680381712192 local.py:41] Setting work unit notes: 0.1 steps/s, 1.0% (5/500), ETA: 2h15m
I0414 16:34:55.307062 139680381712192 logging_writer.py:35] [5] steps_per_sec=0.060890
I0414 16:36:01.285386 139680381712192 local.py:41] Setting work unit notes: 0.1 steps/s, 1.8% (9/500), ETA: 2h14m
I0414 16:36:01.286010 139680381712192 logging_writer.py:35] [9] steps_per_sec=0.060625
2022-04-14 16:36:17.794655: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcupti.so.11.4'; dlerror: libcupti.so.11.4: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
I0414 16:36:18.475741 139680381712192 logging_writer.py:35] [10] img_sec_core_train=30.955834, train_loss=2.295228
I0414 16:36:18.481024 139680381712192 train.py:200] Step: 10/500 2.0%, img/sec/core: 31.0, ETA: 2.03h
I0414 16:37:08.835929 139680381712192 local.py:41] Setting work unit notes: 0.1 steps/s, 2.6% (13/500), ETA: 2h17m
I0414 16:37:08.836521 139680381712192 logging_writer.py:35] [13] steps_per_sec=0.059215
I0414 16:38:03.844941 139680381712192 local.py:51] Created artifact [10] Profile of type ArtifactType.URL and value None.
Fatal Python error: Segmentation fault

Thread 0x00007f088cd77700 (most recent call first):
  File "/opt/conda/lib/python3.7/threading.py", line 300 in wait
  File "/opt/conda/lib/python3.7/threading.py", line 552 in wait
  File "/opt/conda/lib/python3.7/threading.py", line 1175 in run
  File "/opt/conda/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/opt/conda/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f09df88b740 (most recent call first):
  File "/home/yunjiang/vision_transformer/env/lib/python3.7/site-packages/jax/_src/dispatch.py", line 674 in _device_put_array
  File "/home/yunjiang/vision_transformer/env/lib/python3.7/site-packages/jax/_src/dispatch.py", line 663 in device_put
  File "/home/yunjiang/vision_transformer/env/lib/python3.7/site-packages/jax/_src/api.py", line 2712 in <listcomp>
  File "/home/yunjiang/vision_transformer/env/lib/python3.7/site-packages/jax/_src/api.py", line 2711 in _device_put_sharded
  File "/home/yunjiang/vision_transformer/env/lib/python3.7/site-packages/jax/_src/tree_util.py", line 180 in <genexpr>
  File "/home/yunjiang/vision_transformer/env/lib/python3.7/site-packages/jax/_src/tree_util.py", line 180 in tree_map
  File "/home/yunjiang/vision_transformer/env/lib/python3.7/site-packages/jax/_src/api.py", line 2716 in device_put_sharded
  File "/home/yunjiang/vision_transformer/env/lib/python3.7/site-packages/flax/jax_utils.py", line 146 in _prefetch
  File "/home/yunjiang/vision_transformer/env/lib/python3.7/site-packages/jax/_src/tree_util.py", line 180 in <genexpr>
run.sh: line 11: 23092 Segmentation fault      python -m vit_jax.main --workdir=${work_dir} --config=${src_dir}/vit_jax/configs/vit.py:b16,cifar10 --config.pretrained_dir='gs://vit_models/imagenet21k' --config.accum_steps=64 --config.warmup_steps=50 --config.total_steps=500
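
One thing that may be worth ruling out, going only by the log above: the run warns that `libcupti.so.11.4` could not be loaded, and the crash follows shortly after the "Created artifact [10] Profile" line. Whether the two are actually related is not confirmed anywhere in this thread. The snippet below is a minimal sketch, using only the Python standard library, that checks whether the dynamic loader can open that library in the same environment. The helper name `can_load` is made up for illustration, and the library name is copied from the warning, so it may need adjusting to the installed CUDA version.

```python
import ctypes


def can_load(lib_name: str) -> bool:
    """Return True if the dynamic loader can open lib_name, else False."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be opened.
        return False


if __name__ == "__main__":
    # Library name taken from the warning in the log above (assumption:
    # the installed CUDA toolkit matches this version).
    for name in ["libcupti.so.11.4"]:
        status = "loadable" if can_load(name) else "NOT loadable"
        print(f"{name}: {status}")
```

Run it with the same shell setup as `run.sh`, so that `LD_LIBRARY_PATH` matches the one printed in the warning. If the library is reported as not loadable, that only confirms the warning; it does not by itself explain the segmentation fault.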

iamstarlee commented 5 months ago

I ran into exactly the same problem. Any ideas?