intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

Program hangs when running the yoloV3 sample on the 0.12.0-SNAPSHOT docker container #49

Open haosux opened 2 years ago

haosux commented 2 years ago

Download the analytics-zoo docker image:
docker pull intelanalytics/hyper-zoo:0.12.0-SNAPSHOT

Download the yolov3 example from https://github.com/intel-analytics/analytics-zoo/tree/master/pyzoo/zoo/examples/orca/learn/tf2/yolov3

Install tensorflow, pyarrow, and aioredis:
pip install tensorflow==2.4.1
pip install pyarrow
pip install aioredis==1.3.1

Then run the example inside the docker container:
python yoloV3.py --data_dir ./ --weights yolov3.weights --class_num 20 --names voc2012.names
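One way to fetch the example referenced above, inside the container, is to clone the repo and work from the example directory. This is only a sketch: the clone location is arbitrary, and it assumes the weights and names files have already been copied into that directory.

# Clone the analytics-zoo repo to get the yolov3 example (the /opt/work path is illustrative)
git clone https://github.com/intel-analytics/analytics-zoo.git /opt/work/analytics-zoo
cd /opt/work/analytics-zoo/pyzoo/zoo/examples/orca/learn/tf2/yolov3

# yolov3.weights and voc2012.names are assumed to be present in the current directory
python yoloV3.py --data_dir ./ --weights yolov3.weights --class_num 20 --names voc2012.names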

The program hangs right at startup; no further progress is logged. Here is the log:

root@workgpu:/apps# python yoloV3.py --data_dir ./ --weights yolov3.weights --class_num 20 --names voc2012.names
2021-10-13 07:01:32.434622: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:32.434638: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Initializing orca context
Current pyspark location is : /opt/spark/python/lib/pyspark.zip/pyspark/__init__.py
Start to getOrCreate SparkContext
pyspark_submit_args is: --driver-class-path /opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-jar-with-dependencies.jar pyspark-shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2021-10-13 07:01:34 WARN Utils:66 - Your hostname, workgpu resolves to a loopback address: 127.0.1.1; using 10.67.109.63 instead (on interface enp0s31f6)
2021-10-13 07:01:34 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2021-10-13 07:01:34 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

User settings:

KMP_AFFINITY=granularity=fine,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=4

Effective settings:

KMP_ABORT_DELAY=0 KMP_ADAPTIVE_LOCK_PROPS='1,1024' KMP_ALIGN_ALLOC=64 KMP_ALL_THREADPRIVATE=128 KMP_ATOMIC_MODE=2 KMP_BLOCKTIME=0 KMP_CPUINFO_FILE: value is not defined KMP_DETERMINISTIC_REDUCTION=false KMP_DEVICE_THREAD_LIMIT=2147483647 KMP_DISP_HAND_THREAD=false KMP_DISP_NUM_BUFFERS=7 KMP_DUPLICATE_LIB_OK=false KMP_FORCE_REDUCTION: value is not defined KMP_FOREIGN_THREADS_THREADPRIVATE=true KMP_FORKJOIN_BARRIER='2,2' KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper' KMP_FORKJOIN_FRAMES=true KMP_FORKJOIN_FRAMES_MODE=3 KMP_GTID_MODE=3 KMP_HANDLE_SIGNALS=false KMP_HOT_TEAMS_MAX_LEVEL=1 KMP_HOT_TEAMS_MODE=0 KMP_INIT_AT_FORK=true KMP_ITT_PREPARE_DELAY=0 KMP_LIBRARY=throughput KMP_LOCK_KIND=queuing KMP_MALLOC_POOL_INCR=1M KMP_MWAIT_HINTS=0 KMP_NUM_LOCKS_IN_BLOCK=1 KMP_PLAIN_BARRIER='2,2' KMP_PLAIN_BARRIER_PATTERN='hyper,hyper' KMP_REDUCTION_BARRIER='1,1' KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper' KMP_SCHEDULE='static,balanced;guided,iterative' KMP_SETTINGS=true KMP_SPIN_BACKOFF_PARAMS='4096,100' KMP_STACKOFFSET=64 KMP_STACKPAD=0 KMP_STACKSIZE=8M KMP_STORAGE_MAP=false KMP_TASKING=2 KMP_TASKLOOP_MIN_TASKS=0 KMP_TASK_STEALING_CONSTRAINT=1 KMP_TEAMS_THREAD_LIMIT=16 KMP_TOPOLOGY_METHOD=all KMP_USER_LEVEL_MWAIT=false KMP_USE_YIELD=1 KMP_VERSION=false KMP_WARNINGS=true OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}' OMP_ALLOCATOR=omp_default_mem_alloc OMP_CANCELLATION=false OMP_DEBUG=disabled OMP_DEFAULT_DEVICE=0 OMP_DISPLAY_AFFINITY=false OMP_DISPLAY_ENV=false OMP_DYNAMIC=false OMP_MAX_ACTIVE_LEVELS=2147483647 OMP_MAX_TASK_PRIORITY=0 OMP_NESTED=false OMP_NUM_THREADS='4' OMP_PLACES: value is not defined OMP_PROC_BIND='intel' OMP_SCHEDULE='static' OMP_STACKSIZE=8M OMP_TARGET_OFFLOAD=DEFAULT OMP_THREAD_LIMIT=2147483647 OMP_TOOL=enabled OMP_TOOL_LIBRARIES: value is not defined OMP_WAIT_POLICY=PASSIVE KMP_AFFINITY='noverbose,warnings,respect,granularity=fine,compact,1,0'

cls.getname: com.intel.analytics.bigdl.python.api.Sample
BigDLBasePickler registering: bigdl.util.common Sample
cls.getname: com.intel.analytics.bigdl.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.util.common EvaluatedResult
cls.getname: com.intel.analytics.bigdl.python.api.JTensor
BigDLBasePickler registering: bigdl.util.common JTensor
cls.getname: com.intel.analytics.bigdl.python.api.JActivity
BigDLBasePickler registering: bigdl.util.common JActivity
Successfully got a SparkContext
2021-10-13 07:01:37,060 INFO services.py:1174 -- View the Ray dashboard at http://10.67.109.63:8265
2021-10-13 07:01:37,062 WARNING services.py:1628 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=Xgb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 2gb.
{'node_ip_address': '10.67.109.63', 'raylet_ip_address': '10.67.109.63', 'redis_address': '10.67.109.63:22512', 'object_store_address': '/tmp/ray/session_2021-10-13_07-01-36_425642_4250/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-10-13_07-01-36_425642_4250/sockets/raylet', 'webui_url': '10.67.109.63:8265', 'session_dir': '/tmp/ray/session_2021-10-13_07-01-36_425642_4250', 'metrics_export_port': 45616, 'node_id': '87251d54b569d8e0b12621e02749458ce3df693513d903ec76d89a83'}
2021-10-13 07:01:38.421235: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-10-13 07:01:38.421397: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-10-13 07:01:38.421411: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-10-13 07:01:38.421428: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (workgpu): /proc/driver/nvidia/version does not exist
2021-10-13 07:01:38.421598: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-13 07:01:38.422091: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[Stage 0:> (0 + 1) / 1]2021-10-13 07:01:42.028476: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:42.028492: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[Stage 1:> (0 + 4) / 4]2021-10-13 07:01:45.034699: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:45.034699: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:45.034699: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:45.034716: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-10-13 07:01:45.034716: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-10-13 07:01:45.034716: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-10-13 07:01:45.083740: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:45.083757: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[Stage 3:============================================> (3 + 1) / 4]2021-10-13 07:01:47.692089: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:47.692103: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[Stage 5:============================================> (3 + 1) / 4]2021-10-13 07:01:49.562167: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:49.562186: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[Stage 7:============================================> (3 + 1) / 4]2021-10-13 07:01:51.366770: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:51.366783: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[Stage 9:============================================> (3 + 1) / 4]2021-10-13 07:01:53.397010: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:53.397023: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[Stage 11:===========================================> (3 + 1) / 4]2021-10-13 07:01:55.184923: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:55.184938: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[Stage 13:===========================================> (3 + 1) / 4]2021-10-13 07:01:56.965577: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:56.965591: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-10-13 07:01:58.722887: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-13 07:01:58.722901: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-10-13 07:01:59.789241: I tensorflow/core/profiler/lib/profiler_session.cc:136] Profiler session initializing.
2021-10-13 07:01:59.789267: I tensorflow/core/profiler/lib/profiler_session.cc:155] Profiler session started.
2021-10-13 07:01:59.789314: I tensorflow/core/profiler/lib/profiler_session.cc:172] Profiler session tear down.
(pid=4475) 2021-10-13 07:02:00.196573: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
(pid=4475) 2021-10-13 07:02:00.196603: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
(pid=4475) WARNING:tensorflow:From /opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/learn/tf2/tf_runner.py:317: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version.
(pid=4475) Instructions for updating:
(pid=4475) use distribute.MultiWorkerMirroredStrategy instead
(pid=4475) 2021-10-13 07:02:01.164408: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
(pid=4475) 2021-10-13 07:02:01.164526: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
(pid=4475) 2021-10-13 07:02:01.164536: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
(pid=4475) 2021-10-13 07:02:01.164546: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (workgpu): /proc/driver/nvidia/version does not exist
(pid=4475) 2021-10-13 07:02:01.164948: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
(pid=4475) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(pid=4475) 2021-10-13 07:02:01.165041: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
(pid=4475) 2021-10-13 07:02:01.165224: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
(pid=4475) 2021-10-13 07:02:01.198298: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.67.109.63:44639}
(pid=4475) 2021-10-13 07:02:01.198506: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://10.67.109.63:44639
(pid=4475) WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8
(pid=4475) WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-9
(pid=4475) WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-10
(pid=4475) WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-11
(pid=4475) WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
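One environment detail from the Ray warning in the log above: /dev/shm inside the container is only 64 MB, so the object store falls back to /tmp. Whether or not this is related to the hang, the log's own suggestion can be applied by relaunching the container with a larger shared memory size. This is only a sketch; everything besides --shm-size (the interactive shell, absence of volume mounts, the exact size) is an assumption.

# Relaunch the container with more shared memory, as the Ray warning suggests (>2 GB)
docker run -it \
  --shm-size=4g \
  intelanalytics/hyper-zoo:0.12.0-SNAPSHOT \
  bash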

Adria777 commented 2 years ago

Please try:

pip install pyarrow && \
pip install opencv-python==4.2.0.34 && \
pip install aioredis==1.1.0 && \
pip install tensorflow==2.4.0
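The same pins as a single command, plus an optional version check before re-running yoloV3.py (the check is an addition of mine, not part of the original suggestion):

pip install pyarrow opencv-python==4.2.0.34 aioredis==1.1.0 tensorflow==2.4.0

# Optional: confirm the installed versions before re-running the example
python -c "import tensorflow as tf, pyarrow, aioredis; print(tf.__version__, pyarrow.__version__, aioredis.__version__)"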