intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

Cluster Serving Tensorflow backend loading failed due to Duplicate registration of device factory for type XLA_CPU #103

Closed qiyuangong closed 3 years ago

qiyuangong commented 3 years ago

Submitting a second TensorFlow inference job to the same Flink TaskManager triggers the following error while the TensorFlow native libraries are loading:

2021-09-13 18:26:11.363582: F tensorflow/core/common_runtime/device_factory.cc:78] Duplicate registration of device factory for type XLA_CPU with the same priority 50

First job loading log

linux-x86_64/libiomp5.so
linux-x86_64/libmklml_intel.so
linux-x86_64/libtensorflow_framework-zoo.so
linux-x86_64/libtensorflow_jni.so
2021-09-13 16:58:03.641916: I tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /tmp/flink-dist-cache-f3ed34d3-7966-41e6-8619-62bf7c2201b0/7b3dd6f4328922f26a5b2a421349293e/tf_res50_saved
2021-09-13 16:58:03.695482: I tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2021-09-13 16:58:03.787321: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX2 AVX512F FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-13 16:58:03.818510: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2021-09-13 16:58:03.820526: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f9f68eb2010 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-09-13 16:58:03.820581: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-09-13 16:58:03.820732: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2021-09-13 16:58:04.084783: I tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
2021-09-13 16:58:04.935499: I tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: /tmp/flink-dist-cache-f3ed34d3-7966-41e6-8619-62bf7c2201b0/7b3dd6f4328922f26a5b2a421349293e/tf_res50_saved
2021-09-13 16:58:05.118945: I tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 1477059 microseconds.

Second job loading log (the load fails)

2021-09-13 16:58:04.084783: I tensorflow/cc/saved_model/loader.cc:202] Restoring SavedModel bundle.
2021-09-13 16:58:04.935499: I tensorflow/cc/saved_model/loader.cc:151] Running initialization op on SavedModel bundle at path: /tmp/flink-dist-cache-f3ed34d3-7966-41e6-8619-62bf7c2201b0/7b3dd6f4328922f26a5b2a421349293e/tf_res50_saved
2021-09-13 16:58:05.118945: I tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 1477059 microseconds.
linux-x86_64/libiomp5.so
linux-x86_64/libmklml_intel.so
linux-x86_64/libtensorflow_framework-zoo.so
linux-x86_64/libtensorflow_jni.so
2021-09-13 18:26:11.363582: F tensorflow/core/common_runtime/device_factory.cc:78] Duplicate registration of device factory for type XLA_CPU with the same priority 50

Litchilitchy commented 3 years ago

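This can probably be reproduced by loading a second time in the same process.

I tried to reproduce it in two ways:

  • load TF and predict, then load again and predict again.
  • use a sub-thread to load and predict, stop the thread, then start another identical thread.

Neither attempt reproduced the error.

It seems that when a Flink task ends, some state is cleaned up, but the corresponding state in TF core management is not, which causes the repeated loading.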

qiyuangong commented 3 years ago

This can probably be reproduced by loading a second time in the same process.

I tried to reproduce it in two ways:

  • load TF and predict, then load again and predict again.
  • use a sub-thread to load and predict, stop the thread, then start another identical thread.

Neither attempt reproduced the error.

It seems that when a Flink task ends, some state is cleaned up, but the corresponding state in TF core management is not, which causes the repeated loading.

Checked with @Litchilitchy on JDK 8 and JDK 11: a simple Java example cannot reproduce this error. It seems to be related to Flink's classloader design. https://blog.csdn.net/lijianqingfeng/article/details/107093632

qiyuangong commented 3 years ago

Problem solved by putting the jar into flink/lib. :) Closing the issue.
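
For readers wondering why the jar location could matter, here is a minimal sketch, assuming the cause really is per-job classloading as suspected earlier in this thread. Flink loads job jars through a separate user-code classloader per job, while jars in flink/lib sit on the TaskManager JVM classpath and are shared. Everything in the sketch is hypothetical and stands in for the real components: `NativeStub` plays the role of a class whose static initializer loads native code, `REGISTRATIONS` plays the role of a process-wide native registry, and `IsolatingLoader` imitates a child-first job classloader. It is not Flink or Analytics Zoo code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicInteger;

public class ClassLoaderIsolationDemo {

    // Hypothetical stand-in for a process-wide native registry.
    public static final AtomicInteger REGISTRATIONS = new AtomicInteger();

    /** Hypothetical stand-in for a JNI wrapper: its static initializer "registers" once per loaded copy. */
    public static class NativeStub {
        static {
            REGISTRATIONS.incrementAndGet();
        }
    }

    /** Child-first for one class only: defines its own copy instead of asking the parent. */
    static final class IsolatingLoader extends ClassLoader {
        private final String isolatedName;

        IsolatingLoader(ClassLoader parent, String isolatedName) {
            super(parent);
            this.isolatedName = isolatedName;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals(isolatedName)) {
                return super.loadClass(name, resolve); // parent-first for everything else
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    byte[] bytes = readClassBytes(name);
                    c = defineClass(name, bytes, 0, bytes.length);
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }

        // Reads the compiled class file through the parent loader's resources.
        private byte[] readClassBytes(String name) throws ClassNotFoundException {
            String resource = name.replace('.', '/') + ".class";
            try (InputStream in = getParent().getResourceAsStream(resource)) {
                if (in == null) {
                    throw new ClassNotFoundException(name);
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    /** Loads and initializes NativeStub the way a "job" would, with or without isolation. */
    static Class<?> loadForJob(boolean isolated) throws ClassNotFoundException {
        String name = NativeStub.class.getName();
        ClassLoader parent = ClassLoaderIsolationDemo.class.getClassLoader();
        ClassLoader jobLoader = isolated
                ? new IsolatingLoader(parent, name)   // class shipped inside the job jar
                : new ClassLoader(parent) { };        // class only visible via the shared parent
        return Class.forName(name, true, jobLoader);
    }

    public static void main(String[] args) throws Exception {
        Class<?> job1 = loadForJob(true);
        Class<?> job2 = loadForJob(true);
        System.out.println("isolated loaders share the class: " + (job1 == job2));

        Class<?> job3 = loadForJob(false);
        Class<?> job4 = loadForJob(false);
        System.out.println("shared parent shares the class: " + (job3 == job4));
        System.out.println("registrations so far: " + REGISTRATIONS.get());
    }
}
```

In the isolated case each "job" gets its own copy of the class and reruns the static initializer against the same process-wide counter. With a shared parent the class is initialized only once, however many jobs ask for it. Whether this is exactly what happens inside TensorFlow's device factory registration was not confirmed in this thread.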

jason-dai commented 3 years ago

Do we need to update the documentation?

qiyuangong commented 3 years ago

Do we need to update the documentation?

Yes. Will add it to the Cluster Serving troubleshooting docs.