Summary of the problems encountered when using analytics-zoo in TensorFlow-based Spark-on-YARN mode

Wercurial commented 4 years ago

0. By the time I solved this problem, the latest code on GitHub appeared to have already fixed it, but the problem still exists in the version installed through pip.

1. Use case: the examples under https://github.com/intel-analytics/analytics-zoo/tree/master/pyzoo/zoo/examples/tensorflow/tfpark, specifically the demo "Run the TFEstimator example after pip install".

2. Python package versions (OS: CentOS 7.x, Spark: 2.4.3, Hadoop: 2.7.7)

absl-py==0.9.0
analytics-zoo==0.8.1
astor==0.8.1
BigDL==0.10.0
certifi==2020.6.20
conda-pack==0.3.1
gast==0.2.2
google-pasta==0.2.0
grpcio==1.31.0
h5py==2.10.0
importlib-metadata==1.7.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.2
Markdown==3.2.2
numpy==1.19.1
opt-einsum==3.3.0
protobuf==3.13.0
py4j==0.10.7
pyspark==2.4.3
six==1.15.0
tensorboard==1.15.0
tensorflow==1.15.0
tensorflow-estimator==1.15.1
termcolor==1.1.0
Werkzeug==1.0.1
wrapt==1.12.1
zipp==3.1.0

3. The bug is as follows:

: java.lang.VerifyError: (class: org/tensorflow/util/Event$Builder, method: getTaggedRunMetadataFieldBuilder signature: ()Lcom/intel/analytics/bigdl/shaded/protobuf/SingleFieldBuilderV3;) Incompatible argument to function
        at org.tensorflow.util.Event.toBuilder(Event.java:694)
        at org.tensorflow.util.Event.newBuilder(Event.java:688)
        at com.intel.analytics.bigdl.visualization.tensorboard.EventWriter.<init>(EventWriter.scala:40)
        at com.intel.analytics.bigdl.visualization.tensorboard.FileWriter.<init>(FileWriter.scala:39)

4. Problem analysis

A "java.lang.VerifyError: Incompatible argument to function" error is usually caused by a conflict between two copies of the same class. Locating the org.tensorflow.util.Event class shows that it exists in two packages (jars). When the pyspark task is submitted, the order in which these two jars are placed on the classpath causes the conflict.
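One way to confirm this kind of duplication is to scan the candidate jars for the class file. The following is only a minimal sketch: the helper name jars_containing_class is hypothetical (it is not part of analytics-zoo or BigDL), and the jar paths you pass in depend on where pip installed the packages on your machine.

```python
import zipfile


def jars_containing_class(jar_paths, class_name):
    """Return the jars, in the order given, that contain class_name.

    class_name is a fully qualified Java class name, for example
    "org.tensorflow.util.Event". A jar is a zip archive, so the class
    is stored as the entry "org/tensorflow/util/Event.class".
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as zf:
            if entry in zf.namelist():
                hits.append(str(jar))
    return hits
```

If more than one jar is returned, the class is duplicated, and which copy gets loaded depends on the classpath order.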

5. Solution: change the order in which the jars are submitted, in the code under def init_spark_on_yarn() in the Python package (https://github.com/intel-analytics/analytics-zoo/blob/master/pyzoo/zoo/util/spark.py):

return command + " --driver-class-path {}:{}".\
                format(self._get_bigdl_classpath_jar_name_on_driver()[0], self._get_zoo_classpath_jar_name_on_driver()[0])
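To make the effect of the ordering explicit, here is a standalone sketch of the same string construction. The function name append_driver_class_path and the jar file names in the usage note are hypothetical; only the " --driver-class-path {}:{}" format is taken from the return statement above. The JVM searches classpath entries from left to right, so when a class exists in several jars, the copy in the earliest entry is the one that gets loaded.

```python
def append_driver_class_path(command, jars):
    """Append a --driver-class-path option to a spark-submit command string.

    jars is an ordered list of jar paths. The order is preserved, because
    the first jar that contains a given class is the one the JVM loads it from.
    ":" is the classpath separator on Linux, as in the original snippet.
    """
    return command + " --driver-class-path {}".format(":".join(jars))
```

For example, append_driver_class_path("spark-submit", ["bigdl.jar", "zoo.jar"]) places bigdl.jar ahead of zoo.jar, matching the order used in the return statement above.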

hkvision commented 4 years ago

This issue is related to https://github.com/intel-analytics/analytics-zoo-internal/issues/696 for the order of BigDL and Analytics Zoo jars. We are working on it. Thanks so much for your information! @Wercurial

yangw1234 commented 4 years ago

This has been fixed. Close it now.

intel-analytics / analytics-zoo, issue #669