I can successfully run /orca/learn/tf2/yolov3/yoloV3.py in local mode and client mode with the latest hyperzoo image, but I could not run it in cluster mode.
The error is:
Traceback (most recent call last):
File "/opt/analytics-zoo-examples/python/orca/learn/tf2/yolov3/yoloV3.py", line 682, in
main()
File "/opt/analytics-zoo-examples/python/orca/learn/tf2/yolov3/yoloV3.py", line 653, in main
object_store_memory=options.object_store_memory)
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/common.py", line 265, in init_orca_context
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 546, in init
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 574, in _start_cluster
File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(18, 2) finished unsuccessfully.
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 566, in _get_port
with open(path) as f:
FileNotFoundError: [Errno 2] No such file or directory: './gateway_port'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
process()
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 400, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 294, in _start_ray_services
File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 207, in _start_ray_node
JVMGuard.register_pgid(process_info.pgid)
File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 51, in register_pgid
raise err
File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 46, in register_pgid
pgid)
File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/common/utils.py", line 150, in callZooFunc
gateway = _get_gateway()
File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 579, in _get_gateway
gateway_port = _get_port()
File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 573, in _get_port
" executor side." % e.filename)
RuntimeError: Could not open the file ./gateway_port, which contains the listening port of local Java Gateway, please make sure the init_executor_gateway() function is called before any call of java function on the executor side.
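For context, the failure happens while init_orca_context() starts Ray on the Spark executors: each executor-side Python worker then calls back into the JVM (callZooFunc -> _get_gateway -> _get_port) and cannot find ./gateway_port. Below is a minimal sketch of the call that triggers this, assuming the standard Orca API; the exact arguments and cluster_mode value used in yoloV3.py may differ.

# Hedged sketch, not the exact yoloV3.py code. init_orca_context() launches Ray on
# the Spark executors; the executor-side workers then need the local Java gateway
# port file ('./gateway_port'), which is what is missing in cluster mode.
from zoo.orca import init_orca_context, stop_orca_context

sc = init_orca_context(
    cluster_mode="spark-submit",     # assumption: mode used when launching via spark-submit
    object_store_memory="8g")        # forwarded to RayContext, as seen in the traceback ("8g" is a placeholder)

# ... Orca TF2 Estimator training would follow here ...
stop_orca_context()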
This is how I submit the job:
${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
  --master k8s://https://127.0.0.1:8443 \
  --deploy-mode cluster \
  --name tf2 \
  --conf spark.kubernetes.container.image="10.239.45.10/arda/hyper-zoo:tf2" \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
  --conf spark.executor.instances=3 \
  --conf spark.driver.host="XX" \
  --conf spark.driver.port="XX" \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --executor-memory 50g \
  --driver-memory 50g \
  --executor-cores 8 \
  --num-executors 3 \
  --total-executor-cores 24 \
  file:///opt/analytics-zoo-examples/python/orca/learn/tf2/yolov3/yoloV3.py \
  --data_dir /zoo/data/voc2009_raw \
  --weights /zoo/data/yolov3.weights \
  --class_num 20 \
  --names /zoo/data/voc2012.names
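The arguments after yoloV3.py are consumed by the script itself; presumably they are parsed along these lines (a sketch assuming optparse, since the traceback references options.object_store_memory; the actual option definitions in yoloV3.py may differ).

# Hedged sketch of the option parsing assumed from the submit command and traceback;
# option names mirror the flags passed above, help strings are placeholders.
from optparse import OptionParser

parser = OptionParser()
parser.add_option("--data_dir", dest="data_dir", help="path to the VOC dataset")
parser.add_option("--weights", dest="weights", help="path to yolov3.weights")
parser.add_option("--class_num", dest="class_num", type="int", help="number of classes")
parser.add_option("--names", dest="names", help="path to the class-names file")
parser.add_option("--object_store_memory", dest="object_store_memory",
                  help="Ray object store memory passed to init_orca_context")
(options, args) = parser.parse_args()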