intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

RuntimeError: Could not open the file ./gateway_port, #117

Closed Adria777 closed 3 years ago

Adria777 commented 3 years ago

I can successfully run /orca/learn/tf2/yolov3/yoloV3.py in local mode and client mode with the latest hyperzoo image, but it fails in cluster mode. The error is:

Traceback (most recent call last):
  File "/opt/analytics-zoo-examples/python/orca/learn/tf2/yolov3/yoloV3.py", line 682, in <module>
    main()
  File "/opt/analytics-zoo-examples/python/orca/learn/tf2/yolov3/yoloV3.py", line 653, in main
    object_store_memory=options.object_store_memory)
  File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/orca/common.py", line 265, in init_orca_context
  File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 546, in __init__
  File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 574, in _start_cluster
  File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(18, 2) finished unsuccessfully.
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 566, in _get_port
    with open(path) as f:
FileNotFoundError: [Errno 2] No such file or directory: './gateway_port'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 400, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 294, in _start_ray_services
  File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 207, in _start_ray_node
    JVMGuard.register_pgid(process_info.pgid)
  File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 51, in register_pgid
    raise err
  File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/ray/raycontext.py", line 46, in register_pgid
    pgid)
  File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/zoo/common/utils.py", line 150, in callZooFunc
    gateway = _get_gateway()
  File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 579, in _get_gateway
    gateway_port = _get_port()
  File "./analytics-zoo-bigdl_0.13.0-spark_2.4.6-0.12.0-SNAPSHOT-python-api.zip/bigdl/util/common.py", line 573, in _get_port
    " executor side." % e.filename)
RuntimeError: Could not open the file ./gateway_port, which contains the listening port of local Java Gateway, please make sure the init_executor_gateway() function is called before any call of java function on the executor side.

This is how I submit the job:

${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
  --master k8s://https://127.0.0.1:8443 \
  --deploy-mode cluster \
  --name tf2 \
  --conf spark.kubernetes.container.image="10.239.45.10/arda/hyper-zoo:tf2" \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.options.claimName="nfsvolumeclaim" \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.nfsvolumeclaim.mount.path="/zoo" \
  --conf spark.executor.instances=3 \
  --conf spark.driver.host="XX" \
  --conf spark.driver.port="XX" \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --executor-memory 50g \
  --driver-memory 50g \
  --executor-cores 8 \
  --num-executors 3 \
  --total-executor-cores 24 \
  file:///opt/analytics-zoo-examples/python/orca/learn/tf2/yolov3/yoloV3.py \
  --data_dir /zoo/data/voc2009_raw \
  --weights /zoo/data/yolov3.weights \
  --class_num 20 \
  --names /zoo/data/voc2012.names

Adria777 commented 3 years ago

Cannot be reproduced.