sequenceiq / docker-spark

Apache License 2.0

Failure for TFSparkNode.mgr is NULL #70

cancoo closed this issue 7 years ago

cancoo commented 7 years ago

Hello team,

I am running TFoS on a Hadoop cluster. Everything goes well until the training step, where the job hangs and then fails with the error below. Could you take a look? Thanks.

File "/root/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1506665300625_0002/container_1506665300625_0002_01_000008/tfspark.zip/tensorflowonspark/TFSparkNode.py", line 66, in _get_manager
    logging.info("Connected to TFSparkNode.mgr on {0}, ppid={1}, state={2}".format(host, ppid, str(TFSparkNode.mgr.get('state'))))
AttributeError: 'NoneType' object has no attribute 'get'

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
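
For anyone reading the traceback: the AttributeError means TFSparkNode.mgr was still None when the logging line called .get('state') on it. Below is a minimal sketch of that failure mode and of a guard that reports it more clearly. It is not the TensorFlowOnSpark code: the Node class and the describe_manager function are hypothetical stand-ins, and a plain dict stands in for the manager object.

```python
class Node:
    """Hypothetical stand-in for a class that holds a shared manager handle."""
    mgr = None  # never assigned, which is the state the traceback implies


def describe_manager(host, ppid):
    """Build the log message, failing clearly if the handle is unset."""
    if Node.mgr is None:
        # Without this check, Node.mgr.get('state') raises
        # AttributeError: 'NoneType' object has no attribute 'get'
        raise RuntimeError(
            "manager handle is not set on {0} (ppid={1})".format(host, ppid)
        )
    return "Connected to mgr on {0}, ppid={1}, state={2}".format(
        host, ppid, str(Node.mgr.get('state'))
    )
```

A guard like this only makes the symptom easier to read. It does not explain why the handle was never set on that executor, which is the actual question in this issue.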
