ERROR : sh: 1: hadoop: not found
You need to configure the PATH environment variable so that it includes the directory containing the hadoop script, e.g. PATH=$PATH:/opt/hadoop/bin
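For reference, here is a minimal, hypothetical Python sketch (not tensorspark code) of how you could verify that the hadoop CLI is reachable from a Spark Python worker; the "sh: 1: hadoop: not found" message suggests the worker shells out to hadoop but the executor's PATH does not contain its bin directory. The function name and the /opt/hadoop/bin default below are examples only.

```python
# Hypothetical sketch (not tensorspark code): check that the `hadoop` CLI is
# reachable from the Python worker before shelling out to it. The error
# "sh: 1: hadoop: not found" means the executor's PATH lacks the hadoop bin dir.
import os
from distutils.spawn import find_executable  # available on Python 2.7 and 3.x

def ensure_hadoop_on_path(candidate_dir='/opt/hadoop/bin'):  # example path only
    """Return the hadoop binary path, extending PATH with candidate_dir if needed."""
    hadoop = find_executable('hadoop')
    if hadoop is None:
        # Append a candidate directory and look again.
        os.environ['PATH'] = os.environ.get('PATH', '') + os.pathsep + candidate_dir
        hadoop = find_executable('hadoop')
    if hadoop is None:
        raise RuntimeError('hadoop CLI not found; add its bin directory to PATH '
                           'on every Spark worker, not only on the driver')
    return hadoop
```

Note that the PATH change has to be visible to the executor processes on every worker node (for example via the shell profile or conf/spark-env.sh used to launch them), not just in the shell where spark-submit runs.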
Thanks, jiadexin. Now it no longer complains that hadoop does not exist, but I still get the error "IOError: File /tmp/session_mnist_try_1476098552130.meta does not exist".
I checked the folder and the file does exist. Does anyone know the reason?
Thanks. ------------- Msg ------------------ (tensorflow2.7) etri@n1:~/git/tensoronspark$ /usr/local/spark-2.0.0-bin-hadoop2.7/bin/spark-submit --master spark://masterIP:7077 tensorspark/example/run.py 2 I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally 16/10/10 20:22:28 INFO spark.SparkContext: Running Spark version 2.0.0 16/10/10 20:22:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/10/10 20:22:28 INFO spark.SecurityManager: Changing view acls to: etri 16/10/10 20:22:28 INFO spark.SecurityManager: Changing modify acls to: etri 16/10/10 20:22:28 INFO spark.SecurityManager: Changing view acls groups to: 16/10/10 20:22:28 INFO spark.SecurityManager: Changing modify acls groups to: 16/10/10 20:22:28 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(etri); groups with view permissions: Set(); users with modify permissions: Set(etri); groups with modify permissions: Set() 16/10/10 20:22:28 INFO util.Utils: Successfully started service 'sparkDriver' on port 33150. 16/10/10 20:22:28 INFO spark.SparkEnv: Registering MapOutputTracker 16/10/10 20:22:28 INFO spark.SparkEnv: Registering BlockManagerMaster 16/10/10 20:22:28 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-d50ca642-a778-4b4a-a12a-d006e89f2351 16/10/10 20:22:28 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB 16/10/10 20:22:28 INFO spark.SparkEnv: Registering OutputCommitCoordinator 16/10/10 20:22:29 INFO util.log: Logging initialized @2452ms 16/10/10 20:22:29 INFO server.Server: jetty-9.2.z-SNAPSHOT 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4b6b0f0c{/jobs,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@19149ab6{/jobs/json,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@58f3df8d{/jobs/job,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2a357214{/jobs/job/json,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@39b50b31{/stages,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4c9f79b{/stages/json,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@42004f38{/stages/stage,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@432420be{/stages/stage/json,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@52d94768{/stages/pool,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@19e71787{/stages/pool/json,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@595ee20c{/storage,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@3b1d0849{/storage/json,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2367fa45{/storage/rdd,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@21d8a81{/storage/rdd/json,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2c7d464c{/environment,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7f715cca{/environment/json,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@63d834b3{/executors,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@d7ff3cb{/executors/json,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4bf68b0e{/executors/threadDump,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3e9419b6{/executors/threadDump/json,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1c4bdf7b{/static,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6ed89692{/,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1189cebc{/api,null,AVAILABLE} 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@fce2c92{/stages/stage/kill,null,AVAILABLE} 16/10/10 20:22:29 INFO server.ServerConnector: Started ServerConnector@36d685d7{HTTP/1.1}{0.0.0.0:4040} 16/10/10 20:22:29 INFO server.Server: Started @2575ms 16/10/10 20:22:29 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 16/10/10 20:22:29 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://129.254.164.75:4040 16/10/10 20:22:29 INFO util.Utils: Copying /home/etri/git/tensoronspark/tensorspark/example/run.py to /tmp/spark-dc93b935-2321-437b-b619-d39b8f899409/userFiles-30337c41-e059-4bff-902a-9af7e2b466b3/run.py 16/10/10 20:22:29 INFO spark.SparkContext: Added file file:/home/etri/git/tensoronspark/tensorspark/example/run.py at spark://129.254.164.75:33150/files/run.py with timestamp 1476098549242 16/10/10 20:22:29 INFO client.StandaloneAppClient$ClientEndpoint: Connecting to master spark://129.254.164.75:7077... 16/10/10 20:22:29 INFO client.TransportClientFactory: Successfully created connection to /129.254.164.75:7077 after 35 ms (0 ms spent in bootstraps) 16/10/10 20:22:29 INFO cluster.StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20161010202229-0019 16/10/10 20:22:29 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20161010202229-0019/0 on worker-20161005204757-129.254.164.76-44827 (129.254.164.76:44827) with 32 cores 16/10/10 20:22:29 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20161010202229-0019/0 on hostPort 129.254.164.76:44827 with 32 cores, 1024.0 MB RAM 16/10/10 20:22:29 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20161010202229-0019/1 on worker-20161005204757-129.254.164.77-57890 (129.254.164.77:57890) with 32 cores 16/10/10 20:22:29 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20161010202229-0019/1 on hostPort 129.254.164.77:57890 with 32 cores, 1024.0 MB RAM 16/10/10 20:22:29 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 58751. 
16/10/10 20:22:29 INFO netty.NettyBlockTransferService: Server created on 129.254.164.75:58751 16/10/10 20:22:29 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 129.254.164.75, 58751) 16/10/10 20:22:29 INFO storage.BlockManagerMasterEndpoint: Registering block manager 129.254.164.75:58751 with 366.3 MB RAM, BlockManagerId(driver, 129.254.164.75, 58751) 16/10/10 20:22:29 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 129.254.164.75, 58751) 16/10/10 20:22:29 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20161010202229-0019/0 is now RUNNING 16/10/10 20:22:29 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20161010202229-0019/1 is now RUNNING 16/10/10 20:22:29 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@354c4459{/metrics/json,null,AVAILABLE} 16/10/10 20:22:29 INFO cluster.StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 16/10/10 20:22:30 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 260.5 KB, free 366.0 MB) 16/10/10 20:22:30 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.8 KB, free 366.0 MB) 16/10/10 20:22:30 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 129.254.164.75:58751 (size: 22.8 KB, free: 366.3 MB) 16/10/10 20:22:30 INFO spark.SparkContext: Created broadcast 0 from binaryFiles at NativeMethodAccessorImpl.java:-2 16/10/10 20:22:30 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 260.2 KB, free 365.8 MB) 16/10/10 20:22:30 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 22.8 KB, free 365.7 MB) 16/10/10 20:22:30 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 129.254.164.75:58751 (size: 22.8 KB, free: 366.3 MB) 16/10/10 20:22:30 INFO spark.SparkContext: Created broadcast 1 from binaryFiles at NativeMethodAccessorImpl.java:-2 16/10/10 20:22:30 INFO input.FileInputFormat: Total input paths to process : 1 16/10/10 20:22:30 INFO input.FileInputFormat: Total input paths to process : 1 16/10/10 20:22:30 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 3, size left: 0 16/10/10 20:22:30 INFO input.FileInputFormat: Total input paths to process : 1 16/10/10 20:22:30 INFO input.FileInputFormat: Total input paths to process : 1 16/10/10 20:22:30 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 3, size left: 0 16/10/10 20:22:31 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (129.254.164.77:47134) with ID 1 16/10/10 20:22:31 INFO storage.BlockManagerMasterEndpoint: Registering block manager 129.254.164.77:42622 with 366.3 MB RAM, BlockManagerId(1, 129.254.164.77, 42622) 16/10/10 20:22:31 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (129.254.164.76:47792) with ID 0 16/10/10 20:22:31 INFO storage.BlockManagerMasterEndpoint: Registering block manager 129.254.164.76:37675 with 366.3 MB RAM, BlockManagerId(0, 129.254.164.76, 37675) I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate (GHz) 1.076 pciBusID 0000:82:00.0 Total memory: 12.00GiB Free memory: 11.87GiB I 
tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:82:00.0) 16/10/10 20:22:32 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 296.0 B, free 365.7 MB) 16/10/10 20:22:32 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 638.0 B, free 365.7 MB) 16/10/10 20:22:32 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 129.254.164.75:58751 (size: 638.0 B, free: 366.3 MB) 16/10/10 20:22:32 INFO spark.SparkContext: Created broadcast 2 from broadcast at PythonRDD.scala:482 16/10/10 20:22:32 INFO spark.SparkContext: Starting job: count at build/bdist.linux-x86_64/egg/tensorspark/core/spark_session.py:181 16/10/10 20:22:32 INFO scheduler.DAGScheduler: Registering RDD 6 (join at build/bdist.linux-x86_64/egg/tensorspark/example/spark_mnist.py:153) 16/10/10 20:22:32 INFO scheduler.DAGScheduler: Got job 0 (count at build/bdist.linux-x86_64/egg/tensorspark/core/spark_session.py:181) with 1 output partitions 16/10/10 20:22:32 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (count at build/bdist.linux-x86_64/egg/tensorspark/core/spark_session.py:181) 16/10/10 20:22:32 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0) 16/10/10 20:22:32 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0) 16/10/10 20:22:32 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[6] at join at build/bdist.linux-x86_64/egg/tensorspark/example/spark_mnist.py:153), which has no missing parents 16/10/10 20:22:32 INFO memory.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 14.0 KB, free 365.7 MB) 16/10/10 20:22:32 INFO memory.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 7.4 KB, free 365.7 MB) 16/10/10 20:22:32 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 129.254.164.75:58751 (size: 7.4 KB, free: 366.2 MB) 16/10/10 20:22:32 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1012 16/10/10 20:22:32 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (PairwiseRDD[6] at join at build/bdist.linux-x86_64/egg/tensorspark/example/spark_mnist.py:153) 16/10/10 20:22:32 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 16/10/10 20:22:32 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 129.254.164.77, partition 0, ANY, 5712 bytes) 16/10/10 20:22:32 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 129.254.164.76, partition 1, ANY, 5712 bytes) 16/10/10 20:22:32 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 0 on executor id: 1 hostname: 129.254.164.77. 16/10/10 20:22:32 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 1 on executor id: 0 hostname: 129.254.164.76. 
16/10/10 20:22:33 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 129.254.164.76:37675 (size: 7.4 KB, free: 366.3 MB) 16/10/10 20:22:33 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 129.254.164.77:42622 (size: 7.4 KB, free: 366.3 MB) 16/10/10 20:22:33 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 129.254.164.77:42622 (size: 22.8 KB, free: 366.3 MB) 16/10/10 20:22:33 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 129.254.164.76:37675 (size: 22.8 KB, free: 366.3 MB) 16/10/10 20:22:35 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 2627 ms on 129.254.164.76 (1/2) 16/10/10 20:22:37 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 5033 ms on 129.254.164.77 (2/2) 16/10/10 20:22:37 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 16/10/10 20:22:37 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (join at build/bdist.linux-x86_64/egg/tensorspark/example/spark_mnist.py:153) finished in 5.060 s 16/10/10 20:22:37 INFO scheduler.DAGScheduler: looking for newly runnable stages 16/10/10 20:22:37 INFO scheduler.DAGScheduler: running: Set() 16/10/10 20:22:37 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1) 16/10/10 20:22:37 INFO scheduler.DAGScheduler: failed: Set() 16/10/10 20:22:37 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (PythonRDD[10] at count at build/bdist.linux-x86_64/egg/tensorspark/core/spark_session.py:181), which has no missing parents 16/10/10 20:22:37 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 11.6 KB, free 365.7 MB) 16/10/10 20:22:37 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 6.4 KB, free 365.7 MB) 16/10/10 20:22:37 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 129.254.164.75:58751 (size: 6.4 KB, free: 366.2 MB) 16/10/10 20:22:37 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1012 16/10/10 20:22:37 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (PythonRDD[10] at count at build/bdist.linux-x86_64/egg/tensorspark/core/spark_session.py:181) 16/10/10 20:22:37 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks 16/10/10 20:22:37 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 129.254.164.77, partition 0, NODE_LOCAL, 5286 bytes) 16/10/10 20:22:37 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 2 on executor id: 1 hostname: 129.254.164.77. 
16/10/10 20:22:37 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 129.254.164.77:42622 (size: 6.4 KB, free: 366.3 MB) 16/10/10 20:22:37 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 129.254.164.77:47134 16/10/10 20:22:37 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 165 bytes 16/10/10 20:22:39 INFO storage.BlockManagerInfo: Added rdd_9_0 in memory on 129.254.164.77:42622 (size: 37.9 MB, free: 328.4 MB) 16/10/10 20:22:39 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 129.254.164.77:42622 (size: 638.0 B, free: 328.4 MB) 16/10/10 20:22:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, 129.254.164.77): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main process() File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func File "build/bdist.linux-x86_64/egg/tensorspark/core/spark_session.py", line 177, in _spark_run_fn File "build/bdist.linux-x86_64/egg/tensorspark/core/session_worker.py", line 34, in run self._run_fn(splitIndex, partition, self._param_bc.value) File "build/bdist.linux-x86_64/egg/tensorspark/core/session_worker.py", line 68, in _run_fn sutil.restore_session_hdfs(sess, user, session_path, session_meta_path, tmp_local_dir, host, port) File "build/bdist.linux-x86_64/egg/tensorspark/core/session_util.py", line 81, in restore_session_hdfs saver = tf.train.import_meta_graph(local_meta_path) File "/home/etri/anaconda3/envs/tensorflow2.7/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1458, in import_meta_graph return _import_meta_graph_def(read_meta_graph_file(meta_graph_or_file)) File "/home/etri/anaconda3/envs/tensorflow2.7/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1310, in read_meta_graph_file raise IOError("File %s does not exist." % filename) IOError: File /tmp/session_mnist_try_1476098552130.meta does not exist.
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/10 20:22:41 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 1.0 (TID 3, 129.254.164.77, partition 0, NODE_LOCAL, 5286 bytes)
16/10/10 20:22:41 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 3 on executor id: 1 hostname: 129.254.164.77.
^CTraceback (most recent call last):
File "/home/etri/git/tensoronspark/tensorspark/example/run.py", line 9, in
This may be similar to this issue: https://github.com/liangfengsid/tensoronspark/issues/2
As the file does exist, the likely problem is that the file is not yet ready at the moment TensorFlow tries to restore from it. For example, the expected file may still be under a temporary name right after the hdfs_util.get() function finishes (this is an assumption that has yet to be verified). I have added logic that checks, within a timeout (1 sec), whether the file exists before TensorFlow uses it. I have also fixed a bug that could raise an exception when deleting a replicated meta file with the same filename. Would you please try again with the latest commit?
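For illustration, here is a minimal sketch of that kind of check, assuming only the standard tf.train.import_meta_graph API; wait_for_file and restore_meta_graph are hypothetical names, and this is not the actual commit.

```python
# Minimal sketch of the described fix (not the actual tensorspark code):
# poll for the meta file for up to `timeout` seconds before handing it to
# TensorFlow, so a file that was just copied from HDFS is not missed.
import os
import time

import tensorflow as tf

def wait_for_file(path, timeout=1.0, poll_interval=0.05):
    """Return True once `path` exists, or False if `timeout` seconds elapse."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.isfile(path):
            return True
        time.sleep(poll_interval)
    return os.path.isfile(path)

def restore_meta_graph(local_meta_path):
    # Verify the file is visible before calling import_meta_graph, which
    # raises IOError immediately if the file is absent.
    if not wait_for_file(local_meta_path):
        raise IOError('File %s does not exist.' % local_meta_path)
    return tf.train.import_meta_graph(local_meta_path)
```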
I got an error when I ran the test in the pyspark shell: $ pyspark --deploy-mode=client --master=spark://129.254.164.75:7077
I also got the same error when I ran a Python script with the following command: $ spark-submit --master spark://masterIP:7077 tensorspark/example/run.py 2
The error message is as followed: I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate (GHz) 1.076 pciBusID 0000:82:00.0 Total memory: 12.00GiB Free memory: 11.87GiB I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:82:00.0) Worker 0 starts running sh: 1: hadoop: not found 16/10/05 22:44:59 ERROR executor.Executor: Exception in task 0.3 in stage 1.0 (TID 5) org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main process() File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 2371, in pipeline_func return func(split, prev_func(split, iterator)) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 2371, in pipeline_func return func(split, prev_func(split, iterator)) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 2371, in pipeline_func return func(split, prev_func(split, iterator)) File "tensorspark/core/spark_session.py", line 177, in _spark_run_fn File "build/bdist.linux-x86_64/egg/tensorspark/core/session_worker.py", line 34, in run self._run_fn(splitIndex, partition, self._param_bc.value) File "build/bdist.linux-x86_64/egg/tensorspark/core/session_worker.py", line 68, in _run_fn sutil.restore_session_hdfs(sess, user, session_path, session_meta_path, tmp_local_dir, host, port) File "build/bdist.linux-x86_64/egg/tensorspark/core/session_util.py", line 81, in restore_session_hdfs saver = tf.train.import_meta_graph(local_meta_path) File "/home/etri/anaconda3/envs/tensorflow2.7/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1458, in import_meta_graph return _import_meta_graph_def(read_meta_graph_file(meta_graph_or_file)) File "/home/etri/anaconda3/envs/tensorflow2.7/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1310, in read_meta_graph_file raise IOError("File %s does not exist." % filename) IOError: File /tmp/session_mnist_try_1475675085625.meta does not exist.
16/10/05 22:49:18 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
16/10/05 22:49:18 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
Does anyone have ideas on how to tackle this problem?
Thanks