nchammas / flintrock

A command-line tool for launching Apache Spark clusters.
Apache License 2.0
637 stars 116 forks

Fail to launch jobs on Amazon EC2 with master-IP #220

Closed xiandong79 closed 6 years ago

xiandong79 commented 6 years ago

I'm trying to run several spark-bench benchmarks on EC2 clusters launched by Flintrock.

In Spark's standalone mode, I have to configure the EC2 master machine's IP address as the Spark master URL.

The console output and .conf are below:

[ec2-user@ip-172-31-15-16 new-spark-bench_2.1.1]$ ./bin/spark-bench.sh examples/kmeans.conf
 *** SPARK-SUBMIT: [/home/ec2-user/spark/bin/spark-submit, --class, com.ibm.sparktc.sparkbench.cli.CLIKickoff, --master, spark://34.215.233.221:7077, /home/ec2-user/new-spark-bench_2.1.1/lib/spark-bench-2.1.1_0.2.2-RELEASE.jar, {"spark-bench":{"spark-submit-config":[{"spark-args":{"master":"spark://34.215.233.221:7077"},"workload-suites":[{"benchmark-output":"console","descr":"datagen kmeans","workloads":[{"cols":4,"name":"data-generation-kmeans","output":"file:///tmp/kmeans-data.csv","parititions":32,"rows":10}]},{"benchmark-output":"console","descr":"run kmeans","workloads":[{"input":"file:///tmp/KMeen.csv","name":"kmeans"}]}]}]}}]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/28 09:04:11 INFO SparkContext: Running Spark version 2.2.0
17/11/28 09:04:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/28 09:04:12 INFO SparkContext: Submitted application: com.ibm.sparktc.sparkbench.cli.CLIKickoff
17/11/28 09:04:12 INFO SecurityManager: Changing view acls to: ec2-user
17/11/28 09:04:12 INFO SecurityManager: Changing modify acls to: ec2-user
17/11/28 09:04:12 INFO SecurityManager: Changing view acls groups to:
17/11/28 09:04:12 INFO SecurityManager: Changing modify acls groups to:
17/11/28 09:04:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ec2-user); groups with view permissions: Set(); users  with modify permissions: Set(ec2-user); groups with modify permissions: Set()
17/11/28 09:04:12 INFO Utils: Successfully started service 'sparkDriver' on port 37999.
17/11/28 09:04:12 INFO SparkEnv: Registering MapOutputTracker
17/11/28 09:04:12 INFO SparkEnv: Registering BlockManagerMaster
17/11/28 09:04:12 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/28 09:04:12 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/28 09:04:12 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-f402d6cf-fed7-442c-abe5-a9ae2167051d
17/11/28 09:04:12 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/11/28 09:04:12 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/28 09:04:13 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/28 09:04:13 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.31.15.16:4040
17/11/28 09:04:13 INFO SparkContext: Added JAR file:/home/ec2-user/new-spark-bench_2.1.1/lib/spark-bench-2.1.1_0.2.2-RELEASE.jar at spark://172.31.15.16:37999/jars/spark-bench-2.1.1_0.2.2-RELEASE.jar with timestamp 1511859853321
17/11/28 09:04:13 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://34.215.233.221:7077...
17/11/28 09:04:33 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://34.215.233.221:7077...
17/11/28 09:04:53 INFO StandaloneAppClient$ClientEndpoint: Connecting to master 
xiandong79 commented 6 years ago

@ncherel

xiandong79 commented 6 years ago
  1. I use Flintrock (https://github.com/nchammas/flintrock) to launch a Spark cluster (1 master + 4 slaves).

Spark version 2.1.1, using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)

(I can also launch Spark version 2.2.0.)

  2. It works well in local mode:

    # For Scala and Java, use run-example:
    ./bin/run-example SparkPi

    # For Python examples, use spark-submit directly:
    ./bin/spark-submit examples/src/main/python/pi.py


  3. It does not work in `standalone` mode:

    ./bin/spark-submit --master spark://35.162.130.151:7077 examples/src/main/python/pi.py 100

xiandong79 commented 6 years ago
17/11/28 15:45:36 INFO SecurityManager: Changing view acls to: ec2-user
17/11/28 15:45:36 INFO SecurityManager: Changing modify acls to: ec2-user
17/11/28 15:45:36 INFO SecurityManager: Changing view acls groups to:
17/11/28 15:45:36 INFO SecurityManager: Changing modify acls groups to:
17/11/28 15:45:36 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ec2-user); groups with view permissions: Set(); users  with modify permissions: Set(ec2-user); groups with modify permissions: Set()
17/11/28 15:45:36 INFO Utils: Successfully started service 'sparkDriver' on port 32937.
17/11/28 15:45:36 INFO SparkEnv: Registering MapOutputTracker
17/11/28 15:45:36 INFO SparkEnv: Registering BlockManagerMaster
17/11/28 15:45:36 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/28 15:45:36 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/28 15:45:36 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b05bab88-e040-4614-9665-bbeeea5f5c94
17/11/28 15:45:36 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/11/28 15:45:37 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/28 15:45:37 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/28 15:45:37 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.31.6.28:4040
17/11/28 15:45:37 INFO SparkContext: Added file file:/home/ec2-user/spark/examples/src/main/python/pi.py at spark://172.31.6.28:32937/files/pi.py with timestamp 1511883937418
17/11/28 15:45:37 INFO Utils: Copying /home/ec2-user/spark/examples/src/main/python/pi.py to /tmp/spark-eb3b751f-7e90-49d2-b1be-ab5fa1fd4eb1/userFiles-e6575cb1-3e15-4f5b-b3e0-3d707eea6e9a/pi.py
17/11/28 15:45:37 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://35.162.130.151:7077...
17/11/28 15:45:37 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 35.162.130.151:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
    at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:106)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to /35.162.130.151:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    ... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /35.162.130.151:7077
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:640)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    ... 1 more
17/11/28 15:45:57 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://35.162.130.151:7077...
17/11/28 15:45:57 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 35.162.130.151:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
    at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:106)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to /35.162.130.151:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    ... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /35.162.130.151:7077
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:640)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    ... 1 more
17/11/28 15:46:17 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://35.162.130.151:7077...
17/11/28 15:46:17 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 35.162.130.151:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
    at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:106)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to /35.162.130.151:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    ... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /35.162.130.151:7077
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:640)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    ... 1 more
17/11/28 15:46:37 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
17/11/28 15:46:37 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.
17/11/28 15:46:37 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35355.
17/11/28 15:46:37 INFO NettyBlockTransferService: Server created on 172.31.6.28:35355
17/11/28 15:46:37 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/11/28 15:46:37 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 172.31.6.28, 35355, None)
17/11/28 15:46:37 INFO BlockManagerMasterEndpoint: Registering block manager 172.31.6.28:35355 with 366.3 MB RAM, BlockManagerId(driver, 172.31.6.28, 35355, None)
17/11/28 15:46:37 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 172.31.6.28, 35355, None)
17/11/28 15:46:37 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 172.31.6.28, 35355, None)
17/11/28 15:46:37 INFO SparkUI: Stopped Spark web UI at http://172.31.6.28:4040
17/11/28 15:46:37 INFO StandaloneSchedulerBackend: Shutting down all executors
17/11/28 15:46:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
17/11/28 15:46:37 WARN StandaloneAppClient$ClientEndpoint: Drop UnregisterApplication(null) because has not yet connected to master
17/11/28 15:46:37 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/11/28 15:46:37 INFO MemoryStore: MemoryStore cleared
17/11/28 15:46:37 INFO BlockManager: BlockManager stopped
17/11/28 15:46:37 INFO BlockManagerMaster: BlockManagerMaster stopped
17/11/28 15:46:37 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/11/28 15:46:37 INFO SparkContext: Successfully stopped SparkContext
17/11/28 15:46:37 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:524)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:236)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
17/11/28 15:46:37 INFO SparkContext: SparkContext already stopped.
Traceback (most recent call last):
  File "/home/ec2-user/spark/examples/src/main/python/pi.py", line 32, in <module>
    .appName("PythonPi")\
  File "/home/ec2-user/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 169, in getOrCreate
  File "/home/ec2-user/spark/python/lib/pyspark.zip/pyspark/context.py", line 310, in getOrCreate
  File "/home/ec2-user/spark/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__
  File "/home/ec2-user/spark/python/lib/pyspark.zip/pyspark/context.py", line 182, in _do_init
  File "/home/ec2-user/spark/python/lib/pyspark.zip/pyspark/context.py", line 249, in _initialize_context
  File "/home/ec2-user/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/home/ec2-user/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:524)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:236)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
nchammas commented 6 years ago

What does flintrock describe <cluster_name> show as the master address? If you use that instead of the IP address, does the Pi example work?

xiandong79 commented 6 years ago

flintrock --config us-east-m4-4.yaml describe us-east-m4-4

us-east-m4-4:
  state: running
  node-count: 5
  master: ec2-34-228-165-101.compute-1.amazonaws.com
  slaves:
    - ec2-52-204-167-201.compute-1.amazonaws.com
    - ec2-34-228-79-253.compute-1.amazonaws.com
    - ec2-34-207-145-126.compute-1.amazonaws.com
    - ec2-54-87-135-196.compute-1.amazonaws.com

When submitting jobs: ./bin/spark-submit --master spark://34.228.165.101:7077 examples/src/main/python/pi.py 100

17/11/29 06:11:00 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/29 06:11:00 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.31.27.84:4040
17/11/29 06:11:01 INFO SparkContext: Added file file:/home/ec2-user/spark/examples/src/main/python/pi.py at spark://172.31.27.84:42255/files/pi.py with timestamp 1511935861098
17/11/29 06:11:01 INFO Utils: Copying /home/ec2-user/spark/examples/src/main/python/pi.py to /tmp/spark-62afdf21-6993-4235-94e4-36930ed2938b/userFiles-1c2a6fb3-db51-47e0-8671-f9573c6a516d/pi.py
17/11/29 06:11:01 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://34.228.165.101:7077...
17/11/29 06:11:21 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://34.228.165.101:7077...
17/11/29 06:11:41 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://34.228.165.101:7077...

I see both a private IP and a public IP in the logs. Is something wrong here?

17/11/29 06:12:01 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
17/11/29 06:12:01 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.
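One quick diagnostic for the private-vs-public IP question is to test whether the master port is reachable at all from the driver machine. A minimal sketch (`port_open` is a hypothetical helper, not part of Spark or Flintrock; the address in the comment is the master's public IP from the logs above):

```python
import socket

def port_open(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From the driver machine, check the same address spark-submit is using:
# port_open("34.228.165.101", 7077)
# If this returns False while the master process is running, the security
# group or routing, rather than Spark itself, is blocking the connection.
```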
xiandong79 commented 6 years ago

./bin/spark-submit --master spark://ec2-34-228-165-101.compute-1.amazonaws.com:7077 examples/src/main/python/pi.py 100

17/11/29 06:16:00 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master ec2-34-228-165-101.compute-1.amazonaws.com:7077
nchammas commented 6 years ago

Are you running spark-submit from the Flintrock master? If not, can you try there?

xiandong79 commented 6 years ago

Definitely! I copied the SSH command from the Amazon console and SSH'd into the master.

There is a spark folder there, and I run spark-submit from the Flintrock master.

nchammas commented 6 years ago

I see both a private IP and a public IP in the logs. Is something wrong here?

Hmm, that could be a sign something is wrong. Does your VPC have an internet gateway attached?

xiandong79 commented 6 years ago
VPC ID
vpc-4652db3e

igw-029d0a7b | attached | vpc-4652db3e

I also downloaded flintrock.zip and launched a new cluster. Then I ran the following, and it still errors:

[ec2-user@ip-172-31-18-114 spark]$ ./bin/spark-submit --master spark://54.146.168.248:7077 examples/src/main/python/pi.py   10

The output of flintrock --config us-east-m4-4.yaml describe new-us-east is:

new-us-east:
  state: running
  node-count: 5
  master: ec2-54-146-168-248.compute-1.amazonaws.com
  slaves:
    - ec2-54-164-183-18.compute-1.amazonaws.com
    - ec2-34-228-161-228.compute-1.amazonaws.com
    - ec2-34-236-154-38.compute-1.amazonaws.com
    - ec2-52-72-179-230.compute-1.amazonaws.com
nchammas commented 6 years ago

I just launched a test cluster and ran the Pi example as follows:

./spark/bin/spark-submit ./spark/examples/src/main/python/pi.py

./spark/bin/spark-submit --master spark://ec2-54-152-31-224.compute-1.amazonaws.com:7077 ./spark/examples/src/main/python/pi.py 

Both invocations worked fine for me and returned "Pi is roughly...".

By the way, you shouldn't need to specify --master because the master is already specified in conf/spark-env.sh, but it should work either way. So I'm baffled as to why you are seeing issues.

  1. Have you tried calling spark-submit without --master?
  2. When you do specify --master, can you confirm that you're specifying the exact same address as what's in conf/spark-env.sh?
  3. Can you share the full flintrock launch statement you're using? Are there any custom scripts you're running that might affect networking?
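For reference, on a healthy cluster the master address mentioned above is set in spark/conf/spark-env.sh. A hypothetical sketch of the relevant line (exact variable names vary by Spark version; older releases used SPARK_MASTER_IP):

```shell
# Hypothetical excerpt of spark/conf/spark-env.sh as generated at cluster
# launch; the hostname is the test cluster's master from the example above.
export SPARK_MASTER_HOST="ec2-54-152-31-224.compute-1.amazonaws.com"
```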
xiandong79 commented 6 years ago

worked fine for me and returned "Pi is roughly...".

# For Scala and Java, use run-example:
[ec2-user@ip-172-31-18-114 spark]$ ./bin/run-example SparkPi

# For Python examples, use spark-submit directly:
[ec2-user@ip-172-31-18-114 spark]$ ./bin/spark-submit examples/src/main/python/pi.py

But the above commands only use the private IP:

17/11/30 04:28:41 INFO TransportClientFactory: Successfully created connection to /172.31.18.114:44711 after 33 ms (0 ms spent in bootstraps)

The problem happens when I specify --master. I want to run jobs in standalone mode with 4 slaves.

$ flintrock --config us-east-m4-4.yaml start us-east-m4-4

The us-east-m4-4.yaml is:

provider: ec2

services:
  spark:
    version: 2.1.1

launch:
  num-slaves: 4

providers:
  ec2:
    key-name: Virginia-us-east-1
    identity-file: /Users/dong/Virginia-us-east-1.pem
    instance-type: m4.large
    region: us-east-1
    ami: ami-a4c7edb2
    user: ec2-user
xiandong79 commented 6 years ago

I do not have a conf/spark-env.sh file, only the template files:

[ec2-user@ip-172-31-18-114 conf]$ ls
docker.properties.template  log4j.properties.template    slaves.template               spark-env.sh.template
fairscheduler.xml.template  metrics.properties.template  spark-defaults.conf.template
nchammas commented 6 years ago

Hmm, if the spark/conf directory on your Flintrock master doesn't have spark-env.sh or slaves, then something is going wrong during cluster launch.

  1. Do you see any errors when you run flintrock launch ...?
  2. Are you running the latest release of Flintrock, installed via pip?
xiandong79 commented 6 years ago

Yes! I noticed a warning in the terminal that flintrock-manifest.json was not found, but since the cluster was running, I assumed it was fine.

I tried both: (1) pip3 install flintrock, and (2) downloading the zip folder.

$ ./flintrock --config us-east-m4-4.yaml launch us-east-m4-4
Launching 5 instances...
[54.175.255.28] SSH online.
[54.175.255.28] Configuring ephemeral storage...
[54.175.255.28] Installing Java 1.8...
[54.173.186.134] SSH online.
[174.129.117.235] SSH online.
[54.198.144.167] SSH online.
[54.173.186.134] Configuring ephemeral storage...
[174.129.117.235] Configuring ephemeral storage...
[54.173.186.134] Installing Java 1.8...
[54.198.144.167] Configuring ephemeral storage...
[174.129.117.235] Installing Java 1.8...
[52.54.108.59] SSH online.
[54.198.144.167] Installing Java 1.8...
[52.54.108.59] Configuring ephemeral storage...
[52.54.108.59] Installing Java 1.8...
[54.198.144.167] Installing Spark...
[54.175.255.28] Installing Spark...
[174.129.117.235] Installing Spark...
[52.54.108.59] Installing Spark...
[54.173.186.134] Installing Spark...
Do you want to terminate the 5 instances created by this operation? [Y/n]: n
Failed to execute script standalone
Traceback (most recent call last):
  File "standalone.py", line 11, in <module>
  File "flintrock/flintrock.py", line 1132, in main
  File "click/core.py", line 722, in __call__
  File "click/core.py", line 697, in main
  File "click/core.py", line 1066, in invoke
  File "click/core.py", line 895, in invoke
  File "click/core.py", line 535, in invoke
  File "click/decorators.py", line 17, in new_func
  File "flintrock/flintrock.py", line 403, in launch
  File "flintrock/ec2.py", line 53, in wrapper
  File "flintrock/ec2.py", line 954, in launch
  File "flintrock/core.py", line 618, in provision_cluster
  File "flintrock/core.py", line 492, in run_against_hosts
  File "concurrent/futures/_base.py", line 405, in result
  File "concurrent/futures/_base.py", line 357, in __get_result
  File "concurrent/futures/thread.py", line 55, in run
  File "flintrock/core.py", line 678, in provision_node
  File "flintrock/services.py", line 359, in configure
  File "flintrock/core.py", line 448, in generate_template_mapping
AttributeError: 'NoneType' object has no attribute 'split'

Do you want to terminate the 5 instances created by this operation? [Y/n]: n

Should I choose "yes"?

I chose y, and I got the same AttributeError: 'NoneType' object has no attribute 'split'.

$ ./flintrock --config us-east-m4-4.yaml start test
Cluster is in state 'shutting-down'. Cannot execute start.
Failed to execute script standalone

Maybe the environment on my Mac is broken.

xiandong79 commented 6 years ago

I have uninstalled and reinstalled Flintrock.

flintrock --config us-east-m4-4.yaml launch test-us-east

Do you want to terminate the 5 instances created by this operation? [Y/n]: y
Terminating instances...
Traceback (most recent call last):
  File "/usr/local/bin/flintrock", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/flintrock/flintrock.py", line 1132, in main
    cli(obj={})
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/flintrock/flintrock.py", line 403, in launch
    tags=ec2_tags)
  File "/usr/local/lib/python3.6/site-packages/flintrock/ec2.py", line 53, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/flintrock/ec2.py", line 954, in launch
    identity_file=identity_file)
  File "/usr/local/lib/python3.6/site-packages/flintrock/core.py", line 618, in provision_cluster
    run_against_hosts(partial_func=partial_func, hosts=hosts)
  File "/usr/local/lib/python3.6/site-packages/flintrock/core.py", line 492, in run_against_hosts
    future.result()
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.6/site-packages/flintrock/core.py", line 678, in provision_node
    cluster=cluster)
  File "/usr/local/lib/python3.6/site-packages/flintrock/services.py", line 359, in configure
    spark_version=self.version or self.git_commit,
  File "/usr/local/lib/python3.6/site-packages/flintrock/core.py", line 448, in generate_template_mapping
    'hadoop_short_version': '.'.join(hadoop_version.split('.')[:2]),
AttributeError: 'NoneType' object has no attribute 'split'

This needs attention.
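The last two frames of the traceback point at the cause: with no Hadoop/HDFS version configured, `hadoop_version` is `None` when `generate_template_mapping` calls `.split('.')` on it. A minimal, hypothetical simplification of that expression with a defensive guard (not Flintrock's actual code):

```python
# Simplified, hypothetical version of the expression in flintrock/core.py's
# generate_template_mapping that crashes when no Hadoop version is set.
def hadoop_short_version(hadoop_version):
    """Return e.g. '2.7' from '2.7.4'; fail clearly when unset."""
    if hadoop_version is None:
        # Without this guard, .split() raises
        # AttributeError: 'NoneType' object has no attribute 'split'
        raise ValueError(
            "No Hadoop/HDFS version configured; set services.hdfs.version"
        )
    return '.'.join(hadoop_version.split('.')[:2])
```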

xiandong79 commented 6 years ago

Then I launched another cluster and still got

AttributeError: 'NoneType' object has no attribute 'split'.

Going into the master:

[ec2-user@ip-172-31-22-104 conf]$ ls
docker.properties.template  log4j.properties.template    slaves.template               spark-env.sh.template
fairscheduler.xml.template  metrics.properties.template  spark-defaults.conf.template

LOL! ENOUGH hah

nchammas commented 6 years ago

OK, that explains a lot. :) The launch has errors, so of course the resulting cluster doesn’t work as expected.

It looks like you don’t have a Hadoop version specified. Please specify one. You can take a look at the config template in this repo for suggestions.

For example:

services:
  spark:
    version: 2.2.0
  hdfs:
    version: 2.7.4

Flintrock provides a bunch of default values when you first call flintrock configure, but I guess you created your own config from scratch.

I will investigate why Flintrock isn’t providing a clean error message when the Hadoop version is not specified, because that should be happening.

nchammas commented 6 years ago

To be clear, if the launch has errors then do not try to use the cluster anyway. A failed launch means the cluster is likely in a broken state. The launch errors need to be debugged first and a new cluster launched before trying to do anything with the cluster.

I’m honestly surprised you didn’t mention the launch errors from the start. It would have saved us a lot of back and forth debugging the issue here.

xiandong79 commented 6 years ago

I copied it exactly from the sample config.yaml in your README.md.

The sample config.yaml also does not specify any Hadoop version or related information.

nchammas commented 6 years ago

You're right. I believe the README config example used to work fine, but #196 probably broke this. My apologies. I'll fix this. (The config template does specify the HDFS version, though.)

In any case, are you able to launch a cluster without errors now?

xiandong79 commented 6 years ago

Yes! I have collected the Spark job traces I need!

Thanks!!