tensorflow / ecosystem

Integration of TensorFlow with other open-source frameworks
Apache License 2.0

Spark TensorFlow Distributor workers not working #161

Open rayjinghaolei opened 4 years ago

rayjinghaolei commented 4 years ago

RayKo-MBP:spark RAY$ ./tests/integration/run.sh
Stopping spark_worker_2 ... done
Stopping spark_worker_1 ... done
Stopping spark_master_1 ... done
Removing spark_worker_2 ... done
Removing spark_worker_1 ... done
Removing spark_master_1 ... done
Removing network spark_default
Creating network "spark_default" with the default driver
Creating spark_master_1 ... done
WARNING: The "worker" service specifies a port on the host. If multiple containers for this service are created on a single host, the port will clash.
Creating spark_worker_1 ... done
Creating spark_worker_2 ... done
============================= test session starts ==============================
platform linux -- Python 3.7.5, pytest-5.4.3, py-1.8.1, pluggy-0.13.1
rootdir: /mnt/spark-tensorflow-distributor/tests/integration, inifile: pytest.ini
collected 17 items

tests/integration/test_mirrored_strategy_runner.py
No container found for worker_1
No container found for worker_2
no org.apache.spark.deploy.master.Master to stop
starting org.apache.spark.deploy.master.Master, logging to /usr/local/lib/python3.7/dist-packages/pyspark/logs/spark--org.apache.spark.deploy.master.Master-1-master.out
No container found for worker_1
No container found for worker_2
Starting worker 1
Starting worker 2
20/06/13 18:52:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/06/13 18:56:04 WARN DAGScheduler: Barrier stage in job 0 requires 1 slots, but only 0 are available. Will retry up to 40 more times
    [same warning repeated every ~15 seconds while the retry count counts down]
20/06/13 19:06:03 WARN DAGScheduler: Barrier stage in job 0 requires 1 slots, but only 0 are available. Will retry up to 0 more times
F.2020-06-13 19:06:06.589740: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-06-13 19:06:06.589812: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-06-13 19:06:06.589854: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (master): /proc/driver/nvidia/version does not exist
2020-06-13 19:06:06.591429: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-13 19:06:06.603496: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2916350000 Hz
2020-06-13 19:06:06.604624: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f0128000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-13 19:06:06.604804: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
............
No container found for worker_1
No container found for worker_2
stopping org.apache.spark.deploy.master.Master
starting org.apache.spark.deploy.master.Master, logging to /usr/local/lib/python3.7/dist-packages/pyspark/logs/spark--org.apache.spark.deploy.master.Master-1-master.out
No container found for worker_1
No container found for worker_2
Starting worker 1
Starting worker 2
.
No container found for worker_1
No container found for worker_2
stopping org.apache.spark.deploy.master.Master
starting org.apache.spark.deploy.master.Master, logging to /usr/local/lib/python3.7/dist-packages/pyspark/logs/spark--org.apache.spark.deploy.master.Master-1-master.out
No container found for worker_1
No container found for worker_2
Starting worker 1
Starting worker 2
20/06/13 19:09:21 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master master:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:303)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
    at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anon$1.run(StandaloneAppClient.scala:106)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to master/172.20.0.2:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
    ... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: master/172.20.0.2:7077
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
20/06/13 19:12:21 WARN DAGScheduler: Barrier stage in job 0 requires 1 slots, but only 0 are available. Will retry up to 40 more times
    [same warning repeated every ~15 seconds while the retry count counts down]
20/06/13 19:22:20 WARN DAGScheduler: Barrier stage in job 0 requires 1 slots, but only 0 are available. Will retry up to 0 more times
F
No container found for worker_1
No container found for worker_2
stopping org.apache.spark.deploy.master.Master
starting org.apache.spark.deploy.master.Master, logging to /usr/local/lib/python3.7/dist-packages/pyspark/logs/spark--org.apache.spark.deploy.master.Master-1-master.out
No container found for worker_1
No container found for worker_2
Starting worker 1
Starting worker 2
20/06/13 19:22:27 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master master:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
    [same stack trace as above: Caused by: java.io.IOException: Failed to connect to master/172.20.0.2:7077 ... Caused by: java.net.ConnectException: Connection refused]
20/06/13 19:25:27 WARN DAGScheduler: Barrier stage in job 0 requires 2 slots, but only 0 are available. Will retry up to 40 more times
    [same warning repeated every ~15 seconds while the retry count counts down]
20/06/13 19:34:41 WARN DAGScheduler: Barrier stage in job 0 requires 2 slots, but only 0 are available. Will retry up to 3 more times
^CERROR: Aborting.
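
For context on the repeated DAGScheduler warning: MirroredStrategyRunner in spark-tensorflow-distributor launches its training tasks with Spark's barrier execution mode, so the job cannot start until the cluster has registered the requested number of executor slots; if the workers never come up, the scheduler just retries until it gives up. A minimal PySpark sketch (purely illustrative, not code from this issue) that produces the same "requires N slots" warning when run against a standalone master with no live workers:

    # Minimal barrier-mode sketch; against a master with no registered workers
    # this prints the same "Barrier stage in job 0 requires N slots" warning.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("spark://master:7077").getOrCreate()
    sc = spark.sparkContext

    def task(iterator):
        from pyspark import BarrierTaskContext
        ctx = BarrierTaskContext.get()
        ctx.barrier()  # every task must reach this point before any proceeds
        yield ctx.partitionId()

    # barrier() turns this into a barrier stage that needs both slots at once.
    print(sc.parallelize(range(2), 2).barrier().mapPartitions(task).collect())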

mengxr commented 4 years ago

cc: @WeichenXu123 @sarthfrey

sarthfrey commented 4 years ago

Confirmed with @rayjinghaolei offline: his Docker engine does not have enough memory to run the tests. Perhaps we should add a warning about this to the test-running instructions.
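
One way to go beyond a written warning (just a sketch of the idea, not code from this repo; the MIN_MEMORY_BYTES threshold is a placeholder) would be a pre-flight check in the integration-test entry point that reads the Docker engine's memory limit and warns before any containers are created:

    # Hypothetical pre-flight check for run.sh / conftest.py: warn early if the
    # Docker engine has less memory than the integration tests need.
    import subprocess
    import sys

    MIN_MEMORY_BYTES = 4 * 1024 ** 3  # placeholder threshold, not from the repo

    def check_docker_memory():
        # `docker info` exposes the engine's total memory in bytes.
        out = subprocess.check_output(
            ["docker", "info", "--format", "{{.MemTotal}}"], text=True
        ).strip()
        mem_total = int(out)
        if mem_total < MIN_MEMORY_BYTES:
            print(
                f"WARNING: Docker engine reports only {mem_total / 1024 ** 3:.1f} GiB "
                "of memory; the Spark workers may never register and barrier stages "
                "will retry until they time out.",
                file=sys.stderr,
            )

    if __name__ == "__main__":
        check_docker_memory()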

sarthfrey commented 4 years ago

I will try to reproduce this and see if we can catch this failure mode and surface a clear error message for whoever is running the tests.
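
A possible shape for that check (purely illustrative; the validation helper and its wiring are assumptions, not the library's current behaviour) is to compare the requested slot count against what the cluster actually advertises and fail fast with an explicit message, instead of letting the barrier stage retry for many minutes:

    # Illustrative fail-fast check; cluster_slots() and validate_slots() are
    # assumed helpers, not part of spark-tensorflow-distributor today.
    from pyspark.sql import SparkSession

    def cluster_slots(spark):
        # Rough proxy for the number of task slots the scheduler can offer.
        return spark.sparkContext.defaultParallelism

    def validate_slots(spark, num_slots):
        available = cluster_slots(spark)
        if available < num_slots:
            raise RuntimeError(
                f"Training needs {num_slots} barrier task slots but the cluster "
                f"currently advertises only {available}. Check that the Spark "
                "workers registered with the master (e.g. enough Docker memory) "
                "before starting training."
            )

    # Usage sketch:
    # spark = SparkSession.builder.getOrCreate()
    # validate_slots(spark, num_slots=2)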