chiaral opened this issue 4 years ago
This cluster eventually died an hour after I reported this issue.
There is an idle-timeout on clusters. If there's no activity for a while (30 or 60 minutes?) then the cluster shuts itself down.
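(That timeout is configured on the gateway server rather than by users; for reference, a minimal sketch of the relevant dask-gateway-server setting, with the value here purely illustrative rather than our actual config:)

# in dask_gateway_config.py on the gateway server (illustrative value)
c.ClusterConfig.idle_timeout = 3600  # seconds with no active clients before the cluster shuts itself down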
If this happens again (your local client notebook dies, say), you should be able to reconnect to the cluster. This is one of the advantages of the dask-gateway setup.
>>> from dask_gateway import Gateway
>>> g = Gateway()
>>> g.list_clusters()
If there are any active clusters, that will return a list of IDs.
>>> g.list_clusters()
[ClusterReport<name=prod.c288c65c429049e788f41d8308823ca8, status=RUNNING>]
Which you can reconnect to
cluster = g.connect(g.list_clusters()[0].name)
cluster
See https://gateway.dask.org/usage.html#connect-to-an-existing-cluster for more.
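If you only want to shut a lingering cluster down, without attaching a full client to it, the gateway can also stop it by name (a minimal sketch; the loop below stops every cluster it finds, so adapt as needed):

from dask_gateway import Gateway

g = Gateway()
for report in g.list_clusters():      # each entry is a ClusterReport with .name and .status
    print(report.name, report.status)
    g.stop_cluster(report.name)       # shuts that cluster down server-side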
I actually tried to look for those instructions online and almost figured it out, but not quite!
It was most likely 60 min rather than 30 min. Good to know. I was afraid it would stay up the whole night!
May I suggest adding this set of instructions as a paragraph here: http://pangeo.io/cloud.html? I went back to it hoping to find some info, but there is none. I can try to open a PR but will have to wait until Friday.
I am in a similar situation. I did
from dask_gateway import Gateway
g = Gateway()
g.list_clusters()
which gives
[ClusterReport<...>]
cluster = g.connect(g.list_clusters()[0].name)
cluster.close()
but the cluster is still there
Maybe OK now? I don't see any dask- pods lying around.
I don't know if it's related, but there are some warnings in the logs like
Unable to attach or mount volumes: unmounted volumes=[dask-credentials pangeo-token-fvdq2], unattached volumes=[dask-credentials pangeo-token-fvdq2]: timed out waiting for the condition
Thank you. In fact I think it eventually died because it hadn't been used for 60 min. I am not sure what happened.
On another note: what are the memory/worker limits per user? I am using 23 workers (with 20 GB memory each) in one cluster, and was trying to spin up another cluster with 5-6 workers, but it's not happening. I assume I reached the limit?
thanks a lot, Tom!
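(For context, the second cluster was being requested with the usual dask-gateway options pattern, roughly as sketched below; worker_memory is the option that shows up later in this thread, any other option names depend on the deployment's options form:)

from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()     # deployment-specific options form
options.worker_memory = 20              # GB per worker, matching the clusters above
cluster = gateway.new_cluster(options)
cluster.scale(6)                        # ask for ~6 workers
client = cluster.get_client()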
More issues today:
I have been having issues with clusters today. I was kicked out of the platform about 10 minutes ago (I was running something, then got long error messages, and then suddenly I didn't have any server anymore). Now I am trying to restart things and I get this:
---------------------------------------------------------------------------
GatewayServerError Traceback (most recent call last)
<ipython-input-7-1fc5ea0ac38e> in <module>
3 options.worker_memory=20
4 # gateway = Gateway()
----> 5 cluster = gateway.new_cluster(options)
6 #cluster.adapt(minimum=1, maximum=20)
7 cluster.scale(30)
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, shutdown_on_close, **kwargs)
623 cluster_options=cluster_options,
624 shutdown_on_close=shutdown_on_close,
--> 625 **kwargs,
626 )
627
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in __init__(self, address, proxy_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
780 shutdown_on_close=shutdown_on_close,
781 asynchronous=asynchronous,
--> 782 loop=loop,
783 )
784
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _init_internal(self, address, proxy_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
872 self.status = "starting"
873 if not self.asynchronous:
--> 874 self.gateway.sync(self._start_internal)
875
876 @property
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
326 )
327 try:
--> 328 return future.result()
329 except BaseException:
330 future.cancel()
/srv/conda/envs/notebook/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()
/srv/conda/envs/notebook/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _start_internal(self)
886 self._start_task = asyncio.ensure_future(self._start_async())
887 try:
--> 888 await self._start_task
889 except BaseException:
890 # On exception, cleanup
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _start_async(self)
900 self.status = "starting"
901 self.name = await self.gateway._submit(
--> 902 cluster_options=self._cluster_options, **self._cluster_kwargs
903 )
904 # Connect to cluster
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _submit(self, cluster_options, **kwargs)
511 options = self._config_cluster_options()
512 options.update(kwargs)
--> 513 resp = await self._request("POST", url, json={"cluster_options": options})
514 data = await resp.json()
515 return data["name"]
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _request(self, method, url, json)
393 raise GatewayClusterError(msg)
394 elif resp.status == 500:
--> 395 raise GatewayServerError(msg)
396 else:
397 resp.raise_for_status()
GatewayServerError: 500 Internal Server Error
Server got itself in trouble
I have also been trying to kill this cluster ([ClusterReport<...>]) with:
g = Gateway()
g.list_clusters()
cluster = g.connect(g.list_clusters()[0].name)
cluster.close()
was trying to spin up another cluster with 5-6 workers, but it's not happening. I assume I reached the limit?
I don't believe we have any per-user limits in place right now.
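(If we ever did want per-user limits, dask-gateway supports them in the server config; a hedged sketch with purely illustrative values, not our current settings:)

# in dask_gateway_config.py on the gateway server (illustrative values)
c.UserLimits.max_clusters = 2         # concurrent clusters per user
c.UserLimits.max_cores = 100          # total cores per user
c.UserLimits.max_memory = "2 T"       # total memory per user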
In the gateway logs, I see lines like
2020-07-30T15:25:08.528370078Z [E 2020-07-30 15:25:08.527 DaskGateway] Error in cluster informer, retrying...
2020-07-30T15:25:08.528390145Z Traceback (most recent call last):
2020-07-30T15:25:08.528397164Z File "/opt/conda/lib/python3.7/site-packages/aiohttp/connector.py", line 936, in _wrap_create_connection
2020-07-30T15:25:08.528403279Z return await self._loop.create_connection(*args, **kwargs) # type: ignore # noqa
2020-07-30T15:25:08.528409225Z File "/opt/conda/lib/python3.7/asyncio/base_events.py", line 958, in create_connection
2020-07-30T15:25:08.528414896Z raise exceptions[0]
2020-07-30T15:25:08.528420381Z File "/opt/conda/lib/python3.7/asyncio/base_events.py", line 945, in create_connection
2020-07-30T15:25:08.528425758Z await self.sock_connect(sock, address)
2020-07-30T15:25:08.528431227Z File "/opt/conda/lib/python3.7/asyncio/selector_events.py", line 473, in sock_connect
2020-07-30T15:25:08.528436630Z return await fut
2020-07-30T15:25:08.528441887Z File "/opt/conda/lib/python3.7/asyncio/selector_events.py", line 503, in _sock_connect_cb
2020-07-30T15:25:08.528447824Z raise OSError(err, f'Connect call failed {address}')
2020-07-30T15:25:08.528453650Z ConnectionRefusedError: [Errno 111] Connect call failed ('10.39.240.1', 443)
2020-07-30T15:25:08.528459503Z
2020-07-30T15:25:08.528537503Z The above exception was the direct cause of the following exception:
2020-07-30T15:25:08.528546944Z
2020-07-30T15:25:08.528552707Z Traceback (most recent call last):
2020-07-30T15:25:08.528558248Z File "/opt/conda/lib/python3.7/site-packages/dask_gateway_server/backends/kubernetes/utils.py", line 161, in run
2020-07-30T15:25:08.528564324Z initial = await method(**self.method_kwargs)
2020-07-30T15:25:08.528569923Z File "/opt/conda/lib/python3.7/site-packages/kubernetes_asyncio/client/api_client.py", line 166, in __call_api
2020-07-30T15:25:08.528576097Z _request_timeout=_request_timeout)
2020-07-30T15:25:08.528582057Z File "/opt/conda/lib/python3.7/site-packages/kubernetes_asyncio/client/rest.py", line 191, in GET
2020-07-30T15:25:08.528588265Z query_params=query_params))
2020-07-30T15:25:08.528594519Z File "/opt/conda/lib/python3.7/site-packages/kubernetes_asyncio/client/rest.py", line 171, in request
2020-07-30T15:25:08.528600777Z r = await self.pool_manager.request(**args)
2020-07-30T15:25:08.528606636Z File "/opt/conda/lib/python3.7/site-packages/aiohttp/client.py", line 483, in _request
2020-07-30T15:25:08.528612391Z timeout=real_timeout
2020-07-30T15:25:08.528617678Z File "/opt/conda/lib/python3.7/site-packages/aiohttp/connector.py", line 523, in connect
2020-07-30T15:25:08.528623568Z proto = await self._create_connection(req, traces, timeout)
2020-07-30T15:25:08.528629044Z File "/opt/conda/lib/python3.7/site-packages/aiohttp/connector.py", line 859, in _create_connection
2020-07-30T15:25:08.528665891Z req, traces, timeout)
2020-07-30T15:25:08.528679387Z File "/opt/conda/lib/python3.7/site-packages/aiohttp/connector.py", line 1004, in _create_direct_connection
2020-07-30T15:25:08.528686658Z raise last_exc
2020-07-30T15:25:08.528692368Z File "/opt/conda/lib/python3.7/site-packages/aiohttp/connector.py", line 986, in _create_direct_connection
2020-07-30T15:25:08.528698416Z req=req, client_error=client_error)
2020-07-30T15:25:08.528703944Z File "/opt/conda/lib/python3.7/site-packages/aiohttp/connector.py", line 943, in _wrap_create_connection
2020-07-30T15:25:08.528709756Z raise client_error(req.connection_key, exc) from exc
2020-07-30T15:25:08.528715432Z aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.39.240.1:443 ssl:default [Connect call failed ('10.39.240.1', 443)]
was trying to spin up another cluster with 5-6 workers, but it's not happening. I assume I reached the limit?
Right now I see that the kubernetes cluster is adding more machines for Dask workers, which usually takes a few minutes.
Actually, it's not adding more machines.
Message | Reason | First seen | Last seen | Count
-- | -- | -- | -- | --
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max node group size reached, 1 max cluster cpu limit reached | NotTriggerScaleUp | Jul 30, 2020, 10:27:47 AM | Jul 30, 2020, 11:08:51 AM | 149
0/28 nodes are available: 28 Insufficient memory, 5 Insufficient cpu. | FailedScheduling | Jul 30, 2020, 10:47:44 AM | Jul 30, 2020, 11:08:15 AM | 16
@jhamman do you recall the restrictions on which machine types the node autoscaler will use? It should be any n1- type, so 20Gb of memory / worker shouldn't be a problem?
@jhamman do you recall the restrictions on which machine types the node autoscaler will use? It should be any n1- type, so 20Gb of memory / worker shouldn't be a problem?
Right.
The thing I'm seeing above is:
1 max cluster cpu limit reached
OK, that's in the autoprovisioning.json file. We have that capped at 400. Are we OK increasing it? Say to 4000?
Sounds good to me, but @rabernat should weigh in.
Hello all, thanks for weighing in. I am fine with whatever limit is in place, I just would like to know which one it is, so I can plan things :) No need to increase anything yet!
I am not sure what happened, but something seemed off this morning. I got some weird errors I never got before.
I still have that hanging cluster (ClusterReport<name=prod.148abd2ad3dd4f409924265a8d254f08, status=RUNNING>) that I have been trying to kill for a while. I simply scaled it to 0 (and I could do that), but cluster.close() on it doesn't work. It will stop eventually when it hits the idle-timeout, and it is scaled to 0 now, but it's weird that I cannot close it.
Yeah, I think the internal dask-gateway failure at https://github.com/pangeo-data/pangeo-cloud-federation/issues/658#issuecomment-666494483 caused a cluster / some workers to be un-closeable through normal means.
Requesting a second cluster (and another user having a cluster at the same time) possibly made us hit the cluster-wide CPU limit.
How are things going here @chiaral?
I think there were a couple of issues. First, there were no clear instructions on how to close a running cluster that you aren't connected to.
May I suggest adding this set of instructions as a paragraph here: http://pangeo.io/cloud.html? I went back to it hoping to find some info, but there is none.
Done in https://github.com/pangeo-data/pangeo/pull/783
Then there was a secondary issue where dask-gateway itself got in a bad state and you couldn't even connect to running clusters with those instructions (I'm not sure what to do here).
Finally, we might want to consider bumping the max core count on the kubernetes cluster. Right now we refuse to scale past 400 cores.
I launched a dask-gateway cluster: dask-gateway/clusters/prod.b42143a1ef2840cab44304f8f476cabd/
At some point the notebook stalled and I lost connection (I don't think the problem was the cluster but rather my connection, though I am not 100% sure).
I couldn't do
cluster.close()
because the notebook was not responsive, so I restarted the kernel. That usually closes the cluster as well. But right now the cluster is still up there: https://us-central1-b.gcp.pangeo.io/services/dask-gateway/clusters/prod.b42143a1ef2840cab44304f8f476cabd/status
I am not sure how I can kill it!
Is there a way for a user to kill a cluster in this situation? Is it possible to have that nice feature we used to have with dask-kubernetes clusters, which listed all the created clusters in the left column? I don't think we really need all of that functionality, but a "shutdown" button could be useful.
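(One possible way out in this situation, sketched from the reconnect instructions earlier in this thread; the cluster name is the one from the URL above:)

from dask_gateway import GatewayCluster

# Reconnect to the running cluster by name from any notebook session, then stop it.
cluster = GatewayCluster.from_name("prod.b42143a1ef2840cab44304f8f476cabd")
cluster.shutdown()   # stops the cluster server-side, even where close() would only disconnect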