pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

zombie cluster? #658

Open chiaral opened 4 years ago

chiaral commented 4 years ago

I launched a dask gateway cluster dask-gateway/clusters/prod.b42143a1ef2840cab44304f8f476cabd/

At some point the notebook stalled and I lost the connection (I don't think the problem was the cluster but my connection, though I am not 100% sure).

I couldn't do cluster.close() because the notebook was not responsive, so I restarted the kernel. That usually closes the cluster as well.

But right now the cluster is still up there https://us-central1-b.gcp.pangeo.io/services/dask-gateway/clusters/prod.b42143a1ef2840cab44304f8f476cabd/status

I am not sure how I can kill it!

Is there a way for a user to kill a cluster in this situation? Is it possible to have that nice feature we used to have with dask-kubernetes clusters, which listed all the created clusters in the left column? I don't think we really needed all the functionality, but a "shutdown" button could be useful.

(Screenshot attached: Screen Shot 2020-07-14 at 10:16:58 PM)
chiaral commented 4 years ago

This cluster eventually died an hour after I reported this issue.

TomAugspurger commented 4 years ago

There is an idle-timeout on clusters. If there's no activity for a while (30 or 60 minutes?) then the cluster shuts itself down.
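
For reference, the timeout is controlled by dask-gateway's idle_timeout setting on the server; a minimal sketch of that config (the exact value and where this deployment sets it are assumptions):

# dask_gateway_config.py (server side) -- a sketch only; the actual value and
# where this deployment sets it are assumptions. `c` is the traitlets config
# object provided by the config loader.
c.ClusterConfig.idle_timeout = 1800  # shut down after 30 min of no activity; 0 disables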

If this happens again (your local client notebook dies, say), you should be able to reconnect to the cluster. This is one of the advantages of the dask-gateway setup.

>>> from dask_gateway import Gateway

>>> g = Gateway()
>>> g.list_clusters()

If there are any active clusters, that will return a list of cluster reports with their names.

>>> g.list_clusters()
[ClusterReport<name=prod.c288c65c429049e788f41d8308823ca8, status=RUNNING>]

Which you can reconnect to

cluster = g.connect(g.list_clusters()[0].name)
cluster

See https://gateway.dask.org/usage.html#connect-to-an-existing-cluster for more.
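
If you don't need to reconnect at all, a cluster can also be stopped by name. A minimal sketch, assuming the Gateway.stop_cluster method from the dask-gateway client API:

from dask_gateway import Gateway

g = Gateway()

# Stop every cluster still registered with the gateway for this user,
# without attaching a client to any of them.
for report in g.list_clusters():
    g.stop_cluster(report.name)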

chiaral commented 4 years ago

I actually tried to look for those instructions online and almost figured it out, but not quite!

It was most likely closer to 60 min than 30 min. Good to know. I was afraid it would stay up the whole night!

May I suggest adding this set of instructions as a paragraph here: http://pangeo.io/cloud.html? I went back to it hoping to find some info but there is none. I can try to open a PR but it will have to wait for Friday.

chiaral commented 4 years ago

I am in a similar situation. I did

from dask_gateway import Gateway

g = Gateway()
g.list_clusters()

which gives

[ClusterReport]

cluster = g.connect(g.list_clusters()[0].name)
cluster.close()

but the cluster is still there

TomAugspurger commented 4 years ago

Maybe OK now? I don't see any dask- pods lying around.

I don't know if it's related, but there are some warnings in the logs like

Unable to attach or mount volumes: unmounted volumes=[dask-credentials pangeo-token-fvdq2], unattached volumes=[dask-credentials pangeo-token-fvdq2]: timed out waiting for the condition
chiaral commented 4 years ago

Thank you. In fact I think it eventually died because it hadn't been used for 60 min. I am not sure what happened.

On another note: what are the memory/worker limits per user? I am using 23 workers (with 20 GB of memory each) in one cluster, and was trying to spin up another cluster with 5 or 6 workers, but it's not happening. I assume I reached the limit?

Thanks a lot, Tom!

chiaral commented 4 years ago

More issues today:

I have been having issues with clusters today. I was kicked off the platform about 10 min ago (I was running something, got long error messages, and then suddenly I didn't have a server any more). Now I am trying to restart things and I get this:

---------------------------------------------------------------------------
GatewayServerError                        Traceback (most recent call last)
<ipython-input-7-1fc5ea0ac38e> in <module>
      3 options.worker_memory=20
      4 # gateway = Gateway()
----> 5 cluster = gateway.new_cluster(options)
      6 #cluster.adapt(minimum=1, maximum=20)
      7 cluster.scale(30)

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, shutdown_on_close, **kwargs)
    623             cluster_options=cluster_options,
    624             shutdown_on_close=shutdown_on_close,
--> 625             **kwargs,
    626         )
    627 

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in __init__(self, address, proxy_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
    780             shutdown_on_close=shutdown_on_close,
    781             asynchronous=asynchronous,
--> 782             loop=loop,
    783         )
    784 

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _init_internal(self, address, proxy_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
    872             self.status = "starting"
    873         if not self.asynchronous:
--> 874             self.gateway.sync(self._start_internal)
    875 
    876     @property

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
    326             )
    327             try:
--> 328                 return future.result()
    329             except BaseException:
    330                 future.cancel()

/srv/conda/envs/notebook/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

/srv/conda/envs/notebook/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _start_internal(self)
    886             self._start_task = asyncio.ensure_future(self._start_async())
    887         try:
--> 888             await self._start_task
    889         except BaseException:
    890             # On exception, cleanup

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _start_async(self)
    900             self.status = "starting"
    901             self.name = await self.gateway._submit(
--> 902                 cluster_options=self._cluster_options, **self._cluster_kwargs
    903             )
    904         # Connect to cluster

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _submit(self, cluster_options, **kwargs)
    511             options = self._config_cluster_options()
    512             options.update(kwargs)
--> 513         resp = await self._request("POST", url, json={"cluster_options": options})
    514         data = await resp.json()
    515         return data["name"]

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _request(self, method, url, json)
    393                 raise GatewayClusterError(msg)
    394             elif resp.status == 500:
--> 395                 raise GatewayServerError(msg)
    396             else:
    397                 resp.raise_for_status()

GatewayServerError: 500 Internal Server Error

Server got itself in trouble

I have also been trying to kill this cluster: [ClusterReport] but nothing happens. (I do:

from dask_gateway import Gateway

g = Gateway()
g.list_clusters()
cluster = g.connect(g.list_clusters()[0].name)
cluster.close()

)
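
(A possible reason the close() above appears to do nothing, assuming default dask-gateway client behavior: Gateway.connect() uses shutdown_on_close=False, so close() only disconnects the local client. A sketch of an explicit shutdown instead:)

from dask_gateway import Gateway

g = Gateway()
cluster = g.connect(g.list_clusters()[0].name)

# shutdown() asks the gateway to stop the scheduler and workers;
# close() with shutdown_on_close=False only tears down the local client.
cluster.shutdown()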

TomAugspurger commented 4 years ago

was trying to spin up another cluster with 5/6 workers, but it's not happening. I assume i reached the limit?

I don't believe we have any per-user limits in place right now.

In the gateway logs, I see lines like

2020-07-30T15:25:08.528370078Z [E 2020-07-30 15:25:08.527 DaskGateway] Error in cluster informer, retrying...
2020-07-30T15:25:08.528390145Z Traceback (most recent call last):
2020-07-30T15:25:08.528397164Z   File "/opt/conda/lib/python3.7/site-packages/aiohttp/connector.py", line 936, in _wrap_create_connection
2020-07-30T15:25:08.528403279Z     return await self._loop.create_connection(*args, **kwargs)  # type: ignore  # noqa
2020-07-30T15:25:08.528409225Z   File "/opt/conda/lib/python3.7/asyncio/base_events.py", line 958, in create_connection
2020-07-30T15:25:08.528414896Z     raise exceptions[0]
2020-07-30T15:25:08.528420381Z   File "/opt/conda/lib/python3.7/asyncio/base_events.py", line 945, in create_connection
2020-07-30T15:25:08.528425758Z     await self.sock_connect(sock, address)
2020-07-30T15:25:08.528431227Z   File "/opt/conda/lib/python3.7/asyncio/selector_events.py", line 473, in sock_connect
2020-07-30T15:25:08.528436630Z     return await fut
2020-07-30T15:25:08.528441887Z   File "/opt/conda/lib/python3.7/asyncio/selector_events.py", line 503, in _sock_connect_cb
2020-07-30T15:25:08.528447824Z     raise OSError(err, f'Connect call failed {address}')
2020-07-30T15:25:08.528453650Z ConnectionRefusedError: [Errno 111] Connect call failed ('10.39.240.1', 443)
2020-07-30T15:25:08.528459503Z 
2020-07-30T15:25:08.528537503Z The above exception was the direct cause of the following exception:
2020-07-30T15:25:08.528546944Z 
2020-07-30T15:25:08.528552707Z Traceback (most recent call last):
2020-07-30T15:25:08.528558248Z   File "/opt/conda/lib/python3.7/site-packages/dask_gateway_server/backends/kubernetes/utils.py", line 161, in run
2020-07-30T15:25:08.528564324Z     initial = await method(**self.method_kwargs)
2020-07-30T15:25:08.528569923Z   File "/opt/conda/lib/python3.7/site-packages/kubernetes_asyncio/client/api_client.py", line 166, in __call_api
2020-07-30T15:25:08.528576097Z     _request_timeout=_request_timeout)
2020-07-30T15:25:08.528582057Z   File "/opt/conda/lib/python3.7/site-packages/kubernetes_asyncio/client/rest.py", line 191, in GET
2020-07-30T15:25:08.528588265Z     query_params=query_params))
2020-07-30T15:25:08.528594519Z   File "/opt/conda/lib/python3.7/site-packages/kubernetes_asyncio/client/rest.py", line 171, in request
2020-07-30T15:25:08.528600777Z     r = await self.pool_manager.request(**args)
2020-07-30T15:25:08.528606636Z   File "/opt/conda/lib/python3.7/site-packages/aiohttp/client.py", line 483, in _request
2020-07-30T15:25:08.528612391Z     timeout=real_timeout
2020-07-30T15:25:08.528617678Z   File "/opt/conda/lib/python3.7/site-packages/aiohttp/connector.py", line 523, in connect
2020-07-30T15:25:08.528623568Z     proto = await self._create_connection(req, traces, timeout)
2020-07-30T15:25:08.528629044Z   File "/opt/conda/lib/python3.7/site-packages/aiohttp/connector.py", line 859, in _create_connection
2020-07-30T15:25:08.528665891Z     req, traces, timeout)
2020-07-30T15:25:08.528679387Z   File "/opt/conda/lib/python3.7/site-packages/aiohttp/connector.py", line 1004, in _create_direct_connection
2020-07-30T15:25:08.528686658Z     raise last_exc
2020-07-30T15:25:08.528692368Z   File "/opt/conda/lib/python3.7/site-packages/aiohttp/connector.py", line 986, in _create_direct_connection
2020-07-30T15:25:08.528698416Z     req=req, client_error=client_error)
2020-07-30T15:25:08.528703944Z   File "/opt/conda/lib/python3.7/site-packages/aiohttp/connector.py", line 943, in _wrap_create_connection
2020-07-30T15:25:08.528709756Z     raise client_error(req.connection_key, exc) from exc
2020-07-30T15:25:08.528715432Z aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.39.240.1:443 ssl:default [Connect call failed ('10.39.240.1', 443)]


TomAugspurger commented 4 years ago

Right now I see that the kubernetes cluster is adding more machines for Dask workers, which usually takes a few minutes.

TomAugspurger commented 4 years ago

Actually, it's not adding more machines.


| Message | Reason | First seen | Last seen | Count |
| -- | -- | -- | -- | -- |
| pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max node group size reached, 1 max cluster cpu limit reached | NotTriggerScaleUp | Jul 30, 2020, 10:27:47 AM | Jul 30, 2020, 11:08:51 AM | 149 |
| 0/28 nodes are available: 28 Insufficient memory, 5 Insufficient cpu. | FailedScheduling | Jul 30, 2020, 10:47:44 AM | Jul 30, 2020, 11:08:15 AM | 16 |
TomAugspurger commented 4 years ago

@jhamman do you recall the restrictions on which machine types the node autoscaler will use? It should be any n1- type, so 20Gb of memory / worker shouldn't be a problem?

jhamman commented 4 years ago

@jhamman do you recall the restrictions on which machine types the node autoscaler will use? It should be any n1- type, so 20Gb of memory / worker shouldn't be a problem?

Right.

The thing I'm seeing above is:

1 max cluster cpu limit reached
TomAugspurger commented 4 years ago

OK, that's in the autoprovisioning.json file. We have that capped at 400. Are we OK increasing it? Say to 4000?

jhamman commented 4 years ago

Sounds good to me, but @rabernat should weigh in.

chiaral commented 4 years ago

Hello all, thanks for weighing in. I am fine with whatever limit is in place, I would just like to know what it is, so I can plan things :) No need to increase anything yet!

I am not sure what happened, but something seemed off this morning. I got some weird errors I never got before.

I still have that hanging cluster (ClusterReport<name=prod.148abd2ad3dd4f409924265a8d254f08, status=RUNNING>) that I tried to kill for a while. I simply scaled it to 0 (and I could do that), but cluster.close() on it doesn't work. It will stop eventually when it hits the idle timeout, and it is at 0 workers now, but it's weird that I cannot close it.

TomAugspurger commented 4 years ago

Yeah, I think the internal dask-gateway failure at https://github.com/pangeo-data/pangeo-cloud-federation/issues/658#issuecomment-666494483 caused a cluster / some workers to be un-closeable through normal means.

Requesting a second cluster (and another user having a cluster at the same time) possibly made us hit the cluster-wide CPU limit.

TomAugspurger commented 4 years ago

How are things going here @chiaral?

I think there were a couple of issues. First, there were no clear instructions on how to close a running cluster that you aren't connected to.

May I suggest to add this set of instructions as a paragraph to here http://pangeo.io/cloud.html? I went back to it hoping to find some info but there are none.

Done in https://github.com/pangeo-data/pangeo/pull/783

Then there was a secondary issue where dask-gateway itself got in a bad state and you couldn't even connect to running clusters with those instructions (I'm not sure what to do here).

Finally, we might want to consider bumping the max core count on the kubernetes cluster. Right now we refuse to scale past 400 cores.