pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

Cannot connect to new Dask cluster #792

Closed tjcrone closed 4 years ago

tjcrone commented 4 years ago

When I try to start a new Dask cluster from inside a notebook:

from dask_gateway import Gateway
gateway = Gateway()
cluster = gateway.new_cluster()

I get a Cluster not found error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-a52233bc4e18> in <module>
      1 from dask_gateway import Gateway
      2 gateway = Gateway()
----> 3 cluster = gateway.new_cluster()

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, shutdown_on_close, **kwargs)
    641             cluster_options=cluster_options,
    642             shutdown_on_close=shutdown_on_close,
--> 643             **kwargs,
    644         )
    645 

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in __init__(self, address, proxy_address, public_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
    816             shutdown_on_close=shutdown_on_close,
    817             asynchronous=asynchronous,
--> 818             loop=loop,
    819         )
    820 

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _init_internal(self, address, proxy_address, public_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
    912             self.status = "starting"
    913         if not self.asynchronous:
--> 914             self.gateway.sync(self._start_internal)
    915 
    916     @property

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
    337             )
    338             try:
--> 339                 return future.result()
    340             except BaseException:
    341                 future.cancel()

/srv/conda/envs/notebook/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

/srv/conda/envs/notebook/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _start_internal(self)
    926             self._start_task = asyncio.ensure_future(self._start_async())
    927         try:
--> 928             await self._start_task
    929         except BaseException:
    930             # On exception, cleanup

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _start_async(self)
    944         # Connect to cluster
    945         try:
--> 946             report = await self.gateway._wait_for_start(self.name)
    947         except GatewayClusterError:
    948             raise

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _wait_for_start(self, cluster_name)
    566         while True:
    567             try:
--> 568                 report = await self._cluster_report(cluster_name, wait=True)
    569             except TimeoutError:
    570                 # Timeout, ignore

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _cluster_report(self, cluster_name, wait)
    559         params = "?wait" if wait else ""
    560         url = "%s/api/v1/clusters/%s%s" % (self.address, cluster_name, params)
--> 561         resp = await self._request("GET", url)
    562         data = await resp.json()
    563         return ClusterReport._from_json(self._public_address, self.proxy_address, data)

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _request(self, method, url, json)
    407 
    408             if resp.status in {404, 422}:
--> 409                 raise ValueError(msg)
    410             elif resp.status == 409:
    411                 raise GatewayClusterError(msg)

ValueError: Cluster ooi-prod.94433c9e4d69409eaf0acbf438ffc9d4 not found

However, the scheduler dask-scheduler-94433c9e4d69409eaf0acbf438ffc9d4 has been started.

Any thoughts on what might be going on here?
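
One way to narrow this down from the notebook is to ask the gateway which clusters it knows about right after the failure. The sketch below is a diagnostic under stated assumptions, not the fix that resolved this issue: the helper name `cluster_known_to_gateway` is hypothetical, and it relies only on `Gateway.list_clusters()` and on the wording of the error message shown above.

```python
import re


def cluster_known_to_gateway(gateway, error):
    """Check whether the gateway lists the cluster named in a 'not found' error.

    Returns (name, known). name is None when the message does not have the
    'Cluster <name> not found' shape seen in the traceback above.
    """
    match = re.search(r"Cluster (\S+) not found", str(error))
    if match is None:
        return None, False
    name = match.group(1)
    # list_clusters() returns reports that each carry a .name attribute.
    known_names = {report.name for report in gateway.list_clusters()}
    return name, name in known_names


# Usage in a notebook (needs a reachable dask-gateway deployment):
#
#     from dask_gateway import Gateway
#     gateway = Gateway()
#     try:
#         cluster = gateway.new_cluster()
#     except ValueError as error:
#         print(cluster_known_to_gateway(gateway, error))
```

If the gateway does not list the cluster while its scheduler is running, that would suggest the problem sits on the gateway side rather than in the notebook client, which is where the server logs become useful.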

tjcrone commented 4 years ago

I get the same type of error when trying to create a new cluster using the Dask labextension. Here are some details on my configuration:

dask.config.config['gateway']
{'auth': {'type': 'jupyterhub', 'kwargs': {}},
 'cluster': {'options': {'image': '{JUPYTER_IMAGE_SPEC}'}},
 'public_address': '/services/dask-gateway/',
 'address': 'http://10.1.128.135:8000/services/dask-gateway/',
 'proxy_address': 'gateway://traefik-ooi-prod-dask-gateway.ooi-prod:80',
 'http-client': {'proxy': True}}
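
The config above names separate endpoints. The traceback shows `address` being used to build the `/api/v1/clusters/...` request that returned the error. As far as I understand dask-gateway, `proxy_address` is used for the scheduler connection instead, so treat that part as an assumption. A minimal sketch that splits the values copied from above into scheme, host, and port makes the difference easier to see (the `endpoint` helper is hypothetical):

```python
from urllib.parse import urlsplit

# Values copied from the dask.config output above.
gateway_config = {
    "public_address": "/services/dask-gateway/",
    "address": "http://10.1.128.135:8000/services/dask-gateway/",
    "proxy_address": "gateway://traefik-ooi-prod-dask-gateway.ooi-prod:80",
}


def endpoint(url):
    """Split a URL into (scheme, host, port); missing parts are '' or None."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.port


endpoints = {key: endpoint(value) for key, value in gateway_config.items()}
for key, value in endpoints.items():
    print(key, value)
```

Here the API requests go to an internal IP on port 8000, while the proxy address points at the traefik service on port 80, so a failure on one path does not necessarily say anything about the other.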

TomAugspurger commented 4 years ago

Any logs from your dask-gateway pods (controller, api, and traefik) or the scheduler pod?
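
A rough sketch of how those logs might be collected follows. It only builds `kubectl logs` command lines and does not run them. The namespace `ooi-prod` and the name `traefik-ooi-prod-dask-gateway` are read off the `proxy_address` in the config above; the `api-` and `controller-` deployment names are assumed by analogy and may differ in a given deployment, and the `log_commands` helper itself is hypothetical.

```python
def log_commands(namespace, release, scheduler_pod=None, tail=200):
    """Build kubectl commands (as argument lists) for the gateway component logs."""
    commands = [
        [
            "kubectl", "logs", "-n", namespace,
            # Assumed naming pattern: <component>-<release>-dask-gateway
            f"deployment/{component}-{release}-dask-gateway",
            f"--tail={tail}",
        ]
        for component in ("controller", "api", "traefik")
    ]
    if scheduler_pod is not None:
        commands.append(
            ["kubectl", "logs", "-n", namespace, scheduler_pod, f"--tail={tail}"]
        )
    return commands


commands = log_commands(
    namespace="ooi-prod",
    release="ooi-prod",
    scheduler_pod="dask-scheduler-94433c9e4d69409eaf0acbf438ffc9d4",
)
for command in commands:
    print(" ".join(command))
```

Each printed line can be pasted into a shell with cluster access, or passed to `subprocess.run` once the names have been checked against `kubectl get deployments,pods -n ooi-prod`.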

tjcrone commented 4 years ago

I rolled back Helm and restarted all the pods, and things seem to be sorted now. At least, I can connect to a Dask cluster. There still seems to be a lot of strangeness, though: Grafana and other resources are installed that I thought we had disabled. I wonder if there is any way for us to move toward a deployment that doesn't change all the time, something like pangeo-stable alongside pangeo-dev? Could that help things?

TomAugspurger commented 4 years ago

Hard to say when we aren't sure what changed / caused the trouble.

IIRC, Sebastian has been working on doing independent deployments. I'm not sure what the status is on that.



tjcrone commented 4 years ago

Okay, that would be great! I will close this for now since I think the original issue is solved. Thanks for your help.