Make sure dask clusters are putting pods on the correct nodes (check taints/affinities/etc.)
@jhamman which node pool do you expect the dask scheduler to land on? IMO, it makes sense for it to be in the jupyter pool (not preemptible), and the Dask workers to be in the dask-pool (preemptible).
I think right now we're adding the scheduler to the dask-pool, but it doesn't have the right toleration to run on preemptible nodes.
but it doesn't have the right toleration to run on preemptible nodes.
Nevermind, it does have the toleration. We're getting failures to start clusters because of a different issue (looking into it now).
May still be worth discussing where the scheduler is run though.
I agree, the scheduler should be in the same pool as the jupyter notebook. We may need to just use the jupyter toleration here.
@jcrist I'm looking into securing the dask gateways with TLS. I think the lack of HTTPS is preventing the clusters from working with the jupyterlab plugin (loading http content on an https webpage). I had a high-level question before I began digging into it too far.
Our jupyterhubs are using zero-to-jupyterhub's auto-https. This makes an autohttps pod that, IIUC, handles all the Let's Encrypt stuff automatically.
Is it sensible / possible for the dask-gateway pods to reuse that setup to get the TLS certificates in place?
TLS is now working on staging.hub.pangeo.io, which fixed the dask-labextension.
Thanks Jim!
@TomAugspurger - FYI, I've marked all the public gateway IP addresses as static.
@jhamman and @TomAugspurger - an initial test of this is not working on the AWS hub, probably related to getting https set up on the various SVCs created by dask-gateway. Might also be due to the relatively new autohttps setup on jupyterhub https://github.com/pangeo-data/pangeo-cloud-federation/issues/563.
Following this configuration https://github.com/pangeo-data/pangeo-cloud-federation/pull/520/files
I see
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
gateway-api-icesat2-prod-dask-gateway ClusterIP 10.100.228.113 <none> 8001/TCP 31d
hub ClusterIP 10.100.88.80 <none> 8081/TCP 347d
proxy-api ClusterIP 10.100.43.198 <none> 8001/TCP 347d
proxy-http ClusterIP 10.100.135.9 <none> 8000/TCP 347d
proxy-public LoadBalancer 10.100.166.102 XXXX(PROXY-PUBLIC).us-west-2.elb.amazonaws.com 443:32434/TCP,80:30204/TCP 347d
scheduler-api-icesat2-prod-dask-gateway ClusterIP 10.100.133.197 <none> 8001/TCP 31d
scheduler-public-icesat2-prod-dask-gateway LoadBalancer 10.100.199.78 XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com 8786:31542/TCP 31d
web-api-icesat2-prod-dask-gateway ClusterIP 10.100.42.60 <none> 8001/TCP 31d
web-public-icesat2-prod-dask-gateway LoadBalancer 10.100.192.199 XXXX(WEB-PUBLIC).us-west-2.elb.amazonaws.com 80:30816/TCP 31d
And if I set:
gateway = Gateway(address='https://XXXX(WEB-PUBLIC).us-west-2.elb.amazonaws.com/services/dask-gateway',
proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
auth='jupyterhub')
I get the following traceback running cluster = gateway.new_cluster()
---------------------------------------------------------------------------
HTTPTimeoutError Traceback (most recent call last)
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in _fetch(self, req)
345 self._cookie_jar.pre_request(req)
--> 346 resp = await client.fetch(req, raise_error=False)
347 if resp.code == 401:
HTTPTimeoutError: Timeout while connecting
During handling of the above exception, another exception occurred:
TimeoutError Traceback (most recent call last)
<ipython-input-7-bf15802dc141> in <module>
----> 1 cluster = gateway.new_cluster()
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, shutdown_on_close, **kwargs)
581 cluster : GatewayCluster
582 """
--> 583 return GatewayCluster(
584 address=self.address,
585 proxy_address=self.proxy_address,
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in __init__(self, address, proxy_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
751 **kwargs,
752 )
--> 753 self._init_internal(
754 address=address,
755 proxy_address=proxy_address,
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in _init_internal(self, address, proxy_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
851 self.status = "starting"
852 if not self.asynchronous:
--> 853 self.gateway.sync(self._start_internal)
854
855 @property
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
310 )
311 try:
--> 312 return future.result()
313 except BaseException:
314 future.cancel()
/srv/conda/envs/notebook/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
437 raise CancelledError()
438 elif self._state == FINISHED:
--> 439 return self.__get_result()
440 else:
441 raise TimeoutError()
/srv/conda/envs/notebook/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
386 def __get_result(self):
387 if self._exception:
--> 388 raise self._exception
389 else:
390 return self._result
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in _start_internal(self)
865 self._start_task = asyncio.ensure_future(self._start_async())
866 try:
--> 867 await self._start_task
868 except BaseException:
869 # On exception, cleanup
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in _start_async(self)
878 if self.status == "created":
879 self.status = "starting"
--> 880 self.name = await self.gateway._submit(
881 cluster_options=self._cluster_options, **self._cluster_kwargs
882 )
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in _submit(self, cluster_options, **kwargs)
475 headers=HTTPHeaders({"Content-type": "application/json"}),
476 )
--> 477 resp = await self._fetch(req)
478 data = json.loads(resp.body)
479 return data["name"]
/srv/conda/envs/notebook/lib/python3.8/site-packages/dask_gateway/client.py in _fetch(self, req)
370 # Tornado 6 still raises these above with raise_error=False
371 if exc.code == 599:
--> 372 raise TimeoutError("Request timed out")
373 # Should never get here!
374 raise
TimeoutError: Request timed out
kubectl logs scheduler-proxy-icesat2-prod-dask-gateway-6c584cd5b7-2v9mq -n icesat2-prod
is reporting [W 2020-03-17 23:20:35.656 SchedulerProxy] Extracting SNI: Error reading TLS record header: EOF
every couple of seconds. I suspect https isn't automatically being enabled for those external IPs, because dropping the 's' from https works and I'm able to access the dashboard URL.
Seems related to https://github.com/dask/dask-gateway/issues/191, so pinging @jcrist and @yuvipanda
I think you want:
# Use the jupyterhub proxy address above, not the dask-gateway proxy address
gateway = Gateway(address='XXXX(PROXY-PUBLIC).us-west-2.elb.amazonaws.com/services/dask-gateway',
proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
auth='jupyterhub')
gateway.list_clusters()
Assuming you're intending on routing through the JupyterHub proxy.
Thanks for the tip @jcrist - yes that is what we're going for. I was confused as to which IP to use. Unfortunately using that ELB or the JHub public address I see two different tracebacks:
gateway = Gateway(address='XXXX(PROXY-PUBLIC).us-west-2.elb.amazonaws.com/services/dask-gateway',
proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
auth='jupyterhub')
---------------------------------------------------------------------------
SSLError Traceback (most recent call last)
<ipython-input-28-bf15802dc141> in <module>
----> 1 cluster = gateway.new_cluster()
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, shutdown_on_close, **kwargs)
589 cluster_options=cluster_options,
590 shutdown_on_close=shutdown_on_close,
--> 591 **kwargs,
592 )
593
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in __init__(self, address, proxy_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
759 shutdown_on_close=shutdown_on_close,
760 asynchronous=asynchronous,
--> 761 loop=loop,
762 )
763
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _init_internal(self, address, proxy_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
851 self.status = "starting"
852 if not self.asynchronous:
--> 853 self.gateway.sync(self._start_internal)
854
855 @property
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
310 )
311 try:
--> 312 return future.result()
313 except BaseException:
314 future.cancel()
/srv/conda/envs/pangeo/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()
/srv/conda/envs/pangeo/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _start_internal(self)
865 self._start_task = asyncio.ensure_future(self._start_async())
866 try:
--> 867 await self._start_task
868 except BaseException:
869 # On exception, cleanup
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _start_async(self)
879 self.status = "starting"
880 self.name = await self.gateway._submit(
--> 881 cluster_options=self._cluster_options, **self._cluster_kwargs
882 )
883 # Connect to cluster
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _submit(self, cluster_options, **kwargs)
475 headers=HTTPHeaders({"Content-type": "application/json"}),
476 )
--> 477 resp = await self._fetch(req)
478 data = json.loads(resp.body)
479 return data["name"]
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _fetch(self, req)
344 try:
345 self._cookie_jar.pre_request(req)
--> 346 resp = await client.fetch(req, raise_error=False)
347 if resp.code == 401:
348 context = self.auth.pre_request(req, resp)
/srv/conda/envs/pangeo/lib/python3.7/site-packages/tornado/simple_httpclient.py in run(self)
334 ssl_options=ssl_options,
335 max_buffer_size=self.max_buffer_size,
--> 336 source_ip=source_ip,
337 )
338
/srv/conda/envs/pangeo/lib/python3.7/site-packages/tornado/tcpclient.py in connect(self, host, port, af, ssl_options, max_buffer_size, source_ip, source_port, timeout)
292 else:
293 stream = await stream.start_tls(
--> 294 False, ssl_options=ssl_options, server_hostname=host
295 )
296 return stream
/srv/conda/envs/pangeo/lib/python3.7/site-packages/tornado/iostream.py in _do_ssl_handshake(self)
1415 self._handshake_reading = False
1416 self._handshake_writing = False
-> 1417 self.socket.do_handshake()
1418 except ssl.SSLError as err:
1419 if err.args[0] == ssl.SSL_ERROR_WANT_READ:
/srv/conda/envs/pangeo/lib/python3.7/ssl.py in do_handshake(self, block)
1137 if timeout == 0.0 and block:
1138 self.settimeout(None)
-> 1139 self._sslobj.do_handshake()
1140 finally:
1141 self.settimeout(timeout)
SSLError: [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:1076)
OR using the hub login address
gateway = Gateway(address='https://aws-uswest2.pangeo.io/services/dask-gateway',
proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
auth='jupyterhub')
---------------------------------------------------------------------------
HTTPClientError Traceback (most recent call last)
<ipython-input-26-bf15802dc141> in <module>
----> 1 cluster = gateway.new_cluster()
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, shutdown_on_close, **kwargs)
589 cluster_options=cluster_options,
590 shutdown_on_close=shutdown_on_close,
--> 591 **kwargs,
592 )
593
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in __init__(self, address, proxy_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
759 shutdown_on_close=shutdown_on_close,
760 asynchronous=asynchronous,
--> 761 loop=loop,
762 )
763
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _init_internal(self, address, proxy_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
851 self.status = "starting"
852 if not self.asynchronous:
--> 853 self.gateway.sync(self._start_internal)
854
855 @property
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
310 )
311 try:
--> 312 return future.result()
313 except BaseException:
314 future.cancel()
/srv/conda/envs/pangeo/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()
/srv/conda/envs/pangeo/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _start_internal(self)
865 self._start_task = asyncio.ensure_future(self._start_async())
866 try:
--> 867 await self._start_task
868 except BaseException:
869 # On exception, cleanup
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _start_async(self)
879 self.status = "starting"
880 self.name = await self.gateway._submit(
--> 881 cluster_options=self._cluster_options, **self._cluster_kwargs
882 )
883 # Connect to cluster
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _submit(self, cluster_options, **kwargs)
475 headers=HTTPHeaders({"Content-type": "application/json"}),
476 )
--> 477 resp = await self._fetch(req)
478 data = json.loads(resp.body)
479 return data["name"]
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _fetch(self, req)
366 raise GatewayServerError(msg)
367 else:
--> 368 resp.rethrow()
369 except HTTPError as exc:
370 # Tornado 6 still raises these above with raise_error=False
/srv/conda/envs/pangeo/lib/python3.7/site-packages/tornado/httpclient.py in rethrow(self)
675 """If there was an error on the request, raise an `HTTPError`."""
676 if self.error:
--> 677 raise self.error
678
679 def __repr__(self) -> str:
HTTPClientError: HTTP 503: Service Unavailable
Oh wait, I might have misunderstood what the proxy-public service was. Is that not the jupyterhub proxy?
Try without routing through jupyterhub to see if the service is up:
gateway = Gateway(address='http://XXXX(WEB-PUBLIC).us-west-2.elb.amazonaws.com',
proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
auth='jupyterhub')
gateway.list_clusters()
The address you pass to address should be a valid http(s) address that will eventually reach the gateway api server. Normally this is through the gateway's web-proxy (the web-public service above). If you're routing through jupyterhub's proxy, then this is jupyterhub's proxy address with the gateway's service path added (/services/dask-gateway/).
We got ours working using the web-public-XXXX-prod-dask-gateway external IP address for the gateway, not the proxy-public (JHub) address. I have not set up a static address yet, so not using DNS. Probably not an issue for you, but it never hurts to cut out DNS when troubleshooting stuff like this. You can also use curl from inside the cluster to see who is up and serving what. Some hints on that here: https://github.com/dask/dask-gateway/issues/191. I'm just learning dask-gateway but happy to keep thinking on this with you.
Thanks @tjcrone and @jcrist. The proxying and network stuff is well out of my wheel-house so I feel I'm grasping in the dark here!
gateway = Gateway(address='http://XXXX(WEB-PUBLIC).us-west-2.elb.amazonaws.com',
proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
auth='jupyterhub')
gateway.list_clusters()
Gives this traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-0ecbfbe3dfc3> in <module>
2 proxy_address='tls://(REDACTED).us-west-2.elb.amazonaws.com:8786',
3 auth='jupyterhub')
----> 4 gateway.list_clusters()
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in list_clusters(self, status, **kwargs)
408 clusters : list of ClusterReport
409 """
--> 410 return self.sync(self._clusters, status=status, **kwargs)
411
412 def get_cluster(self, cluster_name, **kwargs):
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
310 )
311 try:
--> 312 return future.result()
313 except BaseException:
314 future.cancel()
/srv/conda/envs/pangeo/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()
/srv/conda/envs/pangeo/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _clusters(self, status)
387 url = "%s/gateway/api/clusters/%s" % (self.address, query)
388 req = HTTPRequest(url=url)
--> 389 resp = await self._fetch(req)
390 return [
391 ClusterReport._from_json(self._public_address, self.proxy_address, r)
/srv/conda/envs/pangeo/lib/python3.7/site-packages/dask_gateway/client.py in _fetch(self, req)
360
361 if resp.code in {404, 422}:
--> 362 raise ValueError(msg)
363 elif resp.code == 409:
364 raise GatewayClusterError(msg)
ValueError: Not Found
but interestingly adding /services/dask-gateway to the XXXX(WEB-PUBLIC) succeeds and returns an empty list!
gateway = Gateway(address='http://XXXX(WEB-PUBLIC).us-west-2.elb.amazonaws.com/services/dask-gateway/',
proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
auth='jupyterhub')
gateway.list_clusters()
This is exactly how things went for me, and it only started working once you told me to add /services/dask-gateway/ to the address. Getting an empty list here is good! What happens when you try to start, scale up, and connect to a cluster:
cluster = gateway.new_cluster()
cluster.scale(8)
client = cluster.get_client()
client
Yep @tjcrone it works!... but I'm not really following the changes here https://github.com/pangeo-data/pangeo-cloud-federation/pull/520. My understanding was that without the https:// mapping to the jupyterhub proxy, the dask labextension can't connect to a cluster started via dask-gateway?
I'm confused about enabling https in general for the dask-gateway LoadBalancers. In jupyterhub there is a more explicit mapping:
jupyterhub:
  proxy:
    https:
      hosts:
        - aws-uswest2.pangeo.io
      letsencrypt:
        contactEmail: scottyh@uw.edu
    service:
      loadBalancerIP: XXXX(PROXY-PUBLIC).us-west-2.elb.amazonaws.com
@scottyhq, I'm still working on getting the Dask labextension working (and the dashboard), and making sure that TLS is working all the way through. I'll keep you posted as I learn more.
One more thing I noticed looking at pod logs in case it's helpful is that the hub is loaded with Unexpected error connecting to web-public-dev-staging-dask-gateway:80 messages
kubectl logs hub-7844c8f9cb-2jrhf -n icesat2-prod
[I 2020-03-18 01:12:14.804 JupyterHub proxy:320] Checking routes
[E 2020-03-18 01:12:14.916 JupyterHub utils:75] Unexpected error connecting to web-public-dev-staging-dask-gateway:80 [Errno -2] Name or service not known
[E 2020-03-18 01:12:14.988 JupyterHub utils:75] Unexpected error connecting to web-public-dev-staging-dask-gateway:80 [Errno -2] Name or service not known
[E 2020-03-18 01:12:15.304 JupyterHub utils:75] Unexpected error connecting to web-public-dev-staging-dask-gateway:80 [Errno -2] Name or service not known
[E 2020-03-18 01:12:15.804 JupyterHub utils:75] Unexpected error connecting to web-public-dev-staging-dask-gateway:80 [Errno -2] Name or service not known
[W 2020-03-18 01:12:15.804 JupyterHub app:1903] Cannot connect to external service dask-gateway at http://web-public-dev-staging-dask-gateway
Same thing on our cluster.
@scottyhq, I am also getting the 503 Service Unavailable when trying to start dask-gateway through proxy-public. And when I try to start a cluster using web-public-ooi-prod-dask-gateway, I need to add /services/dask-gateway, which is interesting. This doesn't seem to align with the suggestions from @jcrist or @TomAugspurger, or the docs for that matter. So I'm not sure what is going on.
This doesn't seem to align with the suggestions from @jcrist or @TomAugspurger, or the docs for that matter. So I'm not sure what is going on.
Apologies, I forgot about this bit.
The configuration here is to run dask-gateway as a JupyterHub service, which involves proxying requests through JupyterHub's proxy (and can thus rely on JupyterHub for HTTPS support). So a request looks like:
user -> jupyterhub-proxy (HTTPS) -> dask-gateway-proxy (HTTP) -> dask-gateway-server (HTTP)
Services are proxied on /services/{service-name} - in this case this is /services/dask-gateway/. Because JupyterHub's proxy doesn't support stripping prefixes, requests arrive at the dask-gateway-proxy with the /services/dask-gateway/ prefix (e.g. what would normally be a /api/clusters/ call is now a /services/dask-gateway/api/clusters/ call). To support this, dask-gateway has been configured to run with the same prefix prepended to all routes (this is the bit I forgot about when making suggestions above). So you do need the /services/dask-gateway/ path whether you're connecting through the JupyterHub proxy or directly through the dask-gateway proxy.
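To make that concrete, here is a sketch using the redacted hostnames from the kubectl output above (both forms need the /services/dask-gateway/ prefix; the only difference is which proxy the request passes through):
from dask_gateway import Gateway

# Through JupyterHub's proxy (HTTPS terminated by the hub):
gateway = Gateway(address='https://XXXX(PROXY-PUBLIC).us-west-2.elb.amazonaws.com/services/dask-gateway/',
                  proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
                  auth='jupyterhub')

# Or directly through the gateway's own web proxy (plain HTTP here):
gateway = Gateway(address='http://XXXX(WEB-PUBLIC).us-west-2.elb.amazonaws.com/services/dask-gateway/',
                  proxy_address='tls://XXXX(SCHEDULER-PUBLIC).us-west-2.elb.amazonaws.com:8786',
                  auth='jupyterhub')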
From the above, it looks like you're successfully connecting and running when handled directly, but unsuccessful when connecting through the JupyterHub proxy. Is the service configured properly? Is your configuration available somewhere so I can help debug?
One more thing I noticed looking at pod logs in case it's helpful is that the hub is loaded with Unexpected error connecting to web-public-dev-staging-dask-gateway:80 messages
I assume this happens at startup, but not later in the logs? When JupyterHub is starting up it tries to connect to all registered services (it also performs these checks periodically). If dask-gateway is also starting up at the same time, it may fail these checks initially (since it isn't running yet). Shouldn't be anything to worry about.
@jcrist, thank you for this helpful clarification. The bulk of our configuration options are here: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/pangeo-deploy/values.yaml. The per-deployment configs are, as an example, here: https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/ooi/config/common.yaml. Looks to me like in our deployment (ooi) we are not specifying that the service start, as @TomAugspurger did here:
Ah, yeah, without the url dask-gateway won't be proxied behind the jupyterhub proxy. We might be able to automate that configuration with some helm magic so that each pangeo deployment doesn't need to manually add that url. For now though that's likely http://web-public-ooi-staging-dask-gateway or something like that.
Okay that seemed to work @jcrist! I'm not sure why, but our version of dask-gateway appears to be 0.3.0, at least in my image, and that was a problem that I solved temporarily with a conda install. But I still cannot access the dashboard. Now getting a 500 Internal Server Error.
Oh I see, the cluster object provides the right dashboard url, but the client object does not. Getting closer!!
Making tons of progress thanks to help from all y'alls. Thank you very much! Here's a comment in my recent PR that probably belongs in this thread: https://github.com/pangeo-data/pangeo-cloud-federation/pull/568#issuecomment-600843647
Confirmed this is now working on AWS with
gateway = Gateway(address='https://staging.aws-uswest2.pangeo.io/services/dask-gateway',
proxy_address='tls://scheduler-public-icesat2-staging-dask-gateway:8786',
auth='jupyterhub')
@jhamman and @TomAugspurger. Is it possible to connect to the gateway in GKE from aws-uswest2.pangeo.io with the current setup? What address, proxy, and auth would need to be provided in that case?
Good question @scottyhq! By default, things don't quite work as easily as
gateway = Gateway(address='https://staging.aws-uswest2.pangeo.io/services/dask-gateway',
proxy_address='tls://scheduler-public-icesat2-staging-dask-gateway:8786',
auth='jupyterhub')
because the two jupyterhubs have their own API tokens. My GKE API token stored in JUPYTERHUB_API_TOKEN isn't valid for the aws-uswest2 hub. However, if I manually create an API token on the AWS cluster at https://staging.aws-uswest2.pangeo.io/hub/token:
from dask_gateway import Gateway
from dask_gateway.auth import JupyterHubAuth
auth = JupyterHubAuth(api_token="<my-token>")
gateway = Gateway(address='https://staging.aws-uswest2.pangeo.io/services/dask-gateway',
proxy_address='tls://scheduler-public-icesat2-staging-dask-gateway:8786',
auth=auth)
Then things work! It'd be great if there were a way to "sync" API tokens between two hubs, since we're both using GitHub for auth, but I don't know if that's possible.
Well, things kinda work. I'm able to connect to the us-west gateway, but I can't create a cluster.
cluster = gateway.new_cluster()
---------------------------------------------------------------------------
GatewayClusterError Traceback (most recent call last)
<ipython-input-21-a899aa24fb70> in <module>
----> 1 cluster = gateway.new_cluster()
2 cluster
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, shutdown_on_close, **kwargs)
589 cluster_options=cluster_options,
590 shutdown_on_close=shutdown_on_close,
--> 591 **kwargs,
592 )
593
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in __init__(self, address, proxy_address, auth, cluster_options, shutdown_on_close, asynchronous, loop, **kwargs)
759 shutdown_on_close=shutdown_on_close,
760 asynchronous=asynchronous,
--> 761 loop=loop,
762 )
763
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _init_internal(self, address, proxy_address, auth, cluster_options, cluster_kwargs, shutdown_on_close, asynchronous, loop, name)
851 self.status = "starting"
852 if not self.asynchronous:
--> 853 self.gateway.sync(self._start_internal)
854
855 @property
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
310 )
311 try:
--> 312 return future.result()
313 except BaseException:
314 future.cancel()
/srv/conda/envs/notebook/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()
/srv/conda/envs/notebook/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _start_internal(self)
865 self._start_task = asyncio.ensure_future(self._start_async())
866 try:
--> 867 await self._start_task
868 except BaseException:
869 # On exception, cleanup
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _start_async(self)
883 # Connect to cluster
884 try:
--> 885 report = await self.gateway._wait_for_start(self.name)
886 except GatewayClusterError:
887 raise
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _wait_for_start(self, cluster_name)
524 raise GatewayClusterError(
525 "Cluster %r failed to start, see logs for "
--> 526 "more information" % cluster_name
527 )
528 elif report.status is ClusterStatus.STOPPED:
GatewayClusterError: Cluster '545e13e9207d4dada507bf4a58adce79' failed to start, see logs for more information
I don't have access to the logs.
If things are set up to use your notebook image by default, could it be that the image isn't available on the other install (e.g. it's in Amazon's/Google's image registry)?
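If that turns out to be the cause, a minimal sketch of a possible workaround, assuming the remote deployment exposes an image cluster option as discussed further down (the image name is just a placeholder):
from dask_gateway import Gateway

gateway = Gateway()  # stands in for the fully-configured remote gateway above
options = gateway.cluster_options()
options.image = "pangeo/base-notebook:latest"  # placeholder: any image pullable from that cloud's registry
cluster = gateway.new_cluster(options)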
We could set up multiple gateways for a single hub instance (and I have some ideas for how to do it for multiple hubs per gateway as well), but the current setup is really optimized for the 1:1 case. Manually getting a token does work, but isn't ideal.
Thanks for looking into this @TomAugspurger and @jcrist. I think this ability to connect to various clusters from a single hub would be extremely powerful, but it does also introduce more complications. In particular it opens up the possibility of users streaming a lot of data between cloud-providers and incurring huge egress costs. So for now sticking with the 1:1 options seems best. In the future though, is there a way to set per-user network transfer limits in a similar way to CPU/RAM limits?
Then things work! It'd be great if there were a way to "sync" API tokens between two hubs, since we're both using GitHub for auth, but I don't know if that's possible.
At least for hubs managed in this repo, we can use the same API tokens in a shared secrets file, right? Security wise that's probably not ideal, but maybe we can rotate them periodically.
If things are set up to use your notebook image by default, could it be that the image isn't available on the other install (e.g. it's in Amazon's/Google's image registry)?
Good point! This is another argument for storing all our images on DockerHub. In our case I think we don't really need private registries for these jupyterhub images.
At least for hubs managed in this repo, we can use the same API tokens in a shared secrets file, right? Security wise that's probably not ideal, but maybe we can rotate them periodically.
I wondered about that. I think that the API key in question is randomly generated by the Hub, not by us. But it'd be good to verify that.
I think Jim is right that the image is likely the culprit. If I have time later today I'll try the other direction (connecting to the GKE hub from AWS), since I have access to the GKE logs.
Great progress lately on Dask-gateway. This will be a welcome improvement!
One thing I'm wondering about is a user's choice of worker image. I was under the impression, perhaps mistakenly, that Dask-gateway would allow us to control which images are run on worker nodes. Is this the case? Or instead, does the architecture of Dask-gateway reduce the risk of allowing users to deploy arbitrary images in the unlikely case that they would be used malevolently? I see that recent changes define the worker image in a user-space environment variable, so any help in understanding this would be greatly appreciated. Thanks!
I was under the impression, perhaps mistakenly, that Dask-gateway would allow us to control which images are run on worker nodes. Is this the case?
You're correct. Currently we provide users the option to pick an arbitrary image, and set the default to '{JUPYTER_IMAGE_SPEC}', the image their hub singleuser instance is using. You can override the option_handler to provide a whitelist of images to pick from.
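For reference, a rough sketch of what that server-side whitelist could look like. This would live in the gateway server's config (where c is the traitlets config object); the option classes come from dask_gateway_server.options, but the exact config attribute varies between dask-gateway versions and the image names are placeholders:
from dask_gateway_server.options import Options, Select

def options_handler(options):
    # Translate the user's selection into cluster configuration.
    return {"image": options.image}

# The config attribute depends on the dask-gateway version
# (newer releases use c.Backend.cluster_options instead).
c.DaskGateway.cluster_manager_options = Options(
    Select(
        "image",
        options=["pangeo/pangeo-notebook:latest", "pangeo/base-notebook:latest"],  # placeholder whitelist
        default="pangeo/pangeo-notebook:latest",
        label="Worker Image",
    ),
    handler=options_handler,
)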
I'm able to use Dask-Gateway to connect to & create clusters from staging dev, ocean, and hydro. With https://github.com/pangeo-data/pangeo-cloud-federation/pull/576, things should be ready on prod as well (after a staging -> prod merge).
The version of distributed in hydro is a bit old to connect a client to it. It would require GatewayCluster.get_client() instead. Is hydro still in use?
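For reference, a minimal sketch of that workaround (assuming a cluster started as in the earlier examples):
from dask_gateway import Gateway

gateway = Gateway()
cluster = gateway.new_cluster()

# Client(cluster) needs a newer distributed in the image; going through the
# cluster object still works with the older version pinned in hydro.
client = cluster.get_client()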
Everything seems to be working well on prod. I think we're good here.
@jhamman @TomAugspurger - While running some additional tests I'm realizing one drawback of putting the schedulers on a separate nodegroup that scales to zero (https://github.com/pangeo-data/pangeo-cloud-federation/pull/569) is that it can take a long time for the new_cluster() command to complete. The main issue is that there is no feedback; it just seems like the kernel is hanging for 5 minutes:
%%time
from dask_gateway import Gateway
from dask.distributed import Client, progress
gateway = Gateway()
cluster = gateway.new_cluster()
CPU times: user 76.3 ms, sys: 30.7 ms, total: 107 ms Wall time: 5min 29s
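Not a fix, but a minimal sketch of one way to get at least some feedback in the meantime (plain Python; nothing here is specific to our deployment):
import time
from dask_gateway import Gateway

gateway = Gateway()
print("Requesting cluster; this can take several minutes while the "
      "scheduler node pool scales up from zero...")
start = time.time()
cluster = gateway.new_cluster()
print(f"Cluster ready after {time.time() - start:.0f} s")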
Yeah I’ve noticed that as well. I haven’t come up with a better alternative than always keeping 1 VM around.
We're in the process of testing and customizing dask-gateway. Here is a checklist (from #481) for todo items that we can work on. We can use this issue to track progress on these plus any additional configurations we see as necessary.
image: ${JUPYTER_IMAGE_SPEC}
(~https://github.com/pangeo-data/pangeo-cloud-federation/pull/506~, ~https://github.com/pangeo-data/pangeo-cloud-federation/pull/507~, https://github.com/pangeo-data/pangeo-cloud-federation/pull/518)