rsignell-usgs closed this issue 6 years ago
Did anything look interesting when looking at the dashboard? How was memory use during the computation? Did the upper left plot (memory use per worker) ever move away from blue?
@mrocklin, while running cell [16] the Bytes stored number keeps growing; after cell [16] completes, the status looks like this:
(incidentally, the first time I reran cell [16], I got this error:
  File "/opt/conda/lib/python3.6/site-packages/zarr/core.py", line 1730, in _decode_chunk
    chunk = self._compressor.decode(cdata)
  File "numcodecs/blosc.pyx", line 483, in numcodecs.blosc.Blosc.decode
  File "numcodecs/blosc.pyx", line 394, in numcodecs.blosc.decompress
RuntimeError: error during blosc decompression: -1
but I just reran it again and didn't get the error)
When I run cell [20] it just dies immediately with nothing interesting on the dashboard.
I think folks should be able to run this notebook on pangeo without changes if they are interested -- they would just have to install the utide tidal analysis package: conda install -c conda-forge utide
Here's the notebook URL again: http://nbviewer.jupyter.org/gist/rsignell-usgs/91ff0dabe41f354c2677a61db49a8909
The runtime error looks important? It might be worth trying to track that down.
@mrocklin, I have not been able to reproduce that RuntimeError in cell [16]. So it's a bit troubling, but it's not causing problems right now.
I can, however, reproduce the error in cell [20] every time.
The stack.compute() command produces:
distributed.utils - ERROR - ("('from-value-getitem-solve-stack-60c42b5b6cf2fdb7b45148aa6f6a019e', 10, 0)", 'tcp://10.22.79.19:39899')
Traceback (most recent call last):
  File "/home/jovyan/my-conda-envs/IOOS/lib/python3.6/site-packages/distributed/utils.py", line 238, in f
    result[0] = yield make_coro()
  File "/home/jovyan/my-conda-envs/IOOS/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/jovyan/my-conda-envs/IOOS/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/jovyan/my-conda-envs/IOOS/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/jovyan/my-conda-envs/IOOS/lib/python3.6/site-packages/distributed/client.py", line 1364, in _gather
    traceback)
  File "/home/jovyan/my-conda-envs/IOOS/lib/python3.6/site-packages/six.py", line 686, in reraise
    raise value
distributed.scheduler.KilledWorker: ("('from-value-getitem-solve-stack-60c42b5b6cf2fdb7b45148aa6f6a019e', 10, 0)", 'tcp://10.22.79.19:39899')
@ocefpaf confirmed the same problem running on pangeo as well. It is possible to run the notebook on pangeo after installing utide via conda install -c conda-forge utide.
Yeah, I'm afraid that there are too many moving pieces for me to easily tell what the problem is. The KilledWorker exception is telling you that several workers died hard when trying to run that task. That could be for a variety of reasons: segfaults, out-of-memory errors, system libraries not behaving well when imported within a thread, etc. Unfortunately it's somewhat difficult to inspect a worker after it has died and learn what happened.
Typically I tend to hope that the system was able to dump something to logs before it crashed. You might try inspecting the logs through Kubernetes or the cluster.logs/cluster.pods methods to see if anything was dumped out there.
In my experience this sort of thing is usually due to running out of memory.
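If memory is indeed the issue, one common mitigation is to use smaller chunks so each task's working set fits comfortably within a worker's limit. A minimal sketch with a made-up array shaped like the (time, lat, lon) model output (sizes here are illustrative, not taken from the notebook):

```python
import dask.array as da

# Made-up array shaped like the (time, lat, lon) model output; one big chunk.
x = da.random.random((720, 100, 100), chunks=(720, 100, 100))

# Split the spatial dimensions into small tiles while keeping the full time
# axis in each chunk (the tidal analysis needs whole time series), so each
# task's working set stays well under a 6 GB worker memory limit.
y = x.rechunk((720, 10, 10))

print(y.npartitions)   # 100 tiles, each of shape (720, 10, 10)
result = y.mean(axis=0).compute()
print(result.shape)    # (100, 100)
```

Keeping the time axis whole in every chunk matters here, since each grid point's analysis needs its full time series.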
Okay, looking at the logs was a good idea. Go figure. :smiley_cat: doing:
pods = cluster.pods()
print(cluster.logs(pods[0]))
revealed:
no environment.yml
Mounting pangeo-data to /gcs
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
+ '[' '' ']'
+ '[' '' ']'
+ '[' pangeo-data ']'
+ echo 'Mounting pangeo-data to /gcs'
+ /opt/conda/bin/gcsfuse pangeo-data /gcs --background
+ dask-worker --nthreads 2 --no-bokeh --memory-limit 6GB --death-timeout 60
distributed.nanny - INFO - Start Nanny at: 'tcp://10.22.163.5:44611'
distributed.worker - INFO - Start worker at: tcp://10.22.163.5:32992
distributed.worker - INFO - Listening to: tcp://10.22.163.5:32992
distributed.worker - INFO - nanny at: 10.22.163.5:44611
distributed.worker - INFO - Waiting to connect to: tcp://10.22.79.18:45822
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 6.00 GB
...
...
ModuleNotFoundError: No module named 'utide'
distributed.worker - ERROR - No module named 'utide'
So I guess the workers are not seeing my custom kernel (with utide installed) that is running the notebook.
So I then tried just doing conda install utide into the conda [root] environment, and using that kernel instead. I was surprised to get the same problem, however -- the workers still can't find utide.
The workers don't have access to your local environments at all. They are using an entirely separate Docker image, so changing things locally will not affect them. You can ask the workers to install libraries by modifying the worker-template.yaml file:
env:
  - name: GCSFUSE_BUCKET
    value: pangeo-data
  - name: EXTRA_CONDA_PACKAGES
    value: utide -c conda-forge
you can also pass those options to KubeCluster, correct?
Yes, I think so
from dask_kubernetes import KubeCluster
cluster = KubeCluster.from_yaml('worker-template.yaml')
Woohoo, it worked! Thanks for the help @mrocklin and @rabernat !
jovyan@jupyter-rsignell-2dusgs:~$ more ~/worker-template.yaml
metadata:
spec:
  restartPolicy: Never
  containers:
    - args:
        - dask-worker
        - --nthreads
        - '2'
        - --no-bokeh
        - --memory-limit
        - 6GB
        - --death-timeout
        - '60'
      image: pangeo/worker:2018-03-11
      name: dask-worker
      securityContext:
        capabilities:
          add: [SYS_ADMIN]
        privileged: true
      env:
        - name: GCSFUSE_BUCKET
          value: pangeo-data
        - name: EXTRA_CONDA_PACKAGES
          value: utide -c conda-forge
      resources:
        limits:
          cpu: "1.75"
          memory: 6G
        requests:
          cpu: "1.75"
          memory: 6G
I'm running a notebook on pangeo that worked okay on our HPC, but is dying here with a "KilledWorker" error.
The full notebook is here: http://nbviewer.jupyter.org/gist/rsignell-usgs/91ff0dabe41f354c2677a61db49a8909 and the error is in cell [20].
It's quite possible I'm doing something fundamentally wrong here in how I'm trying to use dask. Basically I'm trying to run a tidal analysis at each grid point (or every nth grid point), which means marching through the 3D (time, lat, lon) model output, extracting a time series at each location. I am trying to do the time series analysis in parallel with dask. The last step is just extracting a single coefficient from the results of the tidal analysis, and then putting these back on the (lon, lat) grid so I can plot them up (so I can see a map of the M2 tidal elevation for this model grid).
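For reference, the per-gridpoint pattern can be sketched with dask.delayed; here analyze_point is just a stand-in for the actual per-point utide call (whose real signature takes the time base, latitude, etc., and is not reproduced here):

```python
import numpy as np
import dask

# Stand-in for the real per-point analysis (e.g. utide.solve on one time
# series); here it simply returns the standard deviation of the series.
def analyze_point(series):
    return series.std()

nlat, nlon = 4, 5
data = np.random.random((720, nlat, nlon))  # (time, lat, lon), made-up sizes

# One delayed task per grid point, each receiving the full time series there.
tasks = [dask.delayed(analyze_point)(data[:, j, i])
         for j in range(nlat) for i in range(nlon)]

# Compute in parallel, then put the scalars back on the (lat, lon) grid.
amps = np.array(dask.compute(*tasks)).reshape(nlat, nlon)
print(amps.shape)  # (4, 5)
```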
Since this ran (albeit slowly) on our HPC system, but died here on pangeo, I'd guess I'm not only using dask in a suboptimal way, but also hitting some other problem (the killed worker).