pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Dask problem with KilledWorker: ("('from-value-getitem-...", 'tcp://10.22.53.6:39166') #169

Closed. rsignell-usgs closed this issue 6 years ago.

rsignell-usgs commented 6 years ago

I'm running a notebook on pangeo that worked okay on our HPC, but is dying here with a "KilledWorker" error.

The full notebook is here: http://nbviewer.jupyter.org/gist/rsignell-usgs/91ff0dabe41f354c2677a61db49a8909 and the error is in cell [20].

It's quite possible I'm doing something fundamentally wrong here in how I'm trying to use dask. Basically I'm trying to run a tidal analysis at each grid point (or every nth gridpoint) which means marching through the 3D (time, lat, lon) model output, extracting time series at each location. I am trying to do the time series analysis in parallel with dask.

The last step is just extracting a single coefficient from the results of the tidal analysis, and then putting these back on the lon,lat grid so I can plot them up (so I can see a map of the M2 tidal elevation for this model grid).
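
For reference, the general pattern is something like this (a minimal synthetic sketch, not the notebook's actual code; the shapes, variable names, and the utide coef fields A and name here are just illustrative):

import numpy as np
import dask
from utide import solve

# Hypothetical stand-ins for the model output used in the notebook.
nt, ny, nx = 720, 4, 5                      # (time, lat, lon)
t = 736000.0 + np.arange(nt) / 24.0         # hourly times as matplotlib datenums
lat = np.linspace(30.0, 33.0, ny)
zeta = np.random.rand(nt, ny, nx)           # sea-surface elevation

def m2_amp(t, h, lat):
    # Tidal harmonic analysis of one elevation time series; pull out the M2
    # amplitude (assumes utide's returned coef exposes 'A' and 'name').
    coef = solve(t, h, lat=lat, conf_int='none')
    return float(coef.A[coef.name == 'M2'][0])

# One delayed task per grid point, computed in parallel on the dask cluster,
# then reshaped back onto the lat/lon grid for plotting.
tasks = [dask.delayed(m2_amp)(t, zeta[:, j, i], lat[j])
         for j in range(ny) for i in range(nx)]
m2 = np.array(dask.compute(*tasks)).reshape(ny, nx)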

Since this ran (albeit slowly) on our HPC system but died here on pangeo, I'd guess I'm not only using dask in a suboptimal way but also hitting some other problem (hence the killed worker).

mrocklin commented 6 years ago

Did anything look interesting when looking at the dashboard? How was memory use during the computation? Did the upper left plot (memory use per worker) ever move away from blue?

rsignell-usgs commented 6 years ago

@mrocklin while running cell [16] the Bytes stored number keeps growing, and after completing cell [16] the status looks like this: [dashboard screenshot 2018-03-19_17-10-12]

(incidentally, the first time I reran cell [16], I got this error:

  File "/opt/conda/lib/python3.6/site-packages/zarr/core.py", line 1730, in _decode_chunk
    chunk = self._compressor.decode(cdata)
  File "numcodecs/blosc.pyx", line 483, in numcodecs.blosc.Blosc.decode
  File "numcodecs/blosc.pyx", line 394, in numcodecs.blosc.decompress
RuntimeError: error during blosc decompression: -1

but I just reran it again and didn't get the error)

When I run cell [20] it just dies immediately with nothing interesting on the dashboard.

I think folks should be able to run this notebook on pangeo without changes if they are interested -- they would just have to install the utide tidal analysis package: conda install -c conda-forge utide

Here's the notebook URL again: http://nbviewer.jupyter.org/gist/rsignell-usgs/91ff0dabe41f354c2677a61db49a8909

mrocklin commented 6 years ago

That RuntimeError looks important. It might be worth trying to track it down.

rsignell-usgs commented 6 years ago

@mrocklin, I have not been able to reproduce that RuntimeError in cell [16], so it's a bit troubling, but it's not causing problems right now.

I can, however, reproduce the error in cell [20] every time.

The stack.compute() command produces:

distributed.utils - ERROR - ("('from-value-getitem-solve-stack-60c42b5b6cf2fdb7b45148aa6f6a019e', 10, 0)", 'tcp://10.22.79.19:39899')
Traceback (most recent call last):
  File "/home/jovyan/my-conda-envs/IOOS/lib/python3.6/site-packages/distributed/utils.py", line 238, in f
    result[0] = yield make_coro()
  File "/home/jovyan/my-conda-envs/IOOS/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/jovyan/my-conda-envs/IOOS/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/jovyan/my-conda-envs/IOOS/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/jovyan/my-conda-envs/IOOS/lib/python3.6/site-packages/distributed/client.py", line 1364, in _gather
    traceback)
  File "/home/jovyan/my-conda-envs/IOOS/lib/python3.6/site-packages/six.py", line 686, in reraise
    raise value
distributed.scheduler.KilledWorker: ("('from-value-getitem-solve-stack-60c42b5b6cf2fdb7b45148aa6f6a019e', 10, 0)", 'tcp://10.22.79.19:39899')

@ocefpaf confirmed the same problem running on pangeo also. It is possible to run the notebook on pangeo after installing utide via conda install -c conda-forge utide.

mrocklin commented 6 years ago

Yeah, I'm afraid that there are too many moving pieces for me to easily tell what the problem is. The KilledWorker exception is telling you that several workers died hard when trying to run that task. That could be for a variety of reasons: segfaults, out-of-memory errors, system libraries not behaving well when imported within a thread, etc. Unfortunately it's somewhat difficult to inspect a worker after it has died and learn what happened.

Typically I tend to hope that the system was able to dump something to logs before it crashed. You might try inspecting the logs through Kubernetes or the cluster.logs/cluster.pods methods to see if anything was dumped out there.
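
For example, something along these lines (assuming cluster is the KubeCluster created earlier in the notebook, and that cluster.pods() returns the underlying Kubernetes pod objects):

# Print the log output from every worker pod in the cluster.
for pod in cluster.pods():
    print(pod.metadata.name)
    print(cluster.logs(pod))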

rabernat commented 6 years ago

In my experience this sort of thing is usually due to running out of memory.

rsignell-usgs commented 6 years ago

Okay, looking at the logs was a good idea. Go figure. :smiley_cat: Running:

pods = cluster.pods()          # list the Kubernetes pods backing the dask workers
print(cluster.logs(pods[0]))   # print the log from the first worker pod

revealed:

no environment.yml
Mounting pangeo-data to /gcs
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
+ '[' '' ']'
+ '[' '' ']'
+ '[' pangeo-data ']'
+ echo 'Mounting pangeo-data to /gcs'
+ /opt/conda/bin/gcsfuse pangeo-data /gcs --background
+ dask-worker --nthreads 2 --no-bokeh --memory-limit 6GB --death-timeout 60
distributed.nanny - INFO -         Start Nanny at: 'tcp://10.22.163.5:44611'
distributed.worker - INFO -       Start worker at:    tcp://10.22.163.5:32992
distributed.worker - INFO -          Listening to:    tcp://10.22.163.5:32992
distributed.worker - INFO -              nanny at:          10.22.163.5:44611
distributed.worker - INFO - Waiting to connect to:    tcp://10.22.79.18:45822
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                    6.00 GB
...
...
ModuleNotFoundError: No module named 'utide'
distributed.worker - ERROR - No module named 'utide'

So I guess the workers are not seeing my custom kernel (with utide installed) that is running the notebook.

I then tried doing conda install utide into the conda [root] environment and using that kernel instead, but I was surprised to hit the same problem: the workers still can't find utide.

mrocklin commented 6 years ago

The workers don't have access to your local environments at all. They are using a separate docker image entirely, so changing things locally will not affect them. You can ask the workers to install libraries by modifying the worker-template.yaml file, for example:

    env:
      - name: GCSFUSE_BUCKET
        value: pangeo-data
      - name: EXTRA_CONDA_PACKAGES
        value: utide -c conda-forge
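
Judging from the worker startup log above, the worker image's entry script conda-installs whatever EXTRA_CONDA_PACKAGES contains before launching dask-worker, so newly started workers should come up with utide already installed.
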
rabernat commented 6 years ago

You can also pass those options to KubeCluster, correct?

mrocklin commented 6 years ago

Yes, I think so
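
Something along these lines should work (an untested sketch; I'm assuming the dask_kubernetes version deployed on pangeo already exposes make_pod_spec with an env argument):

from dask_kubernetes import KubeCluster, make_pod_spec

# Build the worker pod spec in code rather than editing worker-template.yaml.
pod_spec = make_pod_spec(image='pangeo/worker:2018-03-11',
                         env={'GCSFUSE_BUCKET': 'pangeo-data',
                              'EXTRA_CONDA_PACKAGES': 'utide -c conda-forge'})
cluster = KubeCluster(pod_spec)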

rsignell-usgs commented 6 years ago

from dask_kubernetes import KubeCluster
cluster = KubeCluster.from_yaml('worker-template.yaml')

rsignell-usgs commented 6 years ago

Woohoo, it worked! Thanks for the help @mrocklin and @rabernat !

jovyan@jupyter-rsignell-2dusgs:~$ more ~/worker-template.yaml
metadata:
spec:
  restartPolicy: Never
  containers:
  - args:
      - dask-worker
      - --nthreads
      - '2'
      - --no-bokeh
      - --memory-limit
      - 6GB
      - --death-timeout
      - '60'
    image: pangeo/worker:2018-03-11
    name: dask-worker
    securityContext:
      capabilities:
        add: [SYS_ADMIN]
      privileged: true
    env:
      - name: GCSFUSE_BUCKET
        value: pangeo-data
      - name: EXTRA_CONDA_PACKAGES
        value: utide -c conda-forge
    resources:
      limits:
        cpu: "1.75"
        memory: 6G
      requests:
        cpu: "1.75"
        memory: 6G
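
A quick sanity check that the new workers actually see the package (assuming utide exposes __version__):

from dask.distributed import Client
client = Client(cluster)

def check_utide():
    import utide
    return utide.__version__

# Runs the import on every connected worker; any worker still missing utide
# would raise ModuleNotFoundError here.
print(client.run(check_utide))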