TomAugspurger opened this issue 3 years ago
With debug logs:

```
pangeo_forge_recipes.recipes.xarray_zarr - DEBUG - Acquiring locks ['time-23']
pangeo_forge_recipes.utils - DEBUG - Acquiring lock pangeo-forge-time-23...
pangeo_forge_recipes.utils - DEBUG - Acquired lock pangeo-forge-time-23
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Storing variable time chunk (23,) to Zarr region (slice(460, 480, None),)
pangeo_forge_recipes.utils - DEBUG - Released lock pangeo-forge-time-23
pangeo_forge_recipes.recipes.xarray_zarr - DEBUG - Acquiring locks ['sst-23']
pangeo_forge_recipes.utils - DEBUG - Acquiring lock pangeo-forge-sst-23...
```
So AFAICT, the lock `sst-23` was released prior to that worker trying to acquire it.
~Debugging, the lock is in fact available:~ (this is wrong: see https://github.com/pangeo-forge/pangeo-forge-azure-bakery/issues/10#issuecomment-858639754)
```
# shell 1
$ kubectl port-forward -n pangeo-forge dask-root-b4bde08a-3rdfkz 8786

# shell 2
$ python
Python 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import distributed
>>> client = distributed.Client("tcp://127.0.0.1:8786")
>>> lock = distributed.Lock("sst-23")
>>> lock.acquire(timeout=5)
True
>>> lock.release()
>>>
```
I'd love to see what's going on on that worker... I think kubernetes lets you ssh into pods.
@TomAugspurger As a note, all of the tests in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/151 were run using `pangeo/pangeo-notebook:2021.05.04` as a base image with the following modifications:

```
xarray==0.18.0
pangeo-forge-recipes==0.3.4
prefect[aws, azure, google, kubernetes]==0.14.7
```
We did not experience any hanging worker issues beyond those outlined in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/144, so I'm unsure if this is Azure Blob Storage related or perhaps due to a change on the pangeo-forge-recipes `main` branch. Do you have a branch which shows the mapper implementation you were using?
@ciaranevans Can we investigate how to expose the external network interface for the worker pods and the key we'll need to set so Tom can SSH into a hung worker pod?
I maybe got it with `kubectl exec -it -n pangeo-forge dask-root-b4bde08a-32rpqk -- /bin/bash`. The terminal was hanging though.
I did another run and this time the lock actually seemed to be claimed. I'm seeing if we can get a bit better information into distributed's locks, to figure out who has what.
@TomAugspurger Had that pod died by the time you tried to connect?
It's not a case of Prefect/K8s killing it as soon as it errors?
Nope, it was still alive. The only pods I've seen killed are (I think) due to https://github.com/pangeo-forge/pangeo-forge-azure-bakery/issues/11.
Hmm okay, I'm unsure how kubectl and getting onto a running container works, tbh. If it can't get a connection, does it hang or will it error?
> I did another run and this time the lock actually seemed to be claimed. I'm seeing if we can get a bit better information into distributed's locks, to figure out who has what.
I believe I was mistaken when I initially said that the lock was actually able to be acquired in https://github.com/pangeo-forge/pangeo-forge-azure-bakery/issues/10#issuecomment-858119458. pangeo-forge prepends `pangeo-forge` to the names of the locks it acquires, so I should have tried to acquire `pangeo-forge-sst-23` rather than just `sst-23`.
My guess right now is that some other worker failed to release that lock. I'll try to confirm that.
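One way to check that: repeat the session above with the prefixed name, and peek at the scheduler's lock extension to see which locks it thinks are held. This is only a sketch; `extensions["locks"]` and its `.ids` mapping are internal details of distributed and may differ between versions.

```python
import distributed

client = distributed.Client("tcp://127.0.0.1:8786")  # port-forwarded scheduler

# Use the name pangeo-forge actually registers, not the bare chunk key.
lock = distributed.Lock("pangeo-forge-sst-23", client=client)
if lock.acquire(timeout=5):
    print("pangeo-forge-sst-23 is free")
    lock.release()
else:
    print("pangeo-forge-sst-23 is held (or was never released)")


def held_locks(dask_scheduler):
    # Internal API: the lock extension keeps a mapping of lock name -> id of
    # the current holder.
    return dict(dask_scheduler.extensions["locks"].ids)


print(client.run_on_scheduler(held_locks))
```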
I just did a run with no locks (by commenting out the locking in https://github.com/TomAugspurger/pangeo-forge/commit/f22e73c95ae6ddbfe52bbaa89ac37732b364dea5) and it completed successfully. I think this recipe doesn't have any conflicting locks, but I need to confirm that. So a few things:

1. The `LockExtension` object could be updated to use the `client.id` of the client that acquires the lock, and lock acquisitions / releases should be logged to the `/events` dashboard (a rough sketch of what that could look like is below).
~2. I didn't observe any memory issues. This might be because we're using the latest versions of adlfs, dask, distributed, fsspec, and pangeo-forge-recipes. Or it might be because we're using adlfs rather than s3fs. @sharkinsspatial can you confirm: did you ever see memory issues with the Azure bakery (I think not, since you were hitting the adlfs / fsspec caching issue)?~ Sorry, I forgot I made the `date_range` smaller for debugging. I'll try with the full range now.

Locking should only be happening if it is actually needed for concurrent writes to the same chunk.
AFAIK, none of our usual test cases need any locking.
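Regarding the first item in the list above (tracking who holds which lock), here is a rough sketch of what that bookkeeping could look like from the client side. It assumes distributed's `Client.log_event` is available; the `LoggedLock` wrapper is hypothetical, not something pangeo-forge or distributed provides:

```python
import distributed


class LoggedLock:
    """Hypothetical wrapper: a distributed.Lock whose acquire/release show up
    on the scheduler's events page, tagged with the acquiring client's id."""

    def __init__(self, name, client=None):
        self.client = client or distributed.get_client()
        self.name = name
        self._lock = distributed.Lock(name, client=self.client)

    def acquire(self, **kwargs):
        self.client.log_event("locks", f"{self.client.id} acquiring {self.name}")
        acquired = self._lock.acquire(**kwargs)
        self.client.log_event("locks", f"{self.client.id} acquire {self.name} -> {acquired}")
        return acquired

    def release(self):
        self._lock.release()
        self.client.log_event("locks", f"{self.client.id} released {self.name}")
```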
@TomAugspurger If you would like a simpler example to debug worker memory leaks you can use https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/dask_memory_flow/flow_test/dask_memory_flow.py so that we can isolate this issue from any interactions with other dependencies.
FYI @sharkinsspatial I tried that dask_memory_flow.py flow and did not see any issues. Notes below
- The scheduler pod is currently being killed for OOM (0d751e59ab38c6fc8906ec15e7e78d5fc76892dc, https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/50d52570-449c-4083-9fd3-cbc7efa0ea91).
- The `dask_memory_flow`, which just maps a bunch of sleeps, looks pretty fine (https://cloud.prefect.io/flow-run/aab25c7f-093d-498a-9266-1d5f5560a0a4 / https://github.com/pangeo-forge/pangeo-forge-azure-bakery/commit/b20c361ea6a7dfca1d1c1e762ff314177edd09eb).
- Memory use on the scheduler is essentially flat: small jumps (2-3 MB) when a group of workers is added, and a small increase of about 0.08 MB per "tick". I wonder if that stabilizes.
- Memory use climbed to ~380-385 MiB by the end.
So the summary is that Prefect seems to be doing OK with that number of tasks. I followed that up with a test that subclassed `XarrayToZarr` and overrode `prepare_target` and `store_chunk` to just `time.sleep`. The goal was to verify that it's the pangeo-forge recipe object causing issues with scheduler memory, rather than the actual execution itself. With that setup, the scheduler memory jumped to 4.5 GiB within 25s, before being OOM-killed by kubernetes.
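For reference, the sleep-only recipe was roughly like the following. This is a sketch rather than the exact code, and it assumes the class in question is `pangeo_forge_recipes.recipes.XarrayZarrRecipe` and the 0.3/0.4-era API where `prepare_target` / `store_chunk` are properties returning closures:

```python
import time

from pangeo_forge_recipes.recipes import XarrayZarrRecipe


class SleepingRecipe(XarrayZarrRecipe):
    """Same task-graph shape as the real recipe, but no actual I/O."""

    @property
    def prepare_target(self):
        def _prepare():
            time.sleep(1)  # stand-in for creating the target Zarr store
        return _prepare

    @property
    def store_chunk(self):
        def _store(chunk_key):
            time.sleep(1)  # stand-in for writing one chunk
        return _store
```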
@TomAugspurger 👍 Thanks for the investigation. Can you report the `distributed`, `dask`, and `prefect` versions on the worker image used for your `dask_memory_flow` test? I'll run another AWS test using this version combination so that we can verify if this worker memory issue is specific to the dask-cloudprovider `ECSCluster`.
Based on your subclass experiment, it appears that recipe serialization may be the culprit in our scheduler memory woes. What are your thoughts on next steps? If you want to continue to coordinate with @rabernat on testing and potential solutions for this, I'll work with the Prefect team to try and diagnose our AWS worker memory issues this week.
I'll verify, but this was using commit b20c361ea6a7dfca1d1c1e762ff314177edd09eb, so dask versions are at https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/b20c361ea6a7dfca1d1c1e762ff314177edd09eb/images/requirements.txt#L46-L48 (2021.6.0) and prefect is at https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/b20c361ea6a7dfca1d1c1e762ff314177edd09eb/images/requirements.txt#L93 (0.14.19)
> What are your thoughts on next steps?

I think the two next options are one or both of
I started work on the two next steps mentioned in https://github.com/pangeo-forge/pangeo-forge-azure-bakery/issues/10#issuecomment-861605331 on this branch: https://github.com/TomAugspurger/pangeo-forge/tree/refactor.
That builds on https://github.com/pangeo-forge/pangeo-forge-recipes/pull/153, adding a `to_dask()` and `to_prefect()`. It simplifies the objects that end up in a Prefect Task / task graph by essentially removing `self` from all the functions and instead explicitly passing the arguments around.
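Schematically, the change looks something like this (the names here are illustrative, not the actual functions in that branch):

```python
# Before: each task holds a bound method, so the full recipe instance
# (file pattern, storage config, caches, ...) is serialized into every task.
def make_tasks_before(recipe, chunk_keys):
    return {f"store-{key}": (recipe.store_chunk, key) for key in chunk_keys}


# After: plain functions receive only the small pieces they need, so the
# per-task payload is a chunk key plus a little configuration.
def store_chunk(chunk_key, target_url, open_kwargs):
    ...  # open the inputs for chunk_key and write them to target_url


def make_tasks_after(chunk_keys, target_url, open_kwargs):
    return {
        f"store-{key}": (store_chunk, key, target_url, open_kwargs)
        for key in chunk_keys
    }
```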
I'm doing a run right now (using the XarrayToZarr that just sleeps instead of writing). It's halfway through the `store_chunk` stage, and memory use on the scheduler is steady at 1.9 GB. Since it's running well, I think we're able to confirm that the `XarrayToZarr` objects were the ones causing the serialization issues.
@sharkinsspatial I'm picking up this debugging a bit, to verify what fixes the memory issue. My plan is to run a recipe that has the fixed `FilePattern`, avoiding the `filepattern_from_sequence` helper, with an `input_cache` that just sleeps (doesn't actually cache):
```python
import time
from pangeo_forge_recipes.storage import CacheFSSpecTarget

class SleepingInputCache(CacheFSSpecTarget):
    def cache_file(self, fname, **fsspec_open_kwargs):
        time.sleep(1)  # pretend to cache the input without touching storage
        return
```
Then I'll run that with three versions of pangeo-forge-recipes:
I'll post updates here.
OK, here are the results. For each of these I built a docker image and submitted & ran the flow.
| Test | Outcome | Prefect Link (probably not public) | Git Commit |
|---|---|---|---|
| pangeo-forge-recipes master | OOM-killed scheduler at ~3 GB after 60s | https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/36ab5e8a-7605-49a3-bf85-45c58b7e7374 | 11d605ea0c79b209bb5ba03f6ef6209ee505fb0b |
| Ryan's PR | OOM-killed scheduler at ~3 GB after 60s | https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/242b7ad7-8932-4172-ac53-a567b605935d | 11d605ea0c79b209bb5ba03f6ef6209ee505fb0b |
| Tom's PR | Stable after 15 minutes / ~20k tasks | https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/cff125b1-2318-4fc6-b797-81129b0eb441 | a0576f43e244040518418a3d1afe5e5515b19f78 |
So tl;dr, https://github.com/pangeo-forge/pangeo-forge-recipes/pull/160 fixes the memory issues on the scheduler, and seems to be necessary.
A small note, workers are building up a bunch of unmanaged memory. That surprises me, since we're just sleeping. This might need more investigation down the road.
Thanks so much for doing this forensic work Tom! We will go with pangeo-forge/pangeo-forge-recipes#160.
Thanks @TomAugspurger. It appears that your PR will solve the scheduler memory growth issues associated with serialization 🎊. As you noted above, we are still seeing incremental memory growth on workers (even without actual activity), as originally noted here. This is problematic for several of our recipes, as the worker memory growth over a large number of task executions will result in eventual worker OOM failures (which we were seeing in our initial OISST testing). I'll continue tracking this here and touch base with the Prefect team again to see if they have made any progress on their investigations.
Hi All,
Hopefully the above should be addressed in #21 when it is reviewed and merged.
@sharkinsspatial dumping some notes below. Let me know if you want to jump on a call to discuss. I'm currently seeing workers hang as they try to acquire a lock in pangeo-forge. I'll see what I can do to debug.
Summarizing the issues we're seeing with pangeo-forge.
** Changes to environment
Updated to latest released Dask, distributed, fsspec, adlfs. Installed pangeo-forge-recipes from GitHub.
** Cloud build
If you want to build the images on Azure, avoid upload. Not a huge benefit, if we have to download them to submit.
** Access Dask Dashboard
One of the `dask-root` pods prefect starts is the scheduler pod.

** Add logs to worker handler
This makes the logs accessible from the Dask UI. I'm sure there's a better way to do this.
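For the record, the hack looks roughly like this. It assumes that the records shown on the dashboard's worker "Logs" page come from the handler(s) distributed attaches to its own `distributed` logger, which is an internal detail:

```python
import logging


def forward_pangeo_forge_logs():
    # Reuse distributed's own log handlers for the pangeo_forge_recipes logger,
    # so its DEBUG/INFO records show up on the worker "Logs" page.
    pf_logger = logging.getLogger("pangeo_forge_recipes")
    pf_logger.setLevel(logging.DEBUG)
    for handler in logging.getLogger("distributed").handlers:
        pf_logger.addHandler(handler)


# Run on all current workers, and register for any future ones:
# client.run(forward_pangeo_forge_logs)
# client.register_worker_callbacks(forward_pangeo_forge_logs)
```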
** Hanging Workers
I'm seeing some hanging, but perhaps different from what others saw. The worker logs say
Looking at the call stack
So it seems like the issue is in `lock_for_conflicts`. Either a real deadlock, or an event loop issue.
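(For anyone else debugging this: distributed can report per-task call stacks from the client, which is an easy way to see a task stuck in `lock_for_conflicts`. A minimal sketch, assuming the same port-forwarded scheduler as above:)

```python
from pprint import pprint

import distributed

client = distributed.Client("tcp://127.0.0.1:8786")  # the port-forwarded scheduler

# With no arguments this asks every worker for the call stack of whatever
# tasks it is currently running.
pprint(client.call_stack())
```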