TomAugspurger opened this issue 3 years ago
With debug logs:

```
pangeo_forge_recipes.recipes.xarray_zarr - DEBUG - Acquiring locks ['time-23']
pangeo_forge_recipes.utils - DEBUG - Acquiring lock pangeo-forge-time-23...
pangeo_forge_recipes.utils - DEBUG - Acquired lock pangeo-forge-time-23
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Storing variable time chunk (23,) to Zarr region (slice(460, 480, None),)
pangeo_forge_recipes.utils - DEBUG - Released lock pangeo-forge-time-23
pangeo_forge_recipes.recipes.xarray_zarr - DEBUG - Acquiring locks ['sst-23']
pangeo_forge_recipes.utils - DEBUG - Acquiring lock pangeo-forge-sst-23...
```
So AFAICT, the lock `sst-23` was released prior to that worker trying to acquire it.
~Debugging, the lock is in fact available:~ (this is wrong: see https://github.com/pangeo-forge/pangeo-forge-azure-bakery/issues/10#issuecomment-858639754)
```
# shell 1
$ kubectl port-forward -n pangeo-forge dask-root-b4bde08a-3rdfkz 8786

# shell 2
$ python
Python 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import distributed
>>> client = distributed.Client("tcp://127.0.0.1:8786")
>>> lock = distributed.Lock("sst-23")
>>> lock.acquire(timeout=5)
True
>>> lock.release()
>>>
```
I'd love to see what's going on on that worker... I think kubernetes lets you ssh into pods.
@TomAugspurger As a note, all of the tests in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/151 were run using `pangeo/pangeo-notebook:2021.05.04` as a base image with the following modifications:

```
xarray==0.18.0
pangeo-forge-recipes==0.3.4
prefect[aws, azure, google, kubernetes]==0.14.7
```
We did not experience any hanging worker issues beyond those outlined in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/144, so I'm unsure if this is Azure Blob Storage related or perhaps due to a change on the pangeo-forge-recipes `main` branch. Do you have a branch which shows the mapper implementation you were using?
@ciaranevans Can we investigate how to expose the external network interface for the worker pods and the key we'll need to set so Tom can SSH into a hung worker pod?
I maybe got it with `kubectl exec -it -n pangeo-forge dask-root-b4bde08a-32rpqk -- /bin/bash`. The terminal was hanging though.
I did another run and this time the lock actually seemed to be claimed. I'm seeing if we can get a bit better information into distributed's locks, to figure out who has what.
@TomAugspurger Had that pod died by the time you tried to connect?
It's not a case of Prefect/K8s killing it as soon as it errors?
Nope, it was still alive. The only pods I've seen killed are (I think) due to https://github.com/pangeo-forge/pangeo-forge-azure-bakery/issues/11.
Hmm okay, I'm unsure how kubectl and getting onto a running container works, tbh. If it can't get a connection, does it hang or will it error?
> I did another run and this time the lock actually seemed to be claimed. I'm seeing if we can get a bit better information into distributed's locks, to figure out who has what.
I believe I was mistaken when I initially said that the lock was actually able to be acquired in https://github.com/pangeo-forge/pangeo-forge-azure-bakery/issues/10#issuecomment-858119458. pangeo-forge prepends `pangeo-forge` to the names of the locks it acquires, so I should have tried to acquire `pangeo-forge-sst-23` rather than just `sst-23`.
My guess right now is that some other worker failed to release that lock. I'll try to confirm that.
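One way to check that: repeat the session above with the prefixed name, and peek at the scheduler's lock extension to see which locks it thinks are held. This is only a sketch; `extensions["locks"]` and its `.ids` mapping are internal details of distributed and may differ between versions.

```python
import distributed

client = distributed.Client("tcp://127.0.0.1:8786")  # port-forwarded scheduler

# Use the name pangeo-forge actually registers, not the bare chunk key.
lock = distributed.Lock("pangeo-forge-sst-23", client=client)
if lock.acquire(timeout=5):
    print("pangeo-forge-sst-23 is free")
    lock.release()
else:
    print("pangeo-forge-sst-23 is held (or was never released)")


def held_locks(dask_scheduler):
    # Internal API: the lock extension keeps a mapping of lock name -> id of
    # the current holder.
    return dict(dask_scheduler.extensions["locks"].ids)


print(client.run_on_scheduler(held_locks))
```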
I just did a run with no locks (by commenting out the locking in https://github.com/TomAugspurger/pangeo-forge/commit/f22e73c95ae6ddbfe52bbaa89ac37732b364dea5) and it completed successfully. I think this recipe doesn't have any conflicting locks, but I need to confirm that. So a few things:

1. The `LockExtension` object could be updated to use the `client.id` of the client that acquires the lock, and lock acquisitions / releases should be logged to the `/events` dashboard (a rough sketch of what that could look like is below).
~2. I didn't observe any memory issues. This might be because we're using the latest versions of adlfs, dask, distributed, fsspec, and pangeo-forge-recipes. Or it might be because we're using adlfs rather than s3fs. @sharkinsspatial can you confirm: did you ever see memory issues with the Azure bakery (I think not, since you were hitting the adlfs / fsspec caching issue)?~ Sorry, I forgot I made the `date_range` smaller for debugging. I'll try with the full range now.

Locking should only be happening if it is actually needed for concurrent writes to the same chunk.
AFAIK, none of our usual test cases need any locking.
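Regarding the first item in the list above (tracking who holds which lock), here is a rough sketch of what that bookkeeping could look like from the client side. It assumes distributed's `Client.log_event` is available; the `LoggedLock` wrapper is hypothetical, not something pangeo-forge or distributed provides:

```python
import distributed


class LoggedLock:
    """Hypothetical wrapper: a distributed.Lock whose acquire/release show up
    on the scheduler's events page, tagged with the acquiring client's id."""

    def __init__(self, name, client=None):
        self.client = client or distributed.get_client()
        self.name = name
        self._lock = distributed.Lock(name, client=self.client)

    def acquire(self, **kwargs):
        self.client.log_event("locks", f"{self.client.id} acquiring {self.name}")
        acquired = self._lock.acquire(**kwargs)
        self.client.log_event("locks", f"{self.client.id} acquire {self.name} -> {acquired}")
        return acquired

    def release(self):
        self._lock.release()
        self.client.log_event("locks", f"{self.client.id} released {self.name}")
```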
@TomAugspurger If you would like a simpler example to debug worker memory leaks you can use https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/dask_memory_flow/flow_test/dask_memory_flow.py so that we can isolate this issue from any interactions with other dependencies.
FYI @sharkinsspatial I tried that dask_memory_flow.py flow and did not see any issues. Notes below
- The scheduler pod is currently being killed for OOM (0d751e59ab38c6fc8906ec15e7e78d5fc76892dc, https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/50d52570-449c-4083-9fd3-cbc7efa0ea91).
- The `dask_memory_flow`, which just maps a bunch of sleeps, looks pretty fine (https://cloud.prefect.io/flow-run/aab25c7f-093d-498a-9266-1d5f5560a0a4 / https://github.com/pangeo-forge/pangeo-forge-azure-bakery/commit/b20c361ea6a7dfca1d1c1e762ff314177edd09eb).
- Memory use on the scheduler is essentially flat: small jumps (2-3 MB) when a group of workers is added, and a small increase of about 0.08 MB per "tick". I wonder if that stabilizes.
- Memory use climbed to ~380-385 MiB by the end.
So the summary is that Prefect seems to be doing OK with that number of tasks. I followed that up with a test that subclassed `XarrayToZarr` and overrode `prepare_target` and `store_chunk` to just `time.sleep`. The goal was to verify that it's the pangeo-forge recipe object causing issues with scheduler memory, rather than the actual execution itself. With that setup, the scheduler memory jumped to 4.5 GiB within 25s, before being OOM-killed by kubernetes.
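For reference, the sleep-only recipe was roughly like the following. This is a sketch rather than the exact code, and it assumes the class in question is `pangeo_forge_recipes.recipes.XarrayZarrRecipe` and the 0.3/0.4-era API where `prepare_target` / `store_chunk` are properties returning closures:

```python
import time

from pangeo_forge_recipes.recipes import XarrayZarrRecipe


class SleepingRecipe(XarrayZarrRecipe):
    """Same task-graph shape as the real recipe, but no actual I/O."""

    @property
    def prepare_target(self):
        def _prepare():
            time.sleep(1)  # stand-in for creating the target Zarr store
        return _prepare

    @property
    def store_chunk(self):
        def _store(chunk_key):
            time.sleep(1)  # stand-in for writing one chunk
        return _store
```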
@TomAugspurger 👍 Thanks for the investigation. Can you report the `distributed`, `dask`, and `prefect` versions on the worker image used for your `dask_memory_flow` test? I'll run another AWS test using this version combination so that we can verify if this worker memory issue is specific to the dask-cloudprovider `ECSCluster`.
Based on your subclass experiment, it appears that recipe serialization may be the culprit in our scheduler memory woes. What are your thoughts on next steps? If you want to continue to coordinate with @rabernat on testing and potential solutions for this, I'll work with the Prefect team to try and diagnose our AWS worker memory issues this week.
I'll verify, but this was using commit b20c361ea6a7dfca1d1c1e762ff314177edd09eb, so dask versions are at https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/b20c361ea6a7dfca1d1c1e762ff314177edd09eb/images/requirements.txt#L46-L48 (2021.6.0) and prefect is at https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/b20c361ea6a7dfca1d1c1e762ff314177edd09eb/images/requirements.txt#L93 (0.14.19)
> What are your thoughts on next steps?

I think the two next options are one or both of
I started work on the two next steps mentioned in https://github.com/pangeo-forge/pangeo-forge-azure-bakery/issues/10#issuecomment-861605331 on this branch: https://github.com/TomAugspurger/pangeo-forge/tree/refactor.
That builds on https://github.com/pangeo-forge/pangeo-forge-recipes/pull/153, adding a `to_dask()` and `to_prefect()`. It simplifies the objects that end up in a Prefect Task / task graph by essentially removing `self` from all the functions and instead explicitly passing the arguments around.
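Schematically, the change looks something like this (the names here are illustrative, not the actual functions in that branch):

```python
# Before: each task holds a bound method, so the full recipe instance
# (file pattern, storage config, caches, ...) is serialized into every task.
def make_tasks_before(recipe, chunk_keys):
    return {f"store-{key}": (recipe.store_chunk, key) for key in chunk_keys}


# After: plain functions receive only the small pieces they need, so the
# per-task payload is a chunk key plus a little configuration.
def store_chunk(chunk_key, target_url, open_kwargs):
    ...  # open the inputs for chunk_key and write them to target_url


def make_tasks_after(chunk_keys, target_url, open_kwargs):
    return {
        f"store-{key}": (store_chunk, key, target_url, open_kwargs)
        for key in chunk_keys
    }
```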
I'm doing a run right now (using the XarrayToZarr that just sleeps instead of writing). It's halfway through the `store_chunk` stage, and memory use on the scheduler is steady at 1.9 GB. Since it's running well, I think we're able to confirm that the `XarrayToZarr` objects were the ones causing the serialization issues.
@sharkinsspatial I'm picking up this debugging a bit, to verify what fixes the memory issue. My plan is to run a recipe that has the fixed `FilePattern`, avoiding the `filepattern_from_sequence` helper, with an `input_cache` that just sleeps (doesn't actually cache):
```python
import time
from pangeo_forge_recipes.storage import CacheFSSpecTarget

class SleepingInputCache(CacheFSSpecTarget):
    def cache_file(self, fname, **fsspec_open_kwargs):
        time.sleep(1)  # pretend to cache the input without touching storage
        return
```
Then I'll run that with three versions of pangeo-forge-recipes:
I'll post updates here.
OK, here are the results. For each of these I built a docker image and submitted & ran the flow.
| Test | Outcome | Prefect Link (probably not public) | Git Commit |
|---|---|---|---|
| pangeo-forge-recipes master | OOM-killed scheduler at ~3 GB after 60s | https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/36ab5e8a-7605-49a3-bf85-45c58b7e7374 | 11d605ea0c79b209bb5ba03f6ef6209ee505fb0b |
| Ryan's PR | OOM-killed scheduler at ~3 GB after 60s | https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/242b7ad7-8932-4172-ac53-a567b605935d | 11d605ea0c79b209bb5ba03f6ef6209ee505fb0b |
| Tom's PR | Stable after 15 minutes / ~20k tasks | https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/cff125b1-2318-4fc6-b797-81129b0eb441 | a0576f43e244040518418a3d1afe5e5515b19f78 |
So tl;dr, https://github.com/pangeo-forge/pangeo-forge-recipes/pull/160 fixes the memory issues on the scheduler, and seems to be necessary.
A small note, workers are building up a bunch of unmanaged memory. That surprises me, since we're just sleeping. This might need more investigation down the road.
Thanks so much for doing this forensic work Tom! We will go with pangeo-forge/pangeo-forge-recipes#160.
Thanks @TomAugspurger. It appears that your PR will solve the scheduler memory growth issues associated with serialization 🎊. As you noted above, we are still seeing incremental memory growth on workers (even without actual activity), as originally noted here. This is problematic for several of our recipes, as the worker memory growth over a large number of task executions will result in eventual worker OOM failures (which we were seeing in our initial OISST testing). I'll continue tracking this here and touch base with the Prefect team again to see if they have made any progress on their investigations.
Hi All,
Hopefully the above should be addressed in #21 when it is reviewed and merged.
@sharkinsspatial dumping some notes below. Let me know if you want to jump on a call to discuss. I'm currently seeing workers hang as they try to acquire a lock in pangeo-forge. I'll see what I can do to debug.
Summarizing the issues we're seeing with pangeo-forge.
** Changes to environment
Updated to latest released Dask, distributed, fsspec, adlfs. Installed pangeo-forge-recipes from GitHub.
** Cloud build
If you want to build the images on Azure, avoid upload. Not a huge benefit, if we have to download them to submit.
** Access Dask Dashboard
One of the `dask-root` pods prefect starts is the scheduler pod.

** Add logs to worker handler
This makes the logs accessible from the Dask UI. I'm sure there's a better way to do this.
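For the record, the hack looks roughly like this. It assumes that the records shown on the dashboard's worker "Logs" page come from the handler(s) distributed attaches to its own `distributed` logger, which is an internal detail:

```python
import logging


def forward_pangeo_forge_logs():
    # Reuse distributed's own log handlers for the pangeo_forge_recipes logger,
    # so its DEBUG/INFO records show up on the worker "Logs" page.
    pf_logger = logging.getLogger("pangeo_forge_recipes")
    pf_logger.setLevel(logging.DEBUG)
    for handler in logging.getLogger("distributed").handlers:
        pf_logger.addHandler(handler)


# Run on all current workers, and register for any future ones:
# client.run(forward_pangeo_forge_logs)
# client.register_worker_callbacks(forward_pangeo_forge_logs)
```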
** Hanging Workers
I'm seeing some hanging, but perhaps different from what others saw. The worker logs say
Looking at the call stack
So it seems like the issue is in `lock_for_conflicts`. Either a real deadlock, or an event loop issue.
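(For anyone else debugging this: distributed can report per-task call stacks from the client, which is an easy way to see a task stuck in `lock_for_conflicts`. A minimal sketch, assuming the same port-forwarded scheduler as above:)

```python
from pprint import pprint

import distributed

client = distributed.Client("tcp://127.0.0.1:8786")  # the port-forwarded scheduler

# With no arguments this asks every worker for the call stack of whatever
# tasks it is currently running.
pprint(client.call_stack())
```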