@cisaacstern I'm on vacation and travelling today, but I'll be back in the office on Tuesday and Wednesday if you would like to pair on log filtering. The at-scale recipe issues that we have encountered can be broadly categorized into 2 areas: graph serialization issues related to graph size, and `fsspec`/`hdf5` concurrency issues. We can begin by investigating these areas first.
@sharkinsspatial pasting and replying to your question in https://github.com/pangeo-forge/pangeo-forge.org/issues/63#issuecomment-1101629567 here:
> @cisaacstern Is the Prefect dashboard for this run publicly accessible? This is often helpful in diagnosing fsspec concurrency issues, as these deadlocks don't appear as failures but as tasks stuck in a pending state. The Prefect dashboard provides a mechanism for filtering on task state to easily identify these deadlocks.
The dashboard is not publicly accessible. Filtering the logs there by level `ERROR`, we get a few different traceback types:
- `fsspec.exceptions.FSTimeoutError`
- `distributed.comm.core.CommClosedError`
- `OSError: Timed out trying to connect`
My guess is that `fsspec.exceptions.FSTimeoutError` (which arose on the first `store_chunk` task) is the root problem, and that the others are symptoms of that? As for why that error occurred, I'm not clear on that yet. Certainly welcome any and all suggestions.
As shown in the details dropdown for `fsspec.exceptions.FSTimeoutError` above, this error was caused by

```
aiobotocore.response.AioReadTimeoutError: Read timeout on endpoint URL: "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr/analysis_error/0.0.0"
```
The same error also appears in the Prefect logs for another URL (which I had not previously pasted to this thread):

```
aiobotocore.response.AioReadTimeoutError: Read timeout on endpoint URL: "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr/analysed_sst/15.0.0"
```
As seen in the traceback pasted in https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/issues/2#issuecomment-1101651101, this is triggered by `zarr_array[zarr_region] = data`:
```python
# pangeo_forge_recipes.recipes.xarray_zarr
target_mapper = config.storage_config.target.get_mapper()
# etc
zgroup = zarr.open_group(target_mapper)
for vname, var_coded in ds_chunk.variables.items():
    zarr_array = zgroup[vname]
    # etc
    zarr_array[zarr_region] = data
```
Based on the `aiobotocore` error, this appears to be related to https://stackoverflow.com/questions/53974790, in which @martindurant recommended decreasing `block_size` to mitigate the effect of slow network speeds. Notes on this:

- The `default_block_size` here is 5 MB, which seems modest to begin with.
- This I/O goes through our `FSSpecTarget`. And relatedly, because this I/O is mediated by `zarr`, I admit I'm a bit unclear as to what file system is actually handling it.
Martin, questions for you:

1. Is `FSSpecTarget`'s `S3FileSystem`, from which the `target_mapper` is assigned by the line `target_mapper = config.storage_config.target.get_mapper()`, handling I/O for the `zgroup` assigned as `zgroup = zarr.open_group(target_mapper)`?
2. Is decreasing `default_block_size` on this `S3FileSystem` a sensible next step? It would seem to me like a GCS cluster in `us-central1` should be able to read 15-28 MB files from the OSN S3 bucket (also in central, IIUC) in 5 MB blocks without hitting the default timeout of 5 minutes?
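For concreteness, here is a minimal sketch of what decreasing the block size could look like with `s3fs` directly; the endpoint URL is taken from the tracebacks above, while `anon=True` and the 1 MB value are illustrative assumptions (this is not the `FSSpecTarget` code path):

```python
# Hypothetical sketch: an s3fs filesystem with a smaller default_block_size,
# per the StackOverflow suggestion above. All values here are assumptions.
import s3fs

fs = s3fs.S3FileSystem(
    anon=True,  # assumption: public read access to the OSN bucket
    default_block_size=1 * 2**20,  # 1 MB, down from the 5 MB default
    client_kwargs={"endpoint_url": "https://ncsa.osn.xsede.org"},
)
```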
This feels like an OSN error. Perhaps there was a glitch on their end. Two thoughts:

> Can you try just running the recipe again? (If so, would it reuse the same cache location so it doesn't have to re-download all the files?)
Yes, the cache would be re-used. I'm not optimistic that a re-run will get us past this, given that this recipe was actually already run once before, and it encountered the same error the first time. The first run was the one linked in https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/issues/1#issue-1205106780. As described in https://github.com/pangeo-forge/pangeo-forge.org/issues/63, the pangeo-forge.org logs are not helpful there, but digging into the Prefect backend reveals the same `store_chunk` ➡️ `zarr_array[zarr_region] = data` ➡️ `fsspec.exceptions.FSTimeoutError` that we're discussing here.
> Are we using retries on these tasks?
I'm actually unclear on this. My assumption would be no, we are not using retries. I base this on the fact that I don't see evidence of them in the logs for the runs under consideration. And, even more convincingly, that the `xarray_zarr_recipe_compiler` does not pass any `annotations`, which is how retries are defined by the Prefect compiler.
That being said, in a separate run, I did observe what looked like retries of `prepare_target`, described in https://github.com/pangeo-forge/staged-recipes/pull/127#issuecomment-1076893854. I'm not sure where those retries were configured, but they don't seem to have been employed for the failing `store_chunk` tasks we're currently discussing.
Seems like a good next step would be supporting configuration of retries for the `xarray_zarr_recipe_compiler`? The easiest way to make this work for the registrar would be:

1. Adding retry support to `pangeo-forge-recipes`
2. Cutting a `pangeo-forge-recipes` `0.8.3` release (i.e. `pangeo-forge-recipes==0.8.3`)
3. After the `0.8.3` release, bumping `pangeo_forge_version` in this feedstock's `meta.yaml` to `0.8.3`; pushing this change to `main` on this feedstock will trigger a new recipe run

> Let me know if you think it would make sense to reach out to the OSN pod maintainers to do further debugging.
Thanks, let's keep that in mind as a next recourse if retries don't help. Is it possible there are somehow... concurrency limits on the bucket?
@cisaacstern Previously I had been approaching retry of failed tasks at the Prefect `task` level during registration after compilation, but it seems as if this configuration has been shifted directly to `pangeo-forge-recipes` via annotations.

Where do we feel this configuration logic should live and how should this stage annotation be created and managed for different compilation strategies?
Additionally, when using Prefect as the compilation target we may want to also consider including `retry_delay` for the task args to avoid overrunning S3 requests. https://docs.prefect.io/api/latest/core/task.html#task-2
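To make the suggestion concrete, a minimal sketch of the Prefect (1.x) task kwargs in question; the specific values and the task body are illustrative assumptions:

```python
# Sketch of Prefect 1.x retry configuration; values here are assumptions.
from datetime import timedelta

from prefect import task

@task(max_retries=3, retry_delay=timedelta(seconds=30))
def store_chunk(chunk_key):
    ...  # write one chunk to the target store
```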
@sharkinsspatial, thanks so much for chiming in. It's very useful to know that you'd been using retries (with delays!) successfully in your prior work.
> Where do we feel this configuration logic should live and how should this stage annotation be created and managed for different compilation strategies?
This is a very good question which I imagine will require a slightly longer iteration cycle for us to get right.
To begin, I'm going to try just patching these retry + delay kwargs into a `pangeo-forge-recipes` branch, and then re-running this recipe with that branch, just so we can see if this indeed resolves our problem. I'll report back on this thread with results. If this works, we can circle back and consider the best way to specify this config.
After a series of steps reflected in the linked PRs above, I was able to re-deploy this feedstock with `pangeo-forge-recipes` installed from a branch with this hardcoded Prefect retry logic, referenced from Sean's links in https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/issues/2#issuecomment-1102931123.
We are still failing (and the linked recipe run page logs are still unhelpful without filtering), but perhaps this could be construed as progress, in that I'm not seeing the `fsspec.exceptions.FSTimeoutError` in the Prefect logs. The main error we're now hitting appears to be a new flavor of the `distributed.comm.core.CommClosedError` reported in https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/issues/2#issuecomment-1101651101:

```
ConnectionResetError: [Errno 104] Connection reset by peer
```
Just to note, this recipe does generate 7134 `store_chunk` tasks:

```python
len(list(recipe.iter_chunks()))  # --> 7134
```
@sharkinsspatial you mentioned "overrunning S3 requests" above. Does this many concurrent writes seem like it might cause that problem? If so, is our best solution limiting concurrency?
@cisaacstern Given the variable partition prefixes where you are writing data and the volume of concurrent requests (related to the number of workers in your cluster), I don't believe you should be experiencing any S3 request throttling.

Seeing only `ConnectionResetError` in your logs would lead me to believe that the worker running `store_chunk[40]` was killed. I'm assuming that the filter you are using with Grafana includes both the scheduler and worker logs, if you are using this approach.

My initial hunch/guess is that your worker pod is being evicted/restarted by k8s due to OOM (see https://distributed.dask.org/en/stable/killed.html). We can use `kubectl` or Lens to view the cluster health information in more detail and see if this is indeed the case.
Thanks to Sean for a helpful offline sync. Recapping our discussion:

- `e2-standard-4` instances appear to allow a maximum of 16 GB memory (not the 4 GB we'd previously thought)

I'll bump up scheduler memory now, and then re-deploy the recipe, and we'll see what happens.
I noticed that the recipe is using locking, and that is where the timeout happens. That doesn't make sense to me, given the recipe itself - https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/blob/main/feedstock/recipe.py. No locks should be required.
IIUC, we use locks whenever we're in a distributed context, which... we are on Prefect?
But there should not be any actual conflicts for this type of recipe.
Ah I think I understand! It's because we specify `{'time': 2}` in `target_chunks`.

But only 1 item per file.

We should add `inputs_per_chunk=2` to the recipe constructor.
> We should add `inputs_per_chunk=2` to the recipe constructor.
I'll do this now, before increasing scheduler memory, to see if that resolves the problem.
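For reference, a minimal sketch of the change, assuming this feedstock's `XarrayZarrRecipe` setup (`pattern` stands in for the feedstock's `FilePattern`; treat the exact constructor call as an assumption):

```python
# Sketch only: adding inputs_per_chunk=2 alongside the existing
# {'time': 2} target chunks, per the suggestion above.
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

recipe = XarrayZarrRecipe(
    pattern,                    # this feedstock's FilePattern (assumed)
    inputs_per_chunk=2,         # two single-timestep inputs per chunk...
    target_chunks={"time": 2},  # ...matching the target chunking
)
```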
To be clear, the locking should also work. But this might unblock us for now.
Merging #4 triggered a new deployment linked to recipe-run/36, which is currently in progress.
Adding `inputs_per_chunk=2` has cut the number of `store_chunk` tasks in half:

```python
len(list(recipe.iter_chunks()))  # --> now returns 3567
```

whereas it was previously 7134. This may be in our favor.
I was optimistic about this last deployment, because we made it through around 2000 of the ~3500 `store_chunk` tasks without failure. Then the run failed 😐. Here is as much of the logs as I'm able to pull from Grafana (a 1000 line limit is being imposed somewhere, which I don't understand how to increase):

https://gist.github.com/cisaacstern/7c3245720efc4da8e2b3cafdaaeca57a
On `store_chunk[1852]` we see a `distributed.worker - WARNING - Heartbeat to scheduler failed`. I am still learning how to conclusively confirm that the scheduler died due to OOM (there appear to be ways to do this at the k8s level that I haven't yet fully grokked). In the absence of conclusive evidence, this warning does appear to suggest the type of scheduler memory issues mentioned in https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/issues/2#issuecomment-1104134323.
While throwing memory at the scheduler is not a great long-term solution, in the absence of an upstream fix in Prefect, I'm curious what increasing scheduler memory may do here. As an experiment, I will try re-running this with increased scheduler memory, which we can always dial back.
Conceivably, the reduced task graph size resulting from `inputs_per_chunk` allowed us to get further than we had in the past, but it's possible we were still under-provisioned on memory. This upcoming experiment should be illuminating.
As recorded in the growing series of failed deployments, this problem was not solved by increasing scheduler memory alone.
Most recently:

```
OSError: Timed out trying to connect
```

Assuming `tcp://dask-jovyan-dd7e32e7-9.pangeo-forge-columbia-staging-bakery:8786` is the scheduler (there is definitely a way to confirm this with the k8s tooling that I haven't yet grasped), then this is looking very convincingly like the previously referenced graph size / memory leak problem being tracked in https://github.com/pangeo-forge/pangeo-forge-recipes/issues/208.
So I have a workaround idea which is not too hacky. Right now we're generating ~3000 `store_chunk` tasks which create ~100 MB chunks. If we double the chunk size to 200 MB, we can halve the number of `store_chunk` tasks, and get a smaller graph which can possibly execute without failing. I'm going to try that now.
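Sketched under the same assumed names as the snippet above, the workaround might look like this; the jump to 4 time steps per chunk matches the commit referenced below, but the exact kwargs are assumptions:

```python
# Sketch only: doubling the time chunk from 2 to 4 to roughly halve the
# number of store_chunk tasks and produce ~200 MB chunks.
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

recipe = XarrayZarrRecipe(
    pattern,                    # this feedstock's FilePattern (assumed)
    inputs_per_chunk=4,
    target_chunks={"time": 4},
)
```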
Reducing the number of `store_chunk` tasks to ~1500 appears to have resulted in a successful deployment.
@rwegener2, does this dataset look as you'd expect it to?

```python
import fsspec
import xarray as xr

dataset_public_url = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr'
mapper = fsspec.get_mapper(dataset_public_url)
ds = xr.open_zarr(mapper, consolidated=True)
ds
```
On a technical note, I'm pretty confused as to why the chunks in this dataset are only of length `2` in the time dimension, when the latest commit (https://github.com/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/commit/32ba8c8f6a639975a1061ece699ac2f053cb8d02) clearly specifies `4` time steps per chunk. And this commit is the Head SHA specified in the recipe run for this deployment.
Amazing! Great progress Charles!
> On a technical note, I'm pretty confused as to why the chunks in this dataset are only of length `2` in the time dimension:
This certainly seems inconsistent. Are you sure you weren't pointing it at an older Zarr store that already had chunks initialized?
For debugging purposes, it can be useful to open the store with Zarr directly (rather than Xarray).
> Are you sure you weren't pointing it at an older Zarr store that already had chunks initialized?
Oh good call, this is a distinct possibility.
> For debugging purposes, it can be useful to open the store with Zarr directly (rather than Xarray).
Interesting. What does zarr tell us that xarray does not?
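For instance, a sketch of a direct Zarr inspection, assuming the `mapper` from the snippet above is reused and using `analysed_sst` (seen in the tracebacks earlier in this thread) as a representative variable:

```python
# Sketch: open the store with Zarr directly, bypassing Xarray's decoding,
# to see the chunk shapes actually written to storage.
import zarr

group = zarr.open_consolidated(mapper)  # mapper from the snippet above
print(group["analysed_sst"].chunks)     # on-disk chunk shape
```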
I'm going to close this because I believe we've established that the diagnosis is the above-linked Prefect / Dask graph size issue.
Wow, @cisaacstern, this is great!
The output data looks normal to me. I agree that the chunking is not what I expected, but I'm fine with it the way it is.
Thanks so much for all your work on this!
Our first deployment of this feedstock failed. The fact that the recipe succeeded in pruned form, and that we know the deployment logic is sound, suggests that the problem is one of scale (something like hitting concurrency limits, etc.).
The recipe run page for this failed run is linked via the `Deployed` text on the deployments page linked above. The logs there are not especially informative (I will raise a separate issue about filtering on the website repo). I will do some digging on the Prefect/Grafana backend to surface some more informative logs and report back here. @sharkinsspatial, I may lean on your experience running recipes at scale, which I expect may help us get to the bottom of this.
cc @rwegener2 @rabernat