rosepearson closed this issue 9 months ago.
Have also noticed some unexpected failures with Dask TimeOut/Heartbeat errors. In some instances this is actually after the final chunk appears to have been written to file.
job.err.txt Attached Cylc job.err file illustrating the Dask TimeOut/Heartbeat errors.
@rosepearson is it making the whole code crash? Dask can be a bit verbose with errors during the tear-down of the Dask cluster... although in my experience this doesn't cause any real issue (apart from the logs reporting errors).
@jennan it is making the whole Cylc task crash. I have another one below where we hit the time limit - but again the file had been completely written out. It could be a coincidence that the file only just finished writing before the time limit was reached and there wasn't time to execute the "Job Succeeded" print... but it seems to me like there might be something slightly fishy. Just attaching as a reference for now - not expecting any action on this ticket right now :)
job.err.txt Attached another Cylc job.err file. Again there are lots of verbose task errors. The task ultimately failed after hitting the time limit - however, it had written out a complete file before that.
Noting another repeated error that occurs for a particular roughness run. It repeatedly fails when launched on Maui through Cylc, even though it has completed and produced the expected output, i.e. it appears to be a false failure. I have checked running on the NIWA Maui nodes without SLURM (accessed through https://jupyter.maui.niwa.co.nz/) and it runs fine there. This might be a good first place to look into regarding seemingly random errors after the job has run successfully. job.err.txt
Tile 2728, roughness stage, takes ~15 min.
One more note about the failures I'm experiencing: they are often related to the `Sending large graph of size XX.XX MiB` warning. One example is copied below. I am wondering if we should restructure how we deal with upsampling the coarse DEM so that it is done directly, instead of breaking it into chunks and doing each chunk individually. This would mean many fewer explicit dask delayed calls and would leave it up to Dask exactly how it chooses to manage the compute load. It would also mean the upsampling is all done with linear interpolation - but that seems sensible anyway.
/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/client.py:3160: UserWarning: Sending large graph of size 12.85 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
warnings.warn(
2023-10-14 11:35:39,162 - distributed.protocol.core - CRITICAL - Failed to Serialize
Traceback (most recent call last):
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/protocol/core.py", line 109, in dumps
frames[0] = msgpack.dumps(msg, default=_encode_default, use_bin_type=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/msgpack/__init__.py", line 38, in packb
return Packer(**kwargs).pack(o)
^^^^^^^^^^^^^^^^^^^^^^^^
File "msgpack/_packer.pyx", line 294, in msgpack._cmsgpack.Packer.pack
File "msgpack/_packer.pyx", line 300, in msgpack._cmsgpack.Packer.pack
File "msgpack/_packer.pyx", line 297, in msgpack._cmsgpack.Packer.pack
File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
File "msgpack/_packer.pyx", line 272, in msgpack._cmsgpack.Packer._pack
ValueError: memoryview is too large
2023-10-14 11:35:39,163 - distributed.comm.utils - ERROR - memoryview is too large
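For reference, the warning itself points at a mitigation: a minimal sketch of the scatter-ahead-of-time pattern it suggests. The `client` setup and `large_array` here are placeholders, not the GeoFabrics code.

```python
import numpy as np
import distributed

# In-process cluster purely for illustration
client = distributed.Client(processes=False)

# Stand-in for the large object (e.g. a coarse DEM) that would otherwise
# be embedded directly in the task graph
large_array = np.random.rand(2_000, 2_000)

# scatter ships the data to the workers once and returns a future;
# passing the future (not the array) to submit keeps the graph small
future = client.scatter(large_array, broadcast=True)
result = client.submit(np.sum, future).result()

client.close()
```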
Looking back through my notes/comments:

- `ValueError: memoryview is too large` occurs in the waterway stage for tiles 3128, 3130, 3131, and 3532.
- `ValueError: 3615135876 exceeds max_bin_len(2147483647)` occurs in the waterway stage for tile 3431.

I've also been thinking about how I break up the coarse DEM stage into a bunch of explicit chunks, and wonder whether that would be best left up to Dask.
Tile 2627 roughness
The most recent commit associated with this comment is https://github.com/rosepearson/GeoFabrics/pull/217/commits/ae30061c64b9c2ff6e93e87d16427519cb58ffa4. I have made some of the agreed changes.
I've been looking into xarray.interp with chunking. I've come across the following cases:
I've also tried various ways to force/encourage dask in the interp call:
- `coarse_dem = coarse_dem.interp(x=dask.array.from_array(self._dem.x, chunks=1000), y=dask.array.from_array(self._dem.y, chunks=1000), method="linear")`
- `coarse_dem.chunk(1000).interp(x=self._dem.x, y=self._dem.y, method="linear")`
I tried with both the normal chunk size and also one that was 8x smaller given the coarse resolution is 8x bigger.
It started saving out a file, but did not succeed in completely writing the file.
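For reference, a minimal self-contained sketch of the pattern being attempted. The array sizes, chunk sizes, and coordinates are illustrative only, not the GeoFabrics values; dask-backed `interp` needs scipy, and interpolation along chunked dimensions needs a reasonably recent xarray.

```python
import numpy as np
import xarray as xr

# Hypothetical coarse DEM, 8x coarser than the target grid (the real
# grids use a y-decreasing map layout, which needs minor tweaks)
coarse_dem = xr.DataArray(
    np.random.rand(125, 125),
    dims=("y", "x"),
    coords={"y": np.linspace(0, 1000, 125), "x": np.linspace(0, 1000, 125)},
).chunk({"x": 64, "y": 64})

# Target (fine) grid coordinates
fine_x = np.linspace(0, 1000, 1000)
fine_y = np.linspace(0, 1000, 1000)

# interp on a dask-backed DataArray stays lazy; nothing is computed
# until compute() or a write such as to_netcdf()
fine_dem = coarse_dem.interp(x=fine_x, y=fine_y, method="linear")
fine_dem = fine_dem.compute()
```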
@jennan thanks for your notebook. I've implemented it with a few minor tweaks to deal with the standard map layout (y-x, with y decreasing). There is also a seemingly odd issue where `dask.array.map_blocks` needs the second array to be given a chunk size of its length or smaller - odd, as the first array is automatically given a chunk of its full size if the specified chunk size is greater (see the sketch below).
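For reference, a minimal sketch (not @jennan's notebook) of the block pairing in `dask.array.map_blocks` that the chunk-size requirement relates to:

```python
import numpy as np
import dask.array as da

# Two 1-D arrays; map_blocks pairs them block-by-block, so they must
# end up with the same number of blocks along shared dimensions
x = da.from_array(np.linspace(0.0, 1.0, 1000), chunks=500)  # 2 blocks
y = da.from_array(np.linspace(1.0, 0.0, 1000), chunks=500)  # 2 blocks

# Each call receives one block of x and the matching block of y
z = da.map_blocks(np.add, x, y, dtype=float)
print(float(z.sum().compute()))  # each pair sums to 1.0, so 1000.0
```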
No action required - just an update
Test profiling for tests/test_dem_generation_westport_4/test_dem_generation_westport_4.py: the top sets chunking to 10, giving 4x5 chunks (1m54s), while the bottom sets chunking to 300, giving 1x1 chunks (1m15s).
It ran successfully and produced the netCDF file, which is new. It did, however, fail with a TimeOut error shortly afterwards.
2023-11-10 02:59:37,360 - distributed.nanny - WARNING - Worker process still alive after 3.199999694824219 seconds, killing
2023-11-10 02:59:38,137 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x2aad57c94d90>>, <Task finished name='Task-180509187' coro=<SpecCluster._correct_state_internal() done, defined at /nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/deploy/spec.py:346> exception=TimeoutError()>)
Traceback (most recent call last):
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/utils.py", line 1922, in wait_for
return await fut
^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/tornado/ioloop.py", line 738, in _run_callback
ret = callback()
^^^^^^^^^^
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/tornado/ioloop.py", line 762, in _discard_future_result
future.result()
TimeoutError
Traceback (most recent call last):
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/utils.py", line 1922, in wait_for
return await fut
^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/geofabrics/__main__.py", line 53, in <module>
cli_run_from_file()
File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/geofabrics/__main__.py", line 47, in cli_run_from_file
runner.from_instructions_file(instructions_path=args.instructions)
File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/geofabrics/runner.py", line 211, in from_instructions_file
from_instructions_dict(instructions=instructions)
File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/geofabrics/runner.py", line 153, in from_instructions_dict
run_processor_class(
File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/geofabrics/runner.py", line 48, in run_processor_class
runner.run()
File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/geofabrics/processor.py", line 2999, in run
dem = self.create_dem(waterways=waterways)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/geofabrics/processor.py", line 2861, in create_dem
runner.run()
File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/geofabrics/processor.py", line 910, in run
with cluster, distributed.Client(cluster) as client:
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/deploy/cluster.py", line 540, in __exit__
aw = self.close()
^^^^^^^^^^^^
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/deploy/spec.py", line 293, in close
aw = super().close(timeout)
^^^^^^^^^^^^^^^^^^^^^^
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/deploy/cluster.py", line 226, in close
return self.sync(self._close, callback_timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/utils.py", line 359, in sync
return sync(
^^^^^
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/utils.py", line 426, in sync
raise exc.with_traceback(tb)
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/utils.py", line 399, in f
result = yield future
^^^^^^^^^^^^
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/tornado/gen.py", line 767, in run
value = future.result()
^^^^^^^^^^^^^^^
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/deploy/spec.py", line 446, in _close
await self._correct_state()
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/deploy/spec.py", line 359, in _correct_state_internal
await asyncio.gather(*tasks)
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/nanny.py", line 595, in close
await self.kill(timeout=timeout, reason=reason)
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/nanny.py", line 380, in kill
await self.process.kill(reason=reason, timeout=0.8 * (deadline - time()))
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/nanny.py", line 843, in kill
await process.join(max(0, deadline - time()))
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/process.py", line 330, in join
await wait_for(asyncio.shield(self._exit_future), timeout)
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/site-packages/distributed/utils.py", line 1921, in wait_for
async with asyncio.timeout(timeout):
File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.11/asyncio/timeouts.py", line 111, in __aexit__
raise TimeoutError from exc_val
TimeoutError
(Screenshots: the dashboard during the LiDAR write, and the dashboard at the end of execution.)
export DASK_DISTRIBUTED__COMM__RETRY__COUNT=3
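For context, that environment variable maps onto the `distributed.comm.retry.count` configuration key, which can equivalently be set in-process; a minimal sketch:

```python
import dask

# Equivalent to DASK_DISTRIBUTED__COMM__RETRY__COUNT=3: retry failed
# comm attempts up to three times before raising
dask.config.set({"distributed.comm.retry.count": 3})
```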
A Dask folder, `dask-worker-space`, is created - found in `tmp\dask-worker-space`. Attempts so far:

- Updated processor.py to nest the contexts for the cluster and then the client. Tried - still an error.
- Updated the processor.py cluster variable with `dashboard_address: None`. Tried - still an error.
- Updated the processor.py cluster variable with `"local_directory": "path_to_folder_with_results"`. This produces an output in the folder - a global.lock & purge.lock file are created and not deleted prior to the crash.
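A minimal sketch combining the variations tried above; the worker counts and paths are placeholders, not the GeoFabrics settings.

```python
import distributed

cluster_kwargs = {
    "n_workers": 4,
    "threads_per_worker": 1,
    "dashboard_address": None,  # disable the diagnostics dashboard
    "local_directory": "path_to_folder_with_results",  # worker scratch space
}

# Nested contexts so the client closes before the cluster tears down
with distributed.LocalCluster(**cluster_kwargs) as cluster:
    with distributed.Client(cluster) as client:
        ...  # run the processing here
```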
This is an issue for an upcoming NeSI consultancy with @jennan. The focus will be on improving the performance and stability of GeoFabrics for larger scale problems.
The focus is on making better use of Dask throughout the GeoFabrics stages. Two identified areas are:
1. `RasterArray.interpolate_na`. Profiling has shown that the pinch points are quite different for larger scale problems than for smaller ones. Take the two profiles below:
   1. A 6 min problem with all geofabric stages (small_2m_res.html).
   2. A 4 hr problem with all geofabric stages; the only difference from 1 is that it is 1 m rather than 2 m resolution (small_1m_res.html).
2. `RasterArray.clip`. Another area of focus (although it hasn't shown up as an issue in the 1 m profiling, it is visible in the 2 m). This could make use of either pandas or dask-geopandas. A weak attempt that I didn't get off the ground can be seen as a comment in processor.py.
Also worth noting that we may be able to do this more directly using a `rolling(...).min()` call on the xarray with an appropriately sized window; a sketch is below.
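A minimal sketch of that idea; the window size and array are assumptions, not the GeoFabrics values.

```python
import numpy as np
import xarray as xr

# Stand-in DEM, dask-backed so the rolling reduction also stays lazy
dem = xr.DataArray(np.random.rand(100, 100), dims=("y", "x")).chunk(50)

# Rolling minimum over a 5x5 window; min_periods=1 keeps values at the
# edges where the full window does not fit
local_min = dem.rolling(x=5, y=5, center=True, min_periods=1).min()
local_min = local_min.compute()
```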