rosepearson / GeoFabrics

A package for generating hydrologically conditioned DEMs and roughness maps from LiDAR and other infrastructure data. Check the wiki for install and usage instructions, and documentation at https://rosepearson.github.io/GeoFabrics/
GNU General Public License v3.0

Dask lazy compute to reduce memory overhead #48

Closed rosepearson closed 1 year ago

rosepearson commented 2 years ago

Currently compute is called in the dem module after the chunked DEM is created (see the code screen capture: [image]). We could reduce the memory load by not calling compute and instead calling dense_dem.to_netcdf(...). Instead of all of the dense_dem contents being loaded into memory in one go, chunks could then be loaded, processed and saved individually, reducing the overall memory footprint (particularly important for large catchments at fine resolutions).

Considerations - saving the dense DEM before generating the offshore DEM values.
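A minimal sketch of the proposal, assuming xarray with a Dask-backed array and a netCDF backend installed. The dataset below is a small hypothetical stand-in for dense_dem, and the variable name z, the dimensions and the output path are illustrative rather than taken from GeoFabrics. Whether peak memory actually drops will depend on the scheduler and chunking.

```python
import os
import tempfile

import dask.array as da
import numpy as np
import xarray as xr

# Hypothetical stand-in for the chunked dense DEM: a lazy, Dask-backed raster.
dense_dem = xr.Dataset(
    {"z": (("y", "x"), da.ones((400, 400), chunks=(100, 100), dtype="float32"))},
    coords={"y": np.arange(400), "x": np.arange(400)},
)

output_path = os.path.join(tempfile.mkdtemp(), "dense_dem.nc")

# No explicit .compute(): the write evaluates the graph chunk by chunk
# rather than materialising the whole array in this process first.
dense_dem.to_netcdf(output_path)

# Reopen lazily so later steps (e.g. the offshore values) can stay chunked.
with xr.open_dataset(output_path, chunks={"y": 100, "x": 100}) as saved:
    saved_shape = saved["z"].shape
    saved_is_lazy = saved["z"].chunks is not None
```

Saving the dense DEM first and reopening it lazily is one way to handle the consideration above, since the offshore step then reads from disk instead of re-running the LiDAR graph.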

Areas to address from @rosepearson:

rosepearson commented 2 years ago

Notes - I quickly looked into doing this and saw some odd behaviour: it seemed to load some of the LiDAR files many times and took much longer than the explicit compute approach.
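One possible explanation (an assumption, not something confirmed in this thread) is that without an explicit compute every later step that needs values re-runs the lazy graph, including the file loading. The sketch below shows that Dask behaviour in isolation using dask.delayed; load_tile and the tile name are hypothetical.

```python
import dask

load_calls = []  # records every time the (pretend) tile is actually read


@dask.delayed
def load_tile(name):
    load_calls.append(name)
    return [1.0, 2.0, 3.0]


tile = load_tile("tile_a.laz")          # hypothetical file name
total = dask.delayed(sum)(tile)         # first downstream step
largest = dask.delayed(max)(tile)       # second downstream step

# Computing each result on its own re-runs the shared load task.
total.compute(scheduler="synchronous")
largest.compute(scheduler="synchronous")
calls_when_separate = len(load_calls)

# Computing both in one call lets Dask run the shared task once.
load_calls.clear()
dask.compute(total, largest, scheduler="synchronous")
calls_when_together = len(load_calls)

print(calls_when_separate, calls_when_together)
```

If this is what is happening, writing the lazy result to disk once and reopening it, or calling persist, would avoid the repeated reads.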

INFO:root:The output the coordinate system EPSG values of {'horizontal': 2193, 'vertical': 7839} will be used. If these are not as expected. Check both the 'horizontal' and 'vertical' values are specified.
INFO:root:Downloading vector layers [51153] from the linz dataservice
WARNING:fiona._env:One or several characters couldn't be converted correctly from UTF-8 to ISO-8859-1.  This warning will not be emitted anymore.
INFO:root:The LiDAR dataset Wellington_2013 is assumed to have the source coordinate system EPSG: {'horizontal': 2193, 'vertical': 7839} as defined in the instruction file
INFO:root:Preparing [2, 2] chunks
INFO:root:  Chunk [0, 0]
INFO:root:  Chunk [0, 1]
INFO:root:  Chunk [1, 0]
INFO:root:  Chunk [1, 1]
INFO:root:Reading all 6 files in chunk.
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089041.laz
INFO:root:Reading all 8 files in chunk.
INFO:root:Reading all 4 files in chunk.
INFO:root:Reading all 6 files in chunk.
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089041.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091041.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_088041.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_088038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_088039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090041.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_088040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_088040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091039.laz
INFO:root:Incorporting Bathymetry: ['C:/Users/pearsonra/Documents/data/Bathymetry/Waikanae/lds-depth-contour-polyline-hydro-190k-1350k-SHP.zip!depth-contour-polyline-hydro-190k-1350k.shp']
INFO:root:Reading all 8 files in chunk.
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091039.laz
INFO:root:Reading all 8 files in chunk.
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091039.laz
INFO:root:Reading all 8 files in chunk.
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091039.laz
INFO:root:Reading all 8 files in chunk.
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091039.laz
INFO:root:Creating offshore interpolant
INFO:root:Reading all 8 files in chunk.
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091039.laz
INFO:root:Reading all 8 files in chunk.
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090039.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091039.laz
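To put a number on the repeated loading, the log can be tallied per file. A small sketch using only the standard library; the excerpt is nine consecutive lines copied from the log above, and the regular expression assumes the "Loading in file" message format shown there.

```python
import re
from collections import Counter

# Nine consecutive lines copied from the log above.
log_excerpt = """\
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_091040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090038.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_090040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089041.laz
INFO:root:   Loading in file ot_CL1_WLG_2013_1km_089040.laz
"""

# Count how many times each LiDAR file is reported as loaded.
load_counts = Counter(re.findall(r"Loading in file (\S+)", log_excerpt))

for file_name, count in load_counts.most_common():
    print(count, file_name)
```

Running the same tally over the full log from each approach would give a direct comparison of how many reads the lazy version triggers.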
rosepearson commented 1 year ago

@jennan has agreed to provide guidance on this issue. Depending on what we find there are various possibilities on how to proceed.

The first things to do, however, are:

  1. @rosepearson to define the scale of the problem - size of medium and large problem at 10m, 5m and 2.5m resolution
    • Scale of the 'large' problem is 10GB for final XArray (e.g. the Clutha River), and less than 1GB for the 'medium' problem (e.g. the Buller River).
    • We do have the potential for a 100GB array, though one that is almost all (90%) masked NaN, when estimating waterways.
    • Also discussed with @CyprienBosserelle and will move to float32 by default - will do this prior to our meeting.
  2. Together look at a medium scale problem on the HPC - explore the impact of moving to lazy compute and adjusting LocalCluster settings with Dask dashboarding
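A rough sizing sketch for the resolutions listed above. The 100 km x 100 km extent is a hypothetical catchment bounding box, not a figure from this thread. It only shows how the dense array size scales with cell size, and what moving from float64 to float32 saves.

```python
import math


def raster_bytes(width_m, height_m, resolution_m, bytes_per_cell):
    """Size in bytes of a dense raster covering the extent at one resolution."""
    n_cols = math.ceil(width_m / resolution_m)
    n_rows = math.ceil(height_m / resolution_m)
    return n_cols * n_rows * bytes_per_cell


FLOAT64_BYTES, FLOAT32_BYTES = 8, 4
extent_m = (100_000, 100_000)  # hypothetical bounding box, in metres

# Map each resolution to (float64 size, float32 size) in GB.
sizes_gb = {
    resolution: (
        raster_bytes(*extent_m, resolution, FLOAT64_BYTES) / 1e9,
        raster_bytes(*extent_m, resolution, FLOAT32_BYTES) / 1e9,
    )
    for resolution in (10, 5, 2.5)
}

for resolution, (as_float64, as_float32) in sizes_gb.items():
    print(f"{resolution} m: float64 {as_float64:.1f} GB, float32 {as_float32:.1f} GB")
```

Halving the cell size quadruples the cell count, so going from 10m to 2.5m is a 16x increase, while float32 halves whatever the total is.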
rosepearson commented 1 year ago

Caught up with @jennan, who showed off the powers of the Dask profiler. Tasks for @rosepearson to do before the next meeting:

Based on the large problem we may change the Dask workers configuration and then address lazy compute.
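For reference, a sketch of the worker settings that would be adjusted, assuming a dask.distributed LocalCluster. The values are placeholders chosen to mirror the client summary logged later in this thread (processes=20, threads=20, memory=400.00 GiB); memory_limit applies per worker, so the total is the worker count times the limit.

```python
from dask.utils import format_bytes, parse_bytes

# Placeholder settings; memory_limit is the limit for each worker process.
cluster_kwargs = {
    "n_workers": 20,
    "threads_per_worker": 1,
    "memory_limit": "20GiB",
}

# Total memory the cluster may use across all workers.
total_memory = cluster_kwargs["n_workers"] * parse_bytes(cluster_kwargs["memory_limit"])
print(format_bytes(total_memory))


def start_cluster():
    """Start the cluster; not called here because it spawns worker processes."""
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(**cluster_kwargs)
    return Client(cluster)
```

Trading workers for a higher per-worker limit keeps the total the same but gives each task more headroom, which may matter if individual chunks are what exhaust memory.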

rosepearson commented 1 year ago

Weirdly, I get the following error when trying to load the Miniconda3 module: [image]

jennan commented 1 year ago

Hi @rosepearson, on wsg001 (and any Maui ancil. node), you need to load the NeSI module after module purge, because purge unloads everything there (it doesn't unload the NeSI module on Mahuika):

module purge
module load NeSI
module load Miniconda3
rosepearson commented 1 year ago

Thanks @jennan - I wonder if you might want to update the documentation on https://support.nesi.org.nz/hc/en-gb/articles/360001580415-Miniconda3 with a note that you may need to load the NeSI module if you are on Maui ancil (unless access to these is limited to NIWA only... which I'm now thinking it might be). [image]

jennan commented 1 year ago

@rosepearson this node is limited to NIWA indeed.

jennan commented 1 year ago

@rosepearson Actually, this is good to mention for the NeSI ancil nodes (which wsg001 is not ;-)), you are right. I'll update the documentation.

rosepearson commented 1 year ago

Looking at running a large example over Wellington

  1. First a medium example for the same data and configuration - just over a smaller region. [image] [image] This used a chunk size of 100, and a memory limit of 10GiB.

  2. Over the large example with the same data and configuration - a larger region - I get stuck with an unresponsive dashboard. I wonder if this is because we still have explicit compute, so perhaps it is trying to allocate too much space in memory for the full DEM? [image] I also get the following error heaps: [image] I used chunk sizes of:

    • 100, 15GiB memory limit - behaviour shown above.
    • 1000, 15GiB memory limit - no errors about garbage collection and a responsive dashboard. [image] This resulted in some killed workers, so I'll try a smaller chunk size. [image]
    • 500, 15GiB memory limit - much slower to load the dashboard and some garbage collection errors. [image] A few workers killed. I'll try increasing the memory limit. [image] [image]
    • 500, 20GiB memory limit - still killed workers. I'll try a smaller chunk size. [image]
    • 250, 20GiB memory limit - [image] sob [image]
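One way to reason about the chunk sizes tried above is to count how many chunks each produces, since fewer, larger chunks mean fewer tasks for the scheduler to track but more data per task. This sketch assumes the chunk size is in cells per side, and the 40,000 x 40,000 raster shape is hypothetical. It counts only the output raster per chunk, not the LiDAR points read for each chunk, which it does not model.

```python
import math


def chunk_stats(n_rows, n_cols, chunk_size, bytes_per_cell=4):
    """Return (number of chunks, bytes of float32 raster held by one chunk)."""
    n_chunks = math.ceil(n_rows / chunk_size) * math.ceil(n_cols / chunk_size)
    chunk_bytes = chunk_size * chunk_size * bytes_per_cell
    return n_chunks, chunk_bytes


raster_shape = (40_000, 40_000)  # hypothetical raster size in cells

stats = {size: chunk_stats(*raster_shape, size) for size in (100, 250, 500, 1000)}

for size, (n_chunks, chunk_bytes) in stats.items():
    print(f"chunk size {size}: {n_chunks} chunks, {chunk_bytes / 1e6:.2f} MB of raster per chunk")
```

Under these assumptions a chunk size of 100 gives 100 times as many chunks as a chunk size of 1000, which is in line with the dashboard struggling at 100 and the workers running out of memory at 1000.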
jennan commented 1 year ago

@rosepearson it is likely an issue with the number of Dask tasks. If so... you can try to

rosepearson commented 1 year ago

Finally got a crash on the SLURM job, @jennan. I've copied the text below for your interest (from the SLURM job output file). I'll comment here when I have a repo you can use to run the code yourself.

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,662 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,663 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,665 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,665 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,666 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,666 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,667 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,668 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,669 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,670 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,670 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,672 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,673 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2022-10-18 11:57:46,673 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/nanny.py", line 819, in kill
    await process.join(max(0, deadline - time()))
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/process.py", line 316, in join
    await asyncio.wait_for(asyncio.shield(self._exit_future), timeout)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
Traceback (most recent call last):
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/main.py", line 191, in <module>
    main()
  File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/main.py", line 187, in main
    launch_processor(args)
  File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/main.py", line 162, in launch_processor
    run_processor_class(
  File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/main.py", line 117, in run_processor_class
    runner.run()
  File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/geofabrics/processor.py", line 578, in run
    self.raw_dem.add_lidar(
  File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/geofabrics/dem.py", line 1327, in add_lidar
    dem = self._add_tiled_lidar_chunked(
  File "/scale_wlg_persistent/filesets/project/niwa03440/geofabrics/GeoFabrics/src/geofabrics/dem.py", line 1414, in _add_tiled_lidar_chunked
    #chunked_dem = chunked_dem.compute()
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/xarray/core/dataset.py", line 901, in compute
    return new.load(**kwargs)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/xarray/core/dataset.py", line 735, in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/dask/base.py", line 600, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/client.py", line 3057, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/client.py", line 2226, in gather
    return self.sync(
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/utils.py", line 339, in sync
    return sync(
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/utils.py", line 406, in sync
    raise exc.with_traceback(tb)
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/utils.py", line 379, in f
    result = yield future
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/nesi/project/niwa03440/conda/envs/geofabrics/lib/python3.10/site-packages/distributed/client.py", line 2089, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task ('elevation_over_chunk-from-value-load_tiles_in_chunk-concatenate-8516fa3f6ef560e5bd6cf02c82d18ec1', 20, 31) on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:34046. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
<Client: 'tcp://127.0.0.1:45727' processes=20 threads=20, memory=400.00 GiB>
rosepearson commented 1 year ago

Just a note on changes I'm stashing for now - [image]

rosepearson commented 1 year ago

Readme for setting up GeoFabrics on the HPC for testing @jennan: readme