mdbartos / pysheds

:earth_americas: Simple and fast watershed delineation in python.
GNU General Public License v3.0

`fill_depressions` causing kernel to crash? #187

Closed JamesSample closed 2 years ago

JamesSample commented 2 years ago

I'm trying to perform terrain pre-processing of a fairly large DEM for Norway. The dataset has a resolution of 40 m (30050 columns by 37750 rows) and a file size of 2.1 GB uncompressed (414 MB compressed, available here).

The following code causes the kernel to crash with the message: "Kernel Restarting: The kernel appears to have died. It will restart automatically."

```python
from pysheds.grid import Grid

dtm_path = "norway_kartverket_40m_dtm_utm_z33.tif"

print("Reading DEM...")
grid = Grid.from_raster(dtm_path)
dem = grid.read_raster(dtm_path)

print("Filling pits...")
dem = grid.fill_pits(dem)

print("Filling depressions...")
dem = grid.fill_depressions(dem)

print("Resolving flats...")
dem = grid.resolve_flats(dem)

print("Calculating flow direction...")
dirmap = (64, 128, 1, 2, 4, 8, 16, 32)
fdir = grid.flowdir(dem, dirmap=dirmap)
del dem

print("Calculating flow accumulation...")
acc = grid.accumulation(fdir, dirmap=dirmap)
```

The output from the code up to the crash is:

```
OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Reading DEM...
Filling pits...
Filling depressions...
```

I have access to a fairly large machine (240 GB of RAM), and memory consumption (monitored using top) peaks at around 35%, so I don't think I'm running out of memory. Nevertheless, 35% of 240 GB is 84 GB, which seems like a lot for a 2 GB DEM?

If I crop 200 km off the top or bottom of the dataset the code completes successfully, so I guess I'm hitting a limit somewhere. Does anyone have any suggestions, please? Or any tips for filling depressions in large datasets (I don't think this can be done using chunks/windows)?

If I comment out the `fill_depressions` call, the code completes successfully, but watersheds delineated from the processed DEM are poor. It's obviously an important step, but I'm not sure how to achieve it with a dataset this large.

I'm on Linux using pysheds=0.3.2.

Thank you!

mdbartos commented 2 years ago

Thanks for raising this issue. `fill_depressions` uses `skimage.morphology.reconstruction` under the hood. The function appears to be optimized in Cython, but you may be able to spot limits in the source code: https://github.com/scikit-image/scikit-image/blob/main/skimage/morphology/_grayreconstruct.pyx

Another potential reason for the high memory usage is that pysheds upcasts floating point types to float64 by default, both to handle compatibility between different datasets and to simplify numba type signatures. You could get around this by casting the pit-filled DEM to float32 (`dem = dem.astype(np.float32)`) and then manually running the following lines from inside `fill_depressions`:

https://github.com/mdbartos/pysheds/blob/9d960962bfe742c534da6577c3353169e6b6a501/pysheds/sgrid.py#L2161-L2169

You would need to convert `dem_out` from an ndarray back to a Raster afterwards.
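Putting those pieces together, a minimal sketch of that workaround might look like the following. The function name `fill_depressions_f32` and the NaN placeholder value are hypothetical; the core recipe (a seed image flooded from the border, reconstructed by erosion) is the standard skimage depression-filling idiom that `fill_depressions` is based on:

```python
import numpy as np
from skimage.morphology import reconstruction

def fill_depressions_f32(dem):
    """Fill depressions via grayscale reconstruction, staying in float32.

    Sketch of the workaround discussed above: cast to float32 before
    calling skimage, and swap NaNs for a low placeholder first (the
    placeholder value here is an arbitrary assumption).
    """
    dem32 = np.array(dem, dtype=np.float32)
    nan_mask = np.isnan(dem32)
    if nan_mask.any():
        dem32[nan_mask] = np.nanmin(dem32) - 9999.0  # placeholder for nodata
    # Seed: keep the border elevations, raise the interior to the maximum,
    # then erode back down -- enclosed pits fill to their spill level.
    seed = dem32.copy()
    seed[1:-1, 1:-1] = dem32.max()
    filled = reconstruction(seed, dem32, method='erosion').astype(np.float32)
    filled[nan_mask] = np.nan  # restore nodata cells
    return filled

# Toy check: a fully enclosed pit fills up to the surrounding elevation.
dem = np.array([[10, 10, 10, 10],
                [10,  2,  2, 10],
                [10,  2, 10, 10],
                [10, 10, 10, 10]], dtype=np.float32)
print(fill_depressions_f32(dem).min())  # 10.0
```

The result would still need wrapping back into a pysheds Raster (with the original viewfinder) before being passed to `resolve_flats`.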

mdbartos commented 2 years ago

Figuring out how to chunk this operation over a set of tiles could also be an interesting research question.

JamesSample commented 2 years ago

Thanks for the reply! I hadn't realised the grids were internally converted to float64, so that explains the memory consumption. I tried a version using float32 instead, but had the same kernel crash, unfortunately.

As a workaround, I've used a publicly available national dataset defining Norway's biggest catchments to split the DEM into hydrologically sensible chunks. My aim is to define smaller catchments within these major catchments, so this approach should work OK in this case.

Figuring out how to do the calculation "properly" in chunks (i.e. without "cheating" by using pre-defined major catchments) would certainly be interesting! Norway is so narrow in the middle that I can actually do it in just two chunks, and the discontinuity at the boundary is quite small (but still significant). I'm not sure how to approach this in a rigorous way, though.

Thanks for PySheds - it's very useful :-)

JamesSample commented 2 years ago

I suspect the problem is related to this issue. The code in this reply causes a similar crash on my system and seems to be related to how NaNs are handled by morphology.reconstruction.

It sounds as though a workaround might be possible by replacing NaNs with some placeholder value during the call to fill_depressions. I'll have a go at this next week.

JamesSample commented 2 years ago

The current implementation of `morphology.reconstruction` in scikit-image will fail if `2 * nrows * ncols > np.iinfo(np.int32).max`, i.e. if the number of cells in the raster is greater than about 1.07e9. See this issue for details.
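Plugging the dimensions of the Norwegian DEM from the top of this thread into that inequality confirms it is over the limit:

```python
import numpy as np

nrows, ncols = 37750, 30050          # DEM dimensions from this thread
cells = nrows * ncols                # 1_134_387_500 cells
limit = np.iinfo(np.int32).max // 2  # 1_073_741_823 cells before 2*n overflows int32
print(cells > limit)  # True
```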

Hopefully this can be fixed in scikit-image. If not, given that PySheds casts everything to 64-bit internally anyway, it would be fairly easy to create a 64-bit version of `morphology.reconstruction` for use in PySheds.
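As a side note, depression filling doesn't strictly require grayscale reconstruction at all: a priority-flood sweep (in the style of Barnes et al.) fills every depression to its spill elevation using ordinary Python/NumPy indexing, so the int32 limit never arises. The sketch below is a toy pure-Python version, far too slow for a national-scale DEM without numba, shown only to illustrate the algorithm:

```python
import heapq
import numpy as np

def priority_flood_fill(dem):
    """Fill depressions by flooding inward from the border in elevation order."""
    nrows, ncols = dem.shape
    filled = dem.astype(np.float64).copy()
    closed = np.zeros(dem.shape, dtype=bool)
    heap = []
    # Seed the priority queue with every border cell.
    for r in range(nrows):
        for c in range(ncols):
            if r in (0, nrows - 1) or c in (0, ncols - 1):
                heapq.heappush(heap, (filled[r, c], r, c))
                closed[r, c] = True
    # Always expand from the lowest frontier cell; any neighbour lower than
    # it is inside a depression and gets raised to the frontier elevation.
    while heap:
        z, r, c = heapq.heappop(heap)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1),
                       (-1, -1), (-1, 1), (1, -1), (1, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < nrows and 0 <= nc < ncols and not closed[nr, nc]:
                closed[nr, nc] = True
                filled[nr, nc] = max(filled[nr, nc], z)
                heapq.heappush(heap, (filled[nr, nc], nr, nc))
    return filled

# Toy check: a pit with a low outlet fills only to the outlet elevation.
dem = np.array([[10., 10., 10.],
                [ 3.,  1., 10.],
                [10., 10., 10.]])
print(priority_flood_fill(dem)[1, 1])  # 3.0
```

A tiled/parallel variant of priority-flood also exists in the literature, which is one possible answer to the chunking question raised earlier in the thread.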

eorland commented 2 years ago

I came across this issue before thinking about raising one of my own when generating catchments:

Windows fatal exception: access violation

Main thread:
Current thread 0x00002218 (most recent call first):
  File "C:\Users\eorland\Anaconda3\envs\pywater\lib\site-packages\pysheds\sgrid.py", line 761 in _d8_catchment
  File "C:\Users\eorland\Anaconda3\envs\pywater\lib\site-packages\pysheds\sgrid.py", line 737 in catchment
  File "C:\Users\eorland\Documents\pfdf\scripts\process_basin_data.py", line 147 in get_basin_polygons
  File "C:\Users\eorland\Documents\pfdf\scripts\process_basin_data.py", line 123 in <lambda>
  File "C:\Users\eorland\Anaconda3\envs\pywater\lib\site-packages\pandas\core\apply.py", line 867 in apply_series_generator
  File "C:\Users\eorland\Anaconda3\envs\pywater\lib\site-packages\pandas\core\apply.py", line 851 in apply_standard
  File "C:\Users\eorland\Anaconda3\envs\pywater\lib\site-packages\pandas\core\apply.py", line 727 in apply
  File "C:\Users\eorland\Anaconda3\envs\pywater\lib\site-packages\pandas\core\frame.py", line 8833 in apply
  File "C:\Users\eorland\Anaconda3\envs\pywater\lib\site-packages\geopandas\geodataframe.py", line 1390 in apply
  File "C:\Users\eorland\Documents\pfdf\scripts\process_basin_data.py", line 123 in shed_workflow
  File "C:\Users\eorland\Documents\pfdf\scripts\process_basin_data.py", line 241 in check_files
  File "C:\Users\eorland\Documents\pfdf\scripts\process_basin_data_workflow.py", line 21 in <module>
  File "C:\Users\eorland\Anaconda3\envs\pywater\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 465 in exec_code
  File "C:\Users\eorland\Anaconda3\envs\pywater\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 585 in runfile
  File "C:\Users\eorland\AppData\Local\Temp\1\ipykernel_11452\502233501.py", line 1 in <cell line: 1>

Restarting kernel...

To be fair, this usually happens when I'm trying to generate catchments for several hundred pour points at a time using the `.apply` method on a GeoDataFrame. I noticed that when I switch to `algorithm='recursive'` the problem goes away.

JamesSample commented 2 years ago

This is now fixed in the main branch of scikit-image by https://github.com/scikit-image/scikit-image/pull/6342.

(Note that it's not included in the current release of scikit-image, so installing from source is required for the moment).

diedro2 commented 1 year ago

I am still experiencing the crash problem. Honestly, I do not know if it is due to a memory problem or to NaNs in `skimage.morphology.reconstruction`. I have tried to create a mask to remove the NaNs, following your suggestions. However, the program crashes as soon as I run `skimage.morphology.reconstruction`. My DEM is a float64 array of shape (8879, 11754), and I have 16100648 kB (about 16 GB) of RAM. What do you think?

JamesSample commented 1 year ago

@diedro2 It's hard to be certain without profiling, but this could well be a memory issue. 8879 x 11754 x 8 bytes is roughly 835 MB. `skimage.morphology.reconstruction` tries to be memory efficient, but nevertheless makes a lot of copies and intermediates - see the related post here, which suggests peak memory consumption can be more than 30 times the grid size. In that example, the call to `fill_depressions` alone consumed 8.7 GB of memory for a 380 MB grid. Scaling this to 835 MB, as in your case, suggests `fill_depressions` could use ~19 GB of memory. In my example at the start of this thread, peak memory consumption was more than 40 times the grid size (>80 GB with a 2 GB grid), so for your data it seems likely you'll need a machine with at least 32 GB of RAM.
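A quick sketch of that arithmetic (the 8.7 GB / 380 MB peak-to-grid ratio is taken from the linked post, and is only a rough empirical estimate):

```python
cells = 8879 * 11754         # diedro2's grid
grid_gb = cells * 8 / 1e9    # float64 storage: ~0.83 GB
ratio = 8.7 / 0.38           # ~23x peak-to-grid ratio from the linked post
peak_gb = grid_gb * ratio    # ~19 GB estimated peak for fill_depressions
print(f"{grid_gb:.2f} GB grid, ~{peak_gb:.0f} GB peak")
```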

The issue with NaNs that I originally linked was not actually the problem in my case, and I believe PySheds handles this correctly. The real issue for me was a bug in `skimage.morphology.reconstruction` (now hopefully fixed) that caused numerical overflow for very large grids (>1e9 cells). This shouldn't be the issue in your case, though, as 8879 x 11754 is roughly 1e8.

I recommend trying some memory profiling, as in the link above. Even just watching your system monitor in the run-up to the crash (e.g. using Task Manager on Windows or the top command on Linux) should give you a pretty good indication of whether it's a memory issue.

Good luck!