rosepearson / GeoFabrics

A package for generating hydrologically conditioned DEMs and roughness maps from LiDAR and other infrastructure data. Check the wiki for install and usage instructions, and documentation at https://rosepearson.github.io/GeoFabrics/
GNU General Public License v3.0

Performance - make use of CPU clusters #32

Closed rosepearson closed 3 years ago

rosepearson commented 3 years ago

Make use of Dask to parallelise the rasterisation of many LiDAR tile files into non-overlapping chunks that will form a single xarray. Each chunk will contain several LiDAR tile files - all that partially overlap with the chunk will be loaded. This will mean that some files are loaded 2-4 times, but overall we can expect a speed-up.

Changes required:

  1. Switch from processing by LiDAR files to processing by chunks, each containing many LiDAR files (this requires a tile index file)
  2. Generate a map of dense data extents after chunking all LiDAR data, prior to the offshore calculations
  3. Make use of the dask.delayed decorator, dask.array, and dask.array.concatenate (see the sketch after the resources below)
  4. Consider using the dask distributed back-end for a dashboard and more control

Resources:

  1. Example dask ipynb - https://github.com/rosepearson/Hydrologic-DEMs-scripts/blob/main/jupyter_notebooks/dask_example.ipynb
  2. Overlapping chunks - https://docs.dask.org/en/stable/array-overlap.html
  3. Back-end schedulers - https://docs.dask.org/en/stable/scheduling.html
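
A minimal sketch of how items 3 and 4 could fit together, assuming a hypothetical `rasterise_chunk()` that loads every LiDAR tile overlapping a chunk's bounds and returns that chunk as a numpy array; the chunk size, bounds, and tile-index handling below are placeholders, not GeoFabrics code:

```python
import dask
import dask.array as da
import numpy as np
from dask.distributed import Client

CHUNK_SIZE = 100  # pixels per chunk edge (illustrative only)


@dask.delayed
def rasterise_chunk(bounds):
    """Placeholder: load all LiDAR tiles overlapping `bounds` and rasterise them."""
    return np.full((CHUNK_SIZE, CHUNK_SIZE), np.nan, dtype=np.float32)


if __name__ == "__main__":
    client = Client()  # distributed back-end: dashboard and more control (item 4)

    # Chunk bounds would come from the LiDAR tile index file; dummy 2x2 grid here
    chunk_bounds = [[(0, 0), (0, 1)],
                    [(1, 0), (1, 1)]]

    rows = []
    for row in chunk_bounds:
        row_arrays = [
            da.from_delayed(rasterise_chunk(bounds),
                            shape=(CHUNK_SIZE, CHUNK_SIZE), dtype=np.float32)
            for bounds in row
        ]
        rows.append(da.concatenate(row_arrays, axis=1))
    dem = da.concatenate(rows, axis=0)   # lazy array covering the whole catchment
    print(dem.compute().shape)           # (200, 200) - triggers the parallel rasterisation
```

Each `rasterise_chunk` call becomes one task in the Dask graph, so the 2-4x re-reading of tiles that straddle chunk boundaries happens in parallel across workers.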
rosepearson commented 3 years ago

Explore using the Python multiprocessing package to speed up the geofabrics package by allowing multiple processors to be used in parallel - both on a desktop computer/laptop and in a cluster/HPC environment.

I have done some investigation into different options for improving the performance of GeoFabrics.

Notes on code structures

There are two areas that seem embarrassingly parallelisable.

  1. Loading LiDAR tiles and creating raster patches (in an xarray)
  2. Solving a function (SciPy linear RBF) over raster patches (into sections of a numpy array and eventually an xarray)

The first stage is complicated by a file-write step that is the only way I can get PDAL to create a raster (using PDAL writers.gdal) - see the sketch below. I have created a numpy-based implementation of this that should be comparable speed-wise if parallelised (see this commit).
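
For reference, a hedged sketch of that PDAL file-write step, assuming the Python pdal bindings and rioxarray for reading the temporary raster back in; the file names and 1 m resolution are made up for illustration, and this is not the GeoFabrics implementation:

```python
import json

import pdal
import rioxarray

# PDAL pipeline: read one LiDAR tile, rasterise it to disk with writers.gdal
pipeline_json = json.dumps([
    {"type": "readers.las", "filename": "tile.laz"},    # hypothetical tile name
    {
        "type": "writers.gdal",
        "filename": "tile_raster.tif",                   # temporary raster on disk
        "resolution": 1.0,                               # assumed 1 m grid
        "output_type": "idw",                            # inverse-distance weighting
        "gdaldriver": "GTiff",
    },
])

pipeline = pdal.Pipeline(pipeline_json)
pipeline.execute()                                       # writes tile_raster.tif
raster = rioxarray.open_rasterio("tile_raster.tif")      # read back into an xarray
```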

Notes on parallelisation options considered/tried

  1. Multiprocessing - This seems like it should work. I have tried using shared memory but haven't achieved a speed-up yet, presumably due to the overhead in accessing the shared memory (see the sketch after this list for the general approach). I've also concluded that, while it should be possible to get a good speed-up using the multiprocessing module, it is probably lower-level than I want to go.
  2. Numba - It is best suited to loops and functions making extensive use of numpy. While I tick these boxes, it doesn't seem to compile with either scipy.interpolate.Rbf or scipy.spatial.KDTree.
  3. Dask - This will create dependency trees showing how code can be optimised, and provides several APIs for optimising code in different ways. It works well on pandas and xarray types. So far this seems like the best option.
     i. I tried doing this naively on a Python IDW implementation using scipy.spatial.KDTree and managed to cause a 10x slow-down, so I'm obviously not using Dask well and creating more overhead than anything. Thinking about it more, I think I will want to initially (and possibly entirely) make use of Dask at a higher level - i.e. when creating the xarray file and reading in different LiDAR files.
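
As a rough illustration of the shared-memory idea in option 1 (not the actual experiment), the workers below attach to a single shared block and write their patch of the DEM into it directly, so nothing is pickled back to the parent process:

```python
import numpy as np
from multiprocessing import Process, shared_memory

SHAPE, DTYPE = (4, 100, 100), np.float32   # 4 patches of 100x100 pixels (assumed)


def rasterise_patch(shm_name, index):
    """Worker: attach to the shared block and fill its slice of the DEM."""
    shm = shared_memory.SharedMemory(name=shm_name)
    dem = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    dem[index] = index                     # stand-in for the real rasterisation work
    shm.close()


if __name__ == "__main__":
    nbytes = int(np.prod(SHAPE)) * np.dtype(DTYPE).itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    dem = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)

    workers = [Process(target=rasterise_patch, args=(shm.name, i))
               for i in range(SHAPE[0])]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    print(dem[:, 0, 0])                    # each patch filled by a separate process
    shm.close()
    shm.unlink()
```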
rosepearson commented 3 years ago

Plan

Based on some conversations with colleagues at NIWA (Wolfgang and Maxime), I will have a go at using xarray ufunc (which can be used with Dask) to load in LiDAR files and interpolate the point clouds into xarray tiles.

Other notes

Finally, Wolfgang's suggestion was to use xarray ufunc, where the worker function loads the necessary LiDAR file(s), performs the PDAL processing, and does the pixel-averaging for a given tile in the xarray array.
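
A minimal sketch of what that might look like with xarray.apply_ufunc and `dask="parallelized"`, assuming a hypothetical `load_and_average()` worker and a 100-pixel chunk size; the real worker would run PDAL and pixel-averaging rather than returning NaNs:

```python
import numpy as np
import xarray as xr


def load_and_average(x_coords, y_coords):
    """Worker applied per dask chunk: would load the LiDAR file(s) covering these
    pixel coordinates, run the PDAL processing, and pixel-average the points."""
    return np.full(x_coords.shape, np.nan, dtype=np.float32)   # placeholder result


# Coordinate grids chunked so each dask chunk lines up with one spatial tile
x_grid, y_grid = np.meshgrid(np.arange(200.0), np.arange(200.0))
x_da = xr.DataArray(x_grid, dims=("y", "x")).chunk({"x": 100, "y": 100})
y_da = xr.DataArray(y_grid, dims=("y", "x")).chunk({"x": 100, "y": 100})

dem = xr.apply_ufunc(
    load_and_average, x_da, y_da,
    dask="parallelized",
    output_dtypes=[np.float32],
)
dem = dem.compute()   # each chunk is rasterised by a separate dask task
```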

rosepearson commented 3 years ago

Python IDW implementation timings

The testing done using the idw_function.py in Hydrologic-DEMs-scripts showed the scipy implementation to be ~ 30% faster for my test LiDAR file.

Leaf size 10, eps for scipy 0

The max between the PDAL writers.gdal and scipy implementation is 1.8189894035458565e-12
The max between the Python KDTree implementations is 1.9895196601282805e-13
Mean GDAL and std time: 3.9056448221206663 and 0.24172256920947374
Mean SciPy and std time: 8.707992148399352 and 0.1381356633160139
Mean GDAL and std time: 12.08405842781067 and 0.48329604026074485

Leaf size 40, eps for scipy 0

Leaf size is: 10, and eps (for scipy) is: 0.1

Leaf size is: 10, and eps (for scipy) is: 0.1, and GDAL size of 100x
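
For context, a hedged sketch of the kind of Python KDTree IDW implementation being timed here (not idw_function.py itself); `leaf_size` and `eps` mirror the parameters varied above:

```python
import numpy as np
from scipy.spatial import cKDTree


def idw_raster(points_xy, values, grid_xy, power=2, k=8, leaf_size=10, eps=0.0):
    """Inverse-distance-weighted average of `values` at each grid cell centre."""
    tree = cKDTree(points_xy, leafsize=leaf_size)
    distances, indices = tree.query(grid_xy, k=k, eps=eps)
    distances = np.maximum(distances, 1e-8)                 # avoid divide-by-zero
    weights = 1.0 / distances ** power
    return np.sum(weights * values[indices], axis=1) / np.sum(weights, axis=1)


# Tiny synthetic example: 1000 random "LiDAR" points, a 10 x 10 grid of cells
rng = np.random.default_rng(42)
points = rng.uniform(0, 100, size=(1000, 2))
elevations = rng.uniform(0, 10, size=1000)
xs, ys = np.meshgrid(np.arange(5, 100, 10), np.arange(5, 100, 10))
grid = np.column_stack([xs.ravel(), ys.ravel()])
dem = idw_raster(points, elevations, grid).reshape(xs.shape)
```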

rosepearson commented 3 years ago

Planned approach prior to meetings with Maxime

Make use of Dask to parallelise the rasterisation of LiDAR tile files into almost-non-overlapping portions (edge pixels overlap) of a single xarray.

I have looked into using xarray.apply_ufunc, but it appears that this works best when the same process is applied along different chunks (as defined by Dask) or dimensions, whereas I would like to apply the rasterisation along different chunks (as defined spatially - not in memory).

xarray documentation: