rosepearson / GeoFabrics

A package for generating hydrologically conditioned DEMs and roughness maps from LiDAR and other infrastructure data. Check the wiki for install and usage instructions, and documentation at https://rosepearson.github.io/GeoFabrics/
GNU General Public License v3.0

Performance - make use of CPU clusters #32

Closed rosepearson closed 3 years ago

rosepearson commented 3 years ago

Make use of Dask to parallelise the rasterisation of many LiDAR tile files into non-overlapping chunks that will form a single xarray. Each chunk will contain several LiDAR tile files - all that partially overlap with the chunk will be loaded. This will mean that some files are loaded 2-4 times, but overall we can expect a speed-up.

Changes required:

  1. Switch from processing by LiDAR files to processing by chunks, each containing many LiDAR files (this requires a tile index file)
  2. Generate a map of dense data extents after chunking all LiDAR data, prior to the offshore calculations
  3. Make use of the dask.delayed decorator, dask.array, and dask.array.concatenate (see the sketch after the resources below)
  4. Consider using the dask distributed back-end for a dashboard and more control

Resources:

  1. Example dask ipynb - https://github.com/rosepearson/Hydrologic-DEMs-scripts/blob/main/jupyter_notebooks/dask_example.ipynb
  2. Overlapping chunks - https://docs.dask.org/en/stable/array-overlap.html
  3. Back-end schedulers - https://docs.dask.org/en/stable/scheduling.html
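
A minimal sketch of how items 3 and 4 could fit together, assuming a hypothetical `rasterise_chunk()` that loads every LiDAR tile overlapping a chunk's bounds and returns that chunk as a numpy array; the chunk size, bounds, and tile-index handling below are placeholders, not GeoFabrics code:

```python
import dask
import dask.array as da
import numpy as np
from dask.distributed import Client

CHUNK_SIZE = 100  # pixels per chunk edge (illustrative only)


@dask.delayed
def rasterise_chunk(bounds):
    """Placeholder: load all LiDAR tiles overlapping `bounds` and rasterise them."""
    return np.full((CHUNK_SIZE, CHUNK_SIZE), np.nan, dtype=np.float32)


if __name__ == "__main__":
    client = Client()  # distributed back-end: dashboard and more control (item 4)

    # Chunk bounds would come from the LiDAR tile index file; dummy 2x2 grid here
    chunk_bounds = [[(0, 0), (0, 1)],
                    [(1, 0), (1, 1)]]

    rows = []
    for row in chunk_bounds:
        row_arrays = [
            da.from_delayed(rasterise_chunk(bounds),
                            shape=(CHUNK_SIZE, CHUNK_SIZE), dtype=np.float32)
            for bounds in row
        ]
        rows.append(da.concatenate(row_arrays, axis=1))
    dem = da.concatenate(rows, axis=0)   # lazy array covering the whole catchment
    print(dem.compute().shape)           # (200, 200) - triggers the parallel rasterisation
```

Each `rasterise_chunk` call becomes one task in the Dask graph, so the 2-4x re-reading of tiles that straddle chunk boundaries happens in parallel across workers.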
rosepearson commented 3 years ago

Explore using the Python multiprocessing package to speed up the geofabrics package by allowing multiple processors to be used in parallel - both on a desktop computer/laptop and in a cluster/HPC environment.

I have done some investigation into different options for improving the performance of GeoFabrics.

Notes on code structures

There are two areas that seem embarrassingly parallelisable.

  1. Loading LiDAR tiles and creating raster patches (in an xarray)
  2. Solving a function (SciPy linear RBF) over raster patches (into sections of a numpy array and eventually an xarray)

The first stage is complicated by a file-write step that is the only way I can get PDAL to create a raster (using PDAL writers.gdal) - see the sketch below. I have created a numpy-based implementation of this that should be comparable speed-wise if parallelised (see this commit).
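
For reference, a hedged sketch of that PDAL file-write step, assuming the Python pdal bindings and rioxarray for reading the temporary raster back in; the file names and 1 m resolution are made up for illustration, and this is not the GeoFabrics implementation:

```python
import json

import pdal
import rioxarray

# PDAL pipeline: read one LiDAR tile, rasterise it to disk with writers.gdal
pipeline_json = json.dumps([
    {"type": "readers.las", "filename": "tile.laz"},    # hypothetical tile name
    {
        "type": "writers.gdal",
        "filename": "tile_raster.tif",                   # temporary raster on disk
        "resolution": 1.0,                               # assumed 1 m grid
        "output_type": "idw",                            # inverse-distance weighting
        "gdaldriver": "GTiff",
    },
])

pipeline = pdal.Pipeline(pipeline_json)
pipeline.execute()                                       # writes tile_raster.tif
raster = rioxarray.open_rasterio("tile_raster.tif")      # read back into an xarray
```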

Notes on parallelisation options considered/tried

  1. Multiprocessing - This seems like it should work. I have tried using shared memory but haven't achieved a speed-up yet, presumably due to the overhead in accessing the shared memory (see the sketch after this list for the general approach). I've also concluded that, while it should be possible to get a good speed-up using the multiprocessing module, it is probably lower-level than I want to go.
  2. Numba - It is best suited to loops and functions making extensive use of numpy. While I tick these boxes, it doesn't seem to compile with either scipy.interpolate.Rbf or scipy.spatial.KDTree.
  3. Dask - This will create dependency trees showing how code can be optimised, and provides several APIs for optimising code in different ways. It works well on pandas and xarray types. So far this seems like the best option.
     i. I tried doing this naively on a Python IDW implementation using scipy.spatial.KDTree and managed to cause a 10x slow-down, so I'm obviously not using Dask well and creating more overhead than anything. Thinking about it more, I think I will want to initially (and possibly entirely) make use of Dask at a higher level - i.e. when creating the xarray file and reading in different LiDAR files.
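
As a rough illustration of the shared-memory idea in option 1 (not the actual experiment), the workers below attach to a single shared block and write their patch of the DEM into it directly, so nothing is pickled back to the parent process:

```python
import numpy as np
from multiprocessing import Process, shared_memory

SHAPE, DTYPE = (4, 100, 100), np.float32   # 4 patches of 100x100 pixels (assumed)


def rasterise_patch(shm_name, index):
    """Worker: attach to the shared block and fill its slice of the DEM."""
    shm = shared_memory.SharedMemory(name=shm_name)
    dem = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    dem[index] = index                     # stand-in for the real rasterisation work
    shm.close()


if __name__ == "__main__":
    nbytes = int(np.prod(SHAPE)) * np.dtype(DTYPE).itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    dem = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)

    workers = [Process(target=rasterise_patch, args=(shm.name, i))
               for i in range(SHAPE[0])]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    print(dem[:, 0, 0])                    # each patch filled by a separate process
    shm.close()
    shm.unlink()
```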
rosepearson commented 3 years ago

Plan

Based on some conversations with colleagues at NIWA (Wolfgang and Maxime), I will have a go at using xarray ufunc (which can be used with Dask) to load in LiDAR files and interpolate the point clouds into xarray tiles.

Other notes

Finally, Wolfgang's suggestion was to use xarray ufunc, where the worker function loads the necessary LiDAR file(s), performs the PDAL processing, and does the pixel-averaging for a given tile in the xarray array.
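
A minimal sketch of what that might look like with xarray.apply_ufunc and `dask="parallelized"`, assuming a hypothetical `load_and_average()` worker and a 100-pixel chunk size; the real worker would run PDAL and pixel-averaging rather than returning NaNs:

```python
import numpy as np
import xarray as xr


def load_and_average(x_coords, y_coords):
    """Worker applied per dask chunk: would load the LiDAR file(s) covering these
    pixel coordinates, run the PDAL processing, and pixel-average the points."""
    return np.full(x_coords.shape, np.nan, dtype=np.float32)   # placeholder result


# Coordinate grids chunked so each dask chunk lines up with one spatial tile
x_grid, y_grid = np.meshgrid(np.arange(200.0), np.arange(200.0))
x_da = xr.DataArray(x_grid, dims=("y", "x")).chunk({"x": 100, "y": 100})
y_da = xr.DataArray(y_grid, dims=("y", "x")).chunk({"x": 100, "y": 100})

dem = xr.apply_ufunc(
    load_and_average, x_da, y_da,
    dask="parallelized",
    output_dtypes=[np.float32],
)
dem = dem.compute()   # each chunk is rasterised by a separate dask task
```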

rosepearson commented 3 years ago

Python IDW implementation timings

The testing done using the idw_function.py in Hydrologic-DEMs-scripts showed the scipy implementation to be ~ 30% faster for my test LiDAR file.

Leaf size 10, eps for scipy 0

The max between the PDAL writers.gdal and scipy implementation is 1.8189894035458565e-12
The max between the Python KDTree implementations is 1.9895196601282805e-13
Mean GDAL and std time: 3.9056448221206663 and 0.24172256920947374
Mean SciPy and std time: 8.707992148399352 and 0.1381356633160139
Mean GDAL and std time: 12.08405842781067 and 0.48329604026074485

Leaf size 40, eps for scipy 0

Leaf size is: 10, and eps (for scipy) is: 0.1

Leaf size is: 10, and eps (for scipy) is: 0.1, and GDAL size of 100x
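
For context, a hedged sketch of the kind of Python KDTree IDW implementation being timed here (not idw_function.py itself); `leaf_size` and `eps` mirror the parameters varied above:

```python
import numpy as np
from scipy.spatial import cKDTree


def idw_raster(points_xy, values, grid_xy, power=2, k=8, leaf_size=10, eps=0.0):
    """Inverse-distance-weighted average of `values` at each grid cell centre."""
    tree = cKDTree(points_xy, leafsize=leaf_size)
    distances, indices = tree.query(grid_xy, k=k, eps=eps)
    distances = np.maximum(distances, 1e-8)                 # avoid divide-by-zero
    weights = 1.0 / distances ** power
    return np.sum(weights * values[indices], axis=1) / np.sum(weights, axis=1)


# Tiny synthetic example: 1000 random "LiDAR" points, a 10 x 10 grid of cells
rng = np.random.default_rng(42)
points = rng.uniform(0, 100, size=(1000, 2))
elevations = rng.uniform(0, 10, size=1000)
xs, ys = np.meshgrid(np.arange(5, 100, 10), np.arange(5, 100, 10))
grid = np.column_stack([xs.ravel(), ys.ravel()])
dem = idw_raster(points, elevations, grid).reshape(xs.shape)
```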

rosepearson commented 3 years ago

Planned approach prior to meetings with Maxime

Make use of Dask to parallelise the rasterisation of LiDAR tile files into almost-non-overlapping portions (edge pixels overlap) of a single xarray.

I have looked into using xarray.apply_ufunc, but it appears that this works best when the same process is applied along different chunks (as defined by Dask) or dimensions, whereas I would like to apply the rasterisation along different chunks (as defined spatially - not in memory).

xarray documentation: