Closed: rosepearson closed this issue 3 years ago
Explore using the Python multiprocessing package to speed up the geofabrics package by allowing for multiple processors to be used in parallel - both on a desktop computer/laptop but also in a cluster/HPC environment.
I have done some investigation into different options for improving the performance of GeoFabrics.
There are two areas that seem embarrassingly parallelisable.
Based on some conversations with colleagues at NIWA (Wolfgang and Maxime) I will have a go at using xarray's apply_ufunc (which can be used with Dask) to load in LiDAR files and interpolate the point clouds into xarray tiles.
Finally, Wolfgang's suggestion was to use xarray's apply_ufunc, where the worker function loads the necessary LiDAR file(s), performs the PDAL processing and the pixel-averaging for a given tile in the xarray array.
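A minimal sketch of what such a worker could look like, assuming python-pdal for reading and a simple per-pixel mean; the function name, bounds handling and grid layout are placeholders for illustration, not the geofabrics implementation:

```python
# Hypothetical per-tile worker: load a LiDAR file with PDAL and pixel-average
# elevations onto a regular grid. Paths, bounds and averaging are assumptions.
import json
import numpy
import pdal


def rasterise_tile(laz_path: str, bounds: tuple, resolution: float) -> numpy.ndarray:
    """Load one LiDAR tile and pixel-average Z onto a grid defined by bounds.

    bounds is (x_min, y_min, x_max, y_max) of the output tile.
    """
    # Read the point cloud with a PDAL pipeline (readers.las also handles .laz
    # when PDAL is built with laszip support)
    pipeline = pdal.Pipeline(json.dumps([{"type": "readers.las", "filename": laz_path}]))
    pipeline.execute()
    points = pipeline.arrays[0]  # structured array with X, Y, Z fields

    x_min, y_min, x_max, y_max = bounds
    n_cols = int(numpy.ceil((x_max - x_min) / resolution))
    n_rows = int(numpy.ceil((y_max - y_min) / resolution))

    # Accumulate sums and counts per pixel, then divide for the mean elevation
    cols = numpy.clip(((points["X"] - x_min) / resolution).astype(int), 0, n_cols - 1)
    rows = numpy.clip(((y_max - points["Y"]) / resolution).astype(int), 0, n_rows - 1)
    sums = numpy.zeros((n_rows, n_cols))
    counts = numpy.zeros((n_rows, n_cols))
    numpy.add.at(sums, (rows, cols), points["Z"])
    numpy.add.at(counts, (rows, cols), 1)
    with numpy.errstate(invalid="ignore"):
        return sums / counts  # NaN where no points fell in a pixel
```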
The testing done using idw_function.py in Hydrologic-DEMs-scripts showed the scipy implementation to be ~30% faster for my test LiDAR file; the timing output and a sketch of the SciPy approach are below.
The max between the PDAL writers.gdal and scipy implementation is 1.8189894035458565e-12
The max between the Python KDTree implementations is 1.9895196601282805e-13
Mean GDAL and std time: 3.9056448221206663 and 0.24172256920947374
Mean SciPy and std time: 8.707992148399352 and 0.1381356633160139
Mean GDAL and std time: 12.08405842781067 and 0.48329604026074485
The max between the PDAL writers.gdal and scipy implementation is 0.7554746021340719
The max between the Python KDTree implementations is 0.755474602134143
Mean GDAL and std time: 9.707366514205933 and 0.4339296288368706
Mean SciPy and std time: 8.676866602897643 and 0.3205360563555423
Mean SciKit-Learn and std time: 12.117697739601136 and 0.29163889418593236
Mean SciPy and std time functions: 8.753468823432922 and 0.21306887917410763
Mean SciKit-Learn and std time functions: 12.050393390655518 and 0.3096917319001311
The max between the PDAL writers.gdal and scipy implementation is 1.8189894035458565e-12
The max between the Python KDTree implementations is 1.9895196601282805e-13
Mean GDAL and std time: 9.884694838523865 and 0.2507694180090204
Mean SciPy and std time: 9.141642236709595 and 0.33998392481106043
Mean SciKit-Learn and std time: 12.37787389755249 and 0.30228261187495525
Mean SciPy and std time functions: 9.208562016487122 and 0.28436964360134204
Mean SciKit-Learn and std time functions: 12.292914533615113 and 0.20991275536395373
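For reference, a rough sketch of the SciPy cKDTree-based IDW being benchmarked above; the search radius, power and neighbour cap are illustrative, not necessarily the values used in idw_function.py:

```python
# Illustrative inverse-distance-weighted interpolation with scipy's cKDTree.
import numpy
from scipy.spatial import cKDTree


def idw(xy_points, z_values, xy_grid, search_radius=10.0, power=2, max_neighbours=100):
    """Interpolate z_values (at xy_points) onto xy_grid cell centres by IDW."""
    tree = cKDTree(xy_points)
    # Up to max_neighbours points within the search radius of each grid cell
    distances, indices = tree.query(
        xy_grid, k=max_neighbours, distance_upper_bound=search_radius
    )
    z_out = numpy.full(len(xy_grid), numpy.nan)
    for i, (dist, idx) in enumerate(zip(distances, indices)):
        valid = numpy.isfinite(dist)  # missing neighbours are returned as inf
        if not valid.any():
            continue  # no points within the search radius of this cell
        dist, idx = dist[valid], idx[valid]
        if numpy.any(dist == 0):
            z_out[i] = z_values[idx[dist == 0]][0]  # exact hit on a point
        else:
            weights = 1.0 / dist**power
            z_out[i] = numpy.sum(weights * z_values[idx]) / numpy.sum(weights)
    return z_out
```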
Make use of Dask to parallelise the rasterisation of LiDAR tile files into almost-non-overlapping portions (edge pixels overlap) of a single xarray.
I have looked into using xarray.apply_ufunc, but it appears that this works best when the same process is applied along different chunks (as defined by Dask) or dimensions, whereas I would like to apply the rasterisation along different chunks defined spatially rather than by in-memory layout.
Make use of Dask to parallelise the rasterisation of many LiDAR tile files into non-overlapping chunks that will form a single xarray. Each chunk will contain several LiDAR tile files - all that partially overlap with the chunk will be loaded. This will mean that some files are loaded 2-4 times, but overall we can expect a speed-up.
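A hedged sketch of that chunking scheme using dask.delayed, where each spatial chunk loads every LiDAR tile file whose extent overlaps the chunk and the delayed blocks are assembled into one Dask-backed array; the tile index, worker body and chunk sizing below are assumptions for illustration:

```python
# Hypothetical chunked rasterisation: one delayed task per spatial chunk, each
# loading all overlapping LiDAR tile files, assembled with dask.array.block.
import dask
import dask.array
import numpy


def overlaps(extent, bounds):
    """True if two (x_min, y_min, x_max, y_max) boxes intersect."""
    return not (extent[2] < bounds[0] or extent[0] > bounds[2]
                or extent[3] < bounds[1] or extent[1] > bounds[3])


def rasterise_chunk(chunk_bounds, tile_index, resolution):
    """Load all LiDAR files overlapping chunk_bounds and grid them (placeholder)."""
    files = [path for path, extent in tile_index.items() if overlaps(extent, chunk_bounds)]
    # ... PDAL load + pixel averaging / IDW as in the earlier sketches ...
    n = int((chunk_bounds[2] - chunk_bounds[0]) / resolution)
    return numpy.full((n, n), numpy.nan)  # stand-in for the gridded elevations


def build_dem(chunk_grid, tile_index, resolution, chunk_pixels):
    """chunk_grid is a nested list (rows of chunks) of (x_min, y_min, x_max, y_max)."""
    blocks = [
        [
            dask.array.from_delayed(
                dask.delayed(rasterise_chunk)(bounds, tile_index, resolution),
                shape=(chunk_pixels, chunk_pixels),
                dtype=float,
            )
            for bounds in row
        ]
        for row in chunk_grid
    ]
    # Lazy until compute(); the scheduler then runs the chunks in parallel
    return dask.array.block(blocks)
```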
Changes required:
Resources: