pytroll / pyresample

Geospatial image resampling in Python
http://pyresample.readthedocs.org

`get_neighbour_info` slows down significantly when working with large target rasters using many segments #504

Closed. SwamyDev closed this issue 1 year ago.

SwamyDev commented 1 year ago

Code Sample, a minimal, complete, and verifiable piece of code

# benchmark created from modified documentation example
# note: you will need quite a lot of RAM for this benchmark (~40GB), 
# however, you can also reduce the size to see the effect mentioned.
import numpy as np

from pyresample.geometry import SwathDefinition, AreaDefinition
from pyresample.kd_tree import get_neighbour_info

proj_dsc = {'a': '6378144.0', 'b': '6356759.0',
            'lat_0': '50.00', 'lat_ts': '50.00',
            'lon_0': '8.00', 'proj': 'stere'}

lons = np.fromfunction(lambda y, x: 3 + (x / 4), (200, 30))
lats = np.fromfunction(lambda y, x: 75 - (y / 3), (200, 30))
swath = SwathDefinition(lons, lats)
area = AreaDefinition('areaD', 'Europe 3km', 'areaD', proj_dsc, 15000, 15000,
                      [-1370912.72, -909968.64, 1029087.28, 1490031.36])
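# note: preallocate_size is the new parameter proposed in the linked pull request;
# it is not part of the released get_neighbour_info API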
val_in_idc, val_out_idc, idc, distances = get_neighbour_info(
    swath, area, 50000, 16, nprocs=16, segments=150, preallocate_size=area.size)

Problem description

I am working on a software application that deals with large raster files (approximately 15000x15000 pixels). Recently, I encountered an issue where get_neighbour_info slowed down significantly when I used multiple segments to fit the kd_tree lookups into RAM. The code snippet above is a simplified benchmark demonstrating the issue. Profiling the code with py-spy showed that the concatenation of the segments is the cause of the slowdown:

[attached image: py-spy profile showing most of the time spent concatenating segment results]

As the profile shows, the current implementation creates many copies of the segments, which also pushes peak RAM usage up to 2 * total_size - segment_size.
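To illustrate the pattern the profile shows, here is a minimal, self-contained sketch (not pyresample code, sizes chosen arbitrarily): growing the result with repeated np.concatenate re-copies the accumulated data on every iteration, and during the final call both the old and the new buffer are alive at the same time.

import numpy as np

segment = np.zeros(100_000, dtype=np.float64)

result = segment.copy()
for _ in range(149):
    # every call allocates a brand-new array and copies both inputs into it,
    # so the data accumulated so far is copied over and over again
    result = np.concatenate((result, segment))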

Expected Output

To address this, a solution would involve pre-allocating the result arrays to avoid the memcpys. Thanks to your habitable code base and excellent test coverage, I was able to quickly draft a solution proposal, which you can find in the linked pull request. With the result arrays pre-allocated, most of the time is spent in the individual workers querying the kd_tree instead of copying data:

[attached image: py-spy profile showing most of the time spent in the workers' kd_tree queries]

The execution time also went down from ~35 min to just ~4 min.
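For comparison, a minimal sketch (again not the actual PR code) of the pre-allocation idea: the final result is allocated once and every segment writes directly into its own slice, so no intermediate copies of the accumulated result are made.

import numpy as np

segment_size = 100_000
n_segments = 150

result = np.empty(segment_size * n_segments, dtype=np.float64)
for i in range(n_segments):
    # each segment's result goes straight into its slice of the final array
    result[i * segment_size:(i + 1) * segment_size] = 0.0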

Associated pull request

In this first draft I've opted to add a simple pre-allocation size parameter to get_neighbour_info, to avoid increasing the surface of the public API. Another option would be to introduce a new function such as put_neighbour_info, where the results are not returned but written into caller-provided output arrays. This would give users even more flexibility, such as specifying the result data type. However, it comes at the cost of another public API function and probably some significant refactoring of get_neighbour_info to avoid code duplication. What do you think?
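To make the two options more concrete, here is a rough sketch of the call sites, building on the benchmark above (the put_neighbour_info name, its parameters, and the dtypes are purely illustrative; only the preallocate_size keyword exists in the draft PR):

# option 1 (this draft): one extra keyword on the existing get_neighbour_info
valid_in, valid_out, idc, dist = get_neighbour_info(
    swath, area, 50000, 16, segments=150, preallocate_size=area.size)

# option 2 (hypothetical): caller provides the output arrays and nothing is returned
# index_array = np.empty((area.size, 16), dtype=np.uint32)
# distance_array = np.empty((area.size, 16), dtype=np.float64)
# put_neighbour_info(swath, area, 50000, 16, segments=150,
#                    index_array=index_array, distance_array=distance_array)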

djhoese commented 1 year ago

Awesome! Nice job.

> which you can find in the linked pull request.

Sorry, I don't see a link anywhere.

I don't use segments usually so forgive me if I'm misunderstanding, but would an alternative solution be to pre-allocate the final target array no matter what and fill it in with results? Then if a user wanted to provide the array that was being filled in they could provide an out keyword argument to resemble the standard practice in numpy?
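For reference, the out convention in numpy mentioned here looks like this in a minimal, unrelated example: the caller supplies the destination buffer and the ufunc fills it in place (and also returns it).

import numpy as np

a = np.arange(5)
b = np.arange(5)
out = np.empty_like(a)
np.add(a, b, out=out)   # out is filled in place; the same array is also returned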

djhoese commented 1 year ago

Ah there's the PR. Thanks.

SwamyDev commented 1 year ago

> Awesome! Nice job.

Thanks :)

> > which you can find in the linked pull request.
>
> Sorry, I don't see a link anywhere.

You were apparently way too fast for me ;)

> I don't use segments usually so forgive me if I'm misunderstanding, but would an alternative solution be to pre-allocate the final target array no matter what and fill it in with results? Then if a user wanted to provide the array that was being filled in they could provide an out keyword argument to resemble the standard practice in numpy?

Well, there are cases where the index and distance arrays are much smaller than predicted by the target area, e.g. when working with masked data. At first I did exactly that, but one of your tests immediately caught me. One could pre-allocate the arrays to the full output size preemptively and then trim everything at the end to avoid this issue, but that might increase memory usage significantly for some of your users and break their code.
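A minimal sketch (not pyresample code, numbers purely illustrative) of the trade-off described here, assuming only a fraction of the target pixels end up with valid neighbours:

import numpy as np

target_size = 15000 * 15000        # full target area
n_valid = target_size // 10        # e.g. masked/reduced data yields far fewer results

result = np.empty(target_size, dtype=np.float64)   # sized for the worst case (~1.8 GB)
# ... segments would fill result[:n_valid] here ...
result = result[:n_valid].copy()   # trimmed afterwards, but the full buffer was allocated anyway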

I hadn't thought of the out keyword option as a way to avoid a new API function, however. It does indeed resemble numpy's standard practice, which is probably a good idea to follow. I'm just not a fan of this setup, because it is not very expressive and it changes the return behaviour of the function.