pytroll / pyresample

Geospatial image resampling in Python
http://pyresample.readthedocs.org
GNU Lesser General Public License v3.0

Add pre-allocation option to `get_neighbour_info` to improve performance on large raster data #505

Closed SwamyDev closed 1 year ago

SwamyDev commented 1 year ago

When applying get_neighbour_info to large target arrays with many segments, it slows down significantly due to the memcpys performed when concatenating segments. Allowing the user to pre-allocate the result arrays avoids these copies, resulting in significant speed-ups. See also the linked issue.
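For illustration, a minimal sketch of the pattern in question (the names and sizes below are made up for the example, not the actual kd_tree internals): repeated concatenation copies everything accumulated so far on each iteration, while a pre-allocated array is written to once per segment.

import numpy as np

n_segments, rows_per_segment = 100, 50_000

# Current pattern: every np.concatenate copies all rows accumulated so far,
# so the total amount of copying grows with the number of segments.
result = np.empty((0,), dtype=np.float64)
for _ in range(n_segments):
    segment = np.random.random(rows_per_segment)
    result = np.concatenate((result, segment))

# Pre-allocated pattern: allocate once, write each segment into its slice.
result = np.empty(n_segments * rows_per_segment, dtype=np.float64)
offset = 0
for _ in range(n_segments):
    segment = np.random.random(rows_per_segment)
    result[offset:offset + segment.shape[0]] = segment
    offset += segment.shape[0]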

codecov[bot] commented 1 year ago

Codecov Report

Merging #505 (b36e047) into main (0340c4f) will increase coverage by 0.04%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #505      +/-   ##
==========================================
+ Coverage   94.27%   94.31%   +0.04%     
==========================================
  Files          78       79       +1     
  Lines       12838    12898      +60     
==========================================
+ Hits        12103    12165      +62     
+ Misses        735      733       -2     
Flag Coverage Δ
unittests 94.31% <100.00%> (+0.04%) :arrow_up:

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pyresample/kd_tree.py 92.45% <100.00%> (+0.43%) :arrow_up:
pyresample/test/test_utils.py 100.00% <100.00%> (ø)
pyresample/utils/row_appendable_array.py 100.00% <100.00%> (ø)

coveralls commented 1 year ago

Coverage Status

Coverage: 93.839% (+0.04%) from 93.795% when pulling b36e04717d84dabf1ec56c6c2791917128de118a on SwamyDev:main into 0340c4f507153c56b831fafd1d95c5f4b306add8 on pytroll:main.

djhoese commented 1 year ago

Let's move the conversation to here. In the issue you mentioned:

I haven't thought of the out keyword option, however, as a way to avoid a new API function. It does indeed resemble standard numpy practice, which is probably a good idea to follow. I'm just not a fan of this setup, because it is not very expressive and it changes the return behaviour of the function.

I think I would agree if it wasn't so prevalent in numpy. But it is also uglier in our case since this function is returning multiple things. So yeah I'm not sold on it either, but I'd really like a more predictive solution if it is at all possible. Basically if we're going to update this why not go all the way to having the final output arrays allocated once.
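For reference, the numpy out= convention mentioned here looks like the following (a trivial numpy illustration, not pyresample code):

import numpy as np

a = np.arange(5)
b = np.arange(5)
result = np.empty(5, dtype=a.dtype)
np.add(a, b, out=result)  # fills the caller-provided array in place and also returns it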

Well there are cases where the index and distance arrays are much smaller than predicted by the target area, i.e. when working with masked data. At first I did exactly that, but one of your tests immediately caught me.

Do you remember which test this ^ was? Our tests are not infallible and we sometimes get into cases where we over-mock our tests so they aren't always realistic (I'm working on reducing these cases).

Also note that I'm working on a 2.0 version of Pyresample so if there are obvious optimizations we want to include in that version that would break too much in 1.x then let's keep those in mind and discuss them. Nowadays the pyresample maintainers (who are also Satpy maintainers) have been switching to dask for doing work with large datasets so we maybe haven't hit this type of segmented performance hit as early as we would have otherwise.

SwamyDev commented 1 year ago

I think I would agree if it wasn't so prevalent in numpy. But it is also uglier in our case since this function is returning multiple things. So yeah I'm not sold on it either, but I'd really like a more predictive solution if it is at all possible. Basically if we're going to update this why not go all the way to having the final output arrays allocated once.

Well, one approach to deal with the multiple-output issue would be to introduce a result dataclass object. It would also make the code a tiny bit more descriptive, as the return arrays would get proper names. However, that would again break API contracts, albeit one that would probably be easily fixable for users. The result object could also provide a to_tuple function which would return the arrays in exactly the same order as before, so people upgrading from previous versions would just need to add this call. But then again, it would probably produce quite a few code changes.

Do you remember which test this ^ was? Our tests are not infallible and we sometimes get into cases where we over-mock our tests so they aren't always realistic (I'm working on reducing these cases).

I've just recreated the error on my end and this was the offending test: test_nearest_masked_swath_target in test_kd_tree.py:128

Also note that I'm working on a 2.0 version of Pyresample so if there are obvious optimizations we want to include in that version that would break too much in 1.x then let's keep those in mind and discuss them. Nowadays the pyresample maintainers (who are also Satpy maintainers) have been switching to dask for doing work with large datasets so we maybe haven't hit this type of segmented performance hit as early as we would have otherwise.

Alright, good to know that there is a 2.0 version in the making :). Then maybe I'd suggest introducing the return object and the respective out parameter in the next major version - if you think that a return object approach is suitable, that is. We could integrate the less intrusive pre-alloc size change into this version? Or maybe skip it altogether and I just work with my fork for the time being until 2.0 comes out.

In any case, I'd be happy to integrate/help implement the output parameter approach no matter which version.

SwamyDev commented 1 year ago

I've been thinking about this a lot and I'm wondering if preallocate_size shouldn't be a kwarg. Instead what about just always doing it? I'm not sure why the user needs control over how much is pre-allocated. They are going to get the final array no matter what and that will be larger than or equal to any logical pre-allocated size, right?

Yes, in general that would be no problem; however, the test_nearest_masked_swath_target test shows significantly smaller sizes for next_ia and next_da, which means that if I always pre-allocate with the area size, RAM usage might become prohibitively large for users of this feature. But I think the lazy initialisation dependent on the first segment solves this problem elegantly.

Could we instead pass the number of segments to RowAppendableArray and use the first appended array to estimate the preallocation? So int(segments * first_append.shape[0] * 0.75) to get 75% of the number of rows estimated from the first appended thing.

However, I wouldn't do the 75% thing, because that gets you exactly to the initial problem of unnecessary copies. It is then actually guaranteed to happen, because you will always have a pre-allocated array that is a little bit too small. Instead I would just allocate int(segments * first_appended.shape[0]). This already neatly solves the problem of the smaller next_ia and next_da sizes in the test_nearest_masked_swath_target case, and we can drop the preallocate_size kwarg. I'll implement it and we can have a look if it works/we like it.
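For illustration, a rough sketch of what such a lazily pre-allocating helper could look like (this is only the idea, not the actual RowAppendableArray implementation added by this PR):

import numpy as np


class SegmentPreallocatedRows:
    """Illustrative helper: pre-allocate based on the first appended segment."""

    def __init__(self, expected_segments):
        self._expected_segments = expected_segments
        self._data = None
        self._rows = 0

    def append_rows(self, segment):
        if self._data is None:
            # Lazily allocate expected_segments times the rows of the first segment.
            shape = (self._expected_segments * segment.shape[0],) + segment.shape[1:]
            self._data = np.empty(shape, dtype=segment.dtype)
        end = self._rows + segment.shape[0]
        if end > self._data.shape[0]:
            # Fall back to growing the buffer (this is the copy we try to avoid).
            self._data = np.concatenate((self._data[:self._rows], segment))
        else:
            self._data[self._rows:end] = segment
        self._rows = end

    def to_array(self):
        if self._data is None:
            return np.empty((0,))
        # Trim any unused pre-allocated rows.
        return self._data[:self._rows]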

Are the arrays being appended here all 1D though? I mean the next_voi, next_ia, next_da from _query_resample_kdtree. Or does that increase when you do more than one neighbor? Or am I just completely wrong?

Yes, if you increase the neighbours, next_ia and next_da have an additional dimension; however, that shouldn't be an issue, as this is covered by tests and seems to work without any problems.

Not sure if I mentioned this, but I think the long term solution would be to pass the "output" arrays for each of these to _query_resample_kdtree or put the segmenting logic into _query_resample_kdtree so the creation of the intermediate segment results is in the same spot as the large array.

Makes sense to me to put it into _query_resample_kdtree. Also I think the long term solution might involve the out kwarg you mentioned above and packing the result arrays into a dataclass or result object, i.e.:

from dataclasses import dataclass

import numpy as np
from numpy.typing import NDArray


@dataclass
class NeighbourInfo:
    valid_input_index: NDArray
    valid_output_index: NDArray
    index_array: NDArray
    distance_array: NDArray

    @classmethod
    def make_preallocated(cls, in_size, out_size, n_neighbours):
        return cls(np.empty(in_size, dtype=bool), np.empty(out_size, dtype=bool),
                   np.empty((out_size, n_neighbours), dtype=np.uint64),
                   np.empty((out_size, n_neighbours), dtype=np.float64))


def get_neighbour_info(source_geo_def, target_geo_def, radius_of_influence,
                       neighbours=8, epsilon=0, reduce_data=True,
                       nprocs=1, segments=None, out=None):
    """
    ...
    Returns
    -------
    NeighbourInfo : dataclass object of numpy arrays containing
        neighbour resampling info
    """
    ...

# default usage:
neighbours = get_neighbour_info(src, target, 6000)
do_some_neighbour_lookup(neighbours.index_array)
...

# pre-allocated usage with helper factory
neighbours = NeighbourInfo.make_preallocated(src.size, target.size, 8)
get_neighbour_info(src, target, 6000, out=neighbours)
do_some_neighbour_lookup(neighbours.index_array)
...

# pre-allocation with complete control, allowing for instance dtype specification
neighbours = NeighbourInfo(np.empty(src.size, dtype=bool), np.empty(target.size, dtype=bool),
                           np.empty((target.size, 8), dtype=np.uint16),
                           np.empty((target.size, 8), dtype=np.float32))
get_neighbour_info(src, target, 6000, out=neighbours)
do_some_neighbour_lookup(neighbours.index_array)
...
SwamyDev commented 1 year ago

Oh, and I forgot NeighbourInfo could also provide a to_tuple function to make it easier for users to switch to the new API when updating, i.e.:

@dataclass
class NeighbourInfo:
    valid_input_index: NDArray
    valid_output_index: NDArray
    index_array: NDArray
    distance_array: NDArray

    @classmethod
    def make_preallocated(cls, in_size, out_size, n_neighbours):
       ...

    def to_tuple(self):
        return self.valid_input_index, self.valid_output_index, self.index_array, self.distance_array

# example user-code before API update
ingest_neighbour_info(*get_neighbour_info(src, target, 6000))

# example user-code after API update
ingest_neighbour_info(*get_neighbour_info(src, target, 6000).to_tuple())

It would require users to update all their get_neighbour_info calls, but given the simplicity of the change, and that in my opinion it is quite safe, I don't think it is asking too much.

Of course there is a fine line between backwards compatibility and API improvements, which should be discussed. I think in this case the improvement justifies the slight inconvenience of adding the to_tuple calls, which is of course why I suggested it ;).

djhoese commented 1 year ago

I'd rather have it just be a namedtuple or tuple subclass I think. But that's probably not needed for this version of the API...yet.

SwamyDev commented 1 year ago

True, a named tuple would also do the trick nicely.
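For illustration, a NamedTuple-based variant might look like the following (a sketch only, assuming the same four fields; not code from this PR). Because a NamedTuple is still a tuple, existing call sites that unpack the result keep working without a to_tuple call:

from typing import NamedTuple

import numpy as np
from numpy.typing import NDArray


class NeighbourInfo(NamedTuple):
    valid_input_index: NDArray
    valid_output_index: NDArray
    index_array: NDArray
    distance_array: NDArray


# Existing unpacking code keeps working because NamedTuple subclasses tuple:
info = NeighbourInfo(np.empty(10, dtype=bool), np.empty(20, dtype=bool),
                     np.empty((20, 8), dtype=np.uint64),
                     np.empty((20, 8), dtype=np.float64))
valid_input_index, valid_output_index, index_array, distance_array = info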

djhoese commented 1 year ago

However, I wouldn't do the 75% thing, because that gets you exactly to the initial problem of unnecessary copies. It is then actually guaranteed to happen, because you will always have a pre-allocated array that is a little bit too small. Instead I would just allocate int(segments * first_appended.shape[0]). This already neatly solves the problem of the smaller next_ia and next_da sizes in the test_nearest_masked_swath_target case, and we can drop the preallocate_size kwarg. I'll implement it and we can have a look if it works/we like it.

I'm confused now. In the case where no data is masked/dropped, then int(segments * first_appended.shape[0]) will give you the full output array, right? Isn't that what you wanted to avoid? If all other segments following the first one are masked and therefore smaller, wouldn't that be a lot of wasted space which you were concerned about?

In the case where the first segment array is shorter than all other segments then you end up with a smaller array and have to do .append which results in copies.

The 75% was off the top of my head with the idea that it would minimize the number of copies (.append calls) to 1-3 in the no-masked data case...at least I think.

SwamyDev commented 1 year ago

I'm confused now. In the case where no data is masked/dropped, then int(segments * first_appended.shape[0]) will give you the full output array, right? Isn't that what you wanted to avoid? If all other segments following the first one are masked and therefore smaller, wouldn't that be a lot of wasted space which you were concerned about?

The thing I was worried about is that the total size of the index_array and distance_array is much smaller when masked arrays are used as input. At least that is what I observed in the test_nearest_masked_swath_target test case. However, your suggestion of using the number of expected segments and the size of the first segment already gives me a good estimate for the total size of those two arrays, at least in this case.

In the case where the first segment array is shorter than all other segments then you end up with a smaller array and have to do .append which results in copies.

Can this happen? I haven't observed this so far, but tbh I also didn't look that closely; I'll investigate.

The 75% was off the top of my head with the idea that it would minimize the number of copies (.append calls) to 1-3 in the no-masked data case...at least I think.

Yes, that's true; however, if the common case is that segments are of the same size, or at least that the first segment is as big as it can be, it would be better not to reduce the initial size, as the number of allocated bytes would usually be correct. Only if the first segment is smaller than the rest would you then get a few appends at the end. But from what I've seen so far of how the code behaves in the tests, I'd assume that this occurs rarely?

djhoese commented 1 year ago

In the case where the first segment array is shorter than all other segments then you end up with a smaller array and have to do .append which results in copies.

Can this happen? I haven't observed this so far, but tbh I also didn't look that closely; I'll investigate.

If I'm understanding the masking you're talking about, I don't see why any combination of the segments could be shorter than the max size. The masking is done when output pixels are off the Earth (right?), so depending on the target AreaDefinition the space or invalid pixels could fall in the first segment, in any of the middle segments, or in the last segment. So it feels like we either code for the "reduce the number of copies but still assume they will happen" case or the "take up more memory than might be needed but never require appends/copies" case.

SwamyDev commented 1 year ago

So, I had a closer look into the _query_resample_kdtree function, and as far as I understand it, pixels are masked away either if they are off the Earth, as you mentioned, or if masked numpy arrays are passed as part of the target area definition. If I understand this correctly, in kd_tree.py:454 the "pixels fallen off the Earth" are combined with the mask of the output coordinates defined by the user, and in line 455 everything is converted to a standard numpy array.

This means the user can significantly control the size of the output area by masking many coordinates of the target area, right? But I think by using the calculate-from-the-first-segment-size approach, we already take care of this, as the number of user-masked target area coordinates doesn't change during the process.

This leaves the pixels that fell off the Earth. My assumption would be that this is a relatively rare occurrence compared to the standard case, or at least that when it happens it doesn't usually affect a significant chunk of pixels. However, I could be totally wrong about that.

But still, my feeling is that it would be better to code for the "take up more memory than might be needed but never require appends/copies" case, because with the current solution the additional memory would usually be relatively small (my gut feeling says ~10-20% max) compared to almost two times peak memory usage if a copy is required when appending segments close to the end of the iteration.

I might be biased here, because in my specific applications even one copy at the end already makes RAM usage almost prohibitively large as my target area is huge, but I could afford 10-20% more (living on the edge ;)).


Now thinking about this, we could also take care of this issue by pre-allocating upfront, but instead of using target_area.size we could do something like target_area.size_valid. This would take care of users who have significantly smaller RAM usage because they have masked huge chunks of their target area, and it would also avoid calculating the pre-allocation size based on a chunk which contained invalid pixels because some fell off the Earth. This would still mean that we might pre-allocate a bit too much memory, depending on how many pixels turn out to be off the Earth, but I'd assume people usually expect that all or almost all pixels in their target area are valid. What do you think?

djhoese commented 1 year ago

This means the user can significantly control the size of the output area by masking many coordinates of the target area, right? But I think by using the calculate-from-the-first-segment-size approach, we already take care of this, as the number of user-masked target area coordinates doesn't change during the process.

I'm not sure how common it is to provide a target area (in this case a SwathDefinition) where a significant portion of the pixels are masked. In those cases, I would suggest a user provide a smaller version of the target area. The use case we (maintainers of pyresample and satpy) have for a SwathDefinition with invalid pixels is space-based weather instruments whose scanning pattern (the way they observe the Earth) results in invalid or duplicated pixels which are all masked away. This use case of using a SwathDefinition is not what I would consider our usual case though as most users want to grid/project/resample their data from a non-uniform swath of pixels, not to a swath.

The off-the-Earth case is decently common for geostationary weather satellites where they see the "full disk" of the Earth, but have rectangular scanning patterns so the corners of their data arrays are all space pixels (invalid).

Now thinking about this, we could also take care of this issue by pre-allocating upfront, but instead of using target_area.size we could do something like target_area.size_valid.

What is .size_valid?

I'm starting to lean towards a full pre-allocation the size of the target area and removing any unused rows at the end. I don't think we can predict where/when/which segments are going to be masked. In the satellite cases I mentioned above, we could have an equal number of masked pixels per segment, for example the SwathDefinition case due to a repeating invalid pixel pattern. Or we could have the first segment have a lot of invalid pixels, for example the geostationary area due to space pixels in the corner, but most of the later segments would have little to no masked pixels. Or we could have a lot of masked pixels in later segments depending on what sub-set of the geostationary area is being targeted.

SwamyDev commented 1 year ago

The off-the-Earth case is decently common for geostationary weather satellites where they see the "full disk" of the Earth, but have rectangular scanning patterns so the corners of their data arrays are all space pixels (invalid).

Ah, I didn't think of geostationary satellites. I mostly work with sensors in the 500-100km altitude range, usually with high resolution, which leads to this issue. It is a testament to the usefulness of this package that it can be applied to such a broad range of applications :).

What is .size_valid?

It doesn't exist yet; I thought of it as a shorthand for a potential function that could somehow return the number of valid pixels in a target area. But if off-Earth pixels are a common case in your application, this is probably not easily implementable. For masked target arrays one could return the number of unmasked pixels, but it is probably not worth the extra effort if the swath-to-swath case is not that common anyway.
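For the masked-target part, a minimal sketch of that idea (purely illustrative; a size_valid attribute does not exist in pyresample):

import numpy as np

# Hypothetical: count the unmasked target coordinates to size the pre-allocation.
target_lons = np.ma.masked_invalid([[0.0, 1.0], [np.nan, 2.0]])
size_valid = int(target_lons.count())  # number of unmasked pixels, here 3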

I'm starting to lean towards a full pre-allocation the size of the target area and removing any unused rows at the end. I don't think we can predict where/when/which segments are going to be masked. In the satellite cases I mentioned above, we could have an equal number of masked pixels per segment, for example the SwathDefinition case due to a repeating invalid pixel pattern. Or we could have the first segment have a lot of invalid pixels, for example the geostationary area due to space pixels in the corner, but most of the later segments would have little to no masked pixels. Or we could have a lot of masked pixels in later segments depending on what sub-set of the geostationary area is being targeted.

Yes, considering all that, I would agree and also lean towards a full pre-allocation with a later trimming. Do you have any further suggestions or concerns? Otherwise I would implement this.
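For clarity, a minimal sketch of the agreed approach (the segment sizes and names are illustrative, not the actual kd_tree internals): allocate for the full target size up front, fill per segment, and trim the unused tail once at the end:

import numpy as np

target_size, n_neighbours = 1_000_000, 8


def compute_segments():
    # Stand-in for the per-segment kd-tree query results (hypothetical sizes).
    for rows in (300_000, 300_000, 250_000):
        yield np.zeros((rows, n_neighbours), dtype=np.uint32)


# Pre-allocate for the full (unmasked) target size.
index_array = np.empty((target_size, n_neighbours), dtype=np.uint32)
rows_written = 0
for segment_result in compute_segments():
    end = rows_written + segment_result.shape[0]
    index_array[rows_written:end] = segment_result
    rows_written = end

# Trim the unused rows once at the end (this slice is a view, not a copy).
index_array = index_array[:rows_written]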

djhoese commented 1 year ago

It doesn't exist yet; I thought of it as a shorthand for a potential function that could somehow return the number of valid pixels in a target area. But if off-Earth pixels are a common case in your application, this is probably not easily implementable. For masked target arrays one could return the number of unmasked pixels, but it is probably not worth the extra effort if the swath-to-swath case is not that common anyway.

In general, I (we?) have been trying to avoid calculating things that require looking at every data point. These types of calculations are difficult or wasteful or don't perform well when using libraries like dask where only chunks of data are operated on at a time. Additionally, it really messes with CPU/memory cache locality that we have to look at every pixel to calculate something, then load it (inputs) all again to build the kdtree, then load the target all again to compute valid indexes, etc. We've gotten used to replacing invalid data with NaNs and just deal with the memory/computation hit for processing these extra pixels with the gain that we can easily predict sizes of input and output arrays and divide work up across threads/processes/nodes.

I'll bring up this PR again on the pytroll slack and see if the other maintainers have an opinion. If you are bored then feel free to implement this, but no need to rush on it otherwise.

djhoese commented 1 year ago

However, I suppose it would be easiest to get their feedback if the code was finished...

SwamyDev commented 1 year ago

Yeah, the issue with loading all the shards to calculate some total makes a lot of sense to me when you're intensively using something like dask - we usually do as well, but this specific case I'm working on right now requires some "special attention". Funnily enough, we came to the same conclusion regarding NaNs as invalid data; it really makes things much easier, so no argument from me here ;).

Alright, then I'll implement it, shouldn't take too long anyways.