tobac-project / tobac

Tracking and object-based analysis of clouds
BSD 3-Clause "New" or "Revised" License
101 stars 53 forks source link

`filter_min_distance` behavior depends on feature number #230

Closed freemansw1 closed 1 year ago

freemansw1 commented 1 year ago

From @snilsn: A theoretical case:

Imagine 3 features with increasing threshold values. Feature 1 + Feature 2 and Feature 2 + Feature 3 are closer than the minimum distance, but Feature 1 + Feature 3 are not. If (1, 2) comes before (2, 3) in the indice list, both Feature 1 and Feature 2 are deleted. Is this a potential problem?

_Originally posted by @snilsn in https://github.com/tobac-project/tobac/pull/209#discussion_r1052399898_

w-k-jones commented 1 year ago

I think anything that changes the output depending on the order of operations is a potential problem.

It would be good to clarify here what we like to have as the ideal output. Should all three features be removed? Should only the lower threshold value feature be removed when closer than the minimum distance, rather than both? In this hypothetical case, should we aim to remove feature 2 (the middle feature), leaving 1 and 3 as valid, or remove all 3?

A possible quick fix to prevent changes due to order of operations would be to calculate all the minimum differences between features before removing any

w-k-jones commented 1 year ago

Note that we could likely improve both the functionality and performance of the nearest neighbour search by using a KDTree method to look up nearest neighbours: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html

snilsn commented 1 year ago

I found the scipy pairwise distance function in the meantime, which could (with a bit of masking an sorting) probably be used to make this independet of the order of operations. The idea of @w-k-jones with a KDTree is probably ab bit more efficient though, this would be a light version of the things trackpy does under the hood.

The idea of min_distance is to remove features, that actually belong to a close, more pronounced feature, correct? Then I think that removing 1 and 2 in the example is probably the right way to go, if we assume that 1 actually belongs to 2 and 2 actually belongs to 3

w-k-jones commented 1 year ago

I agree, both 1 and 2 should be removed

Also having looked into this I've noticed a number of problems and bugs linked to this in feature detection:

  1. filter_min_distance will always select the feature with the larger threshold value, even if the target is "minimum"!
  2. filter_min_distance is called within a separate loop in feature_detection_multithreshold, when it would be better to call it within feature_detection_multithreshold_timestep
  3. min_distance is passed to feature_detection_threshold without being used

At present, the filtering selects the feature that meets the largest threshold value, and if they are equal then the feature with the largest number of pixels. If both have the same number of pixels then the second is removed. Do we want to keep it this way, or change it by, for example, using the min/max value of the field within the feature rather than the threshold?

I think the best way to approach this will be to merge the 3D changes as is, then fix this issue afterwards. I'll have a look into how best to do this, it looks like KDTree's query_pairs method will be the best way to go

w-k-jones commented 1 year ago

Now that the quick addition of the target option is mostly done in #244 , I've started looking into implementing a KDTree approach to resolve the rest of this issue. Two further issues have come to light while doing this:

  1. Currently filter_min_distance solely uses cartesian distance based off of hdim_1, hdim_2 and zdim and the dxy and dz values. Should we switch to using the projection_x_coordinate and projection_y_coordinate by default (removing the need to pass dxy), and should we also allow the use of lat/lon coordinates and distances instead of cartesian?

  2. @freemansw1 @galexsky has filter_min_distance been modified for periodic boundary conditions in the v1.5.0 release? If not I think this is possible with the KDTree approach, but will have to look into it

freemansw1 commented 1 year ago

I've moved this to the v1.5 milestone and will close when #249 is merged.

freemansw1 commented 1 year ago

Closing this now that #249 is merged.