pik-copan / pyunicorn

Unified Complex Network and Recurrence Analysis Toolbox
http://pik-potsdam.de/~donges/pyunicorn/

MAINT: review functionality and use of `CouplingAnalysis.get_nearest_neighbors()` and add a specific test #199

Open fkuehlein opened 1 year ago

fkuehlein commented 1 year ago

CouplingAnalysis.get_nearest_neighbors() is currently only used as a helper method for CouplingAnalysis.mutual_information() and CouplingAnalysis.information_transfer(), and is tested only indirectly through those. After porting the underlying C method to Cython in #195, it seemed sensible to gain more confidence in its correctness by giving it a test of its own.

To create a test fixture and an expected result to compare against, it is essential to understand what the method is actually supposed to do. While trying to work that out, I found that it has redundant loops and variables defined in several places, which at the very least make it hard to read (see my comments here; the code appears to be adapted from a more generally applicable algorithm, but has lost that wider applicability through the adaptations).

I have mostly grasped its functionality by now, but still don't really understand the special role that the above-mentioned calling methods give to the z dimension. Other than that, here is what I have found CouplingAnalysis.get_nearest_neighbors() to be doing at the moment:

given:

  • $X = (x(t), y(t), z(t))$: an array of 3 timeseries with length $T$
  • $d_{xyz}$: an array to indicate where each timeseries is located within the array (depending on each timeseries' dimensions)
  • $k$: number of nearest neighbors to look for

NOTE: the dimension of $X_i$ is 1 for all use cases within CouplingAnalysis, except for $z(t)$, which will be either left empty in mutual_information(), or can be of dimension > 1 in information_transfer()

For all times $t = 1,...,T$:

  • find the $k$ times $t' = 1,...,T$ where in all timeseries $X_i(t')$ is closest to $X_i(t)$,
  • out of all those $k$ times $t'$, find the biggest of these distances within any timeseries as $\epsilon_{max}$
  • then, for all timeseries $X_i = x,y,z$:
    • count how many times $t'$ that timeseries itself is within $\epsilon_{max}$ of $X_i(t)$ (might be more (or also less?) often than $k$ times),
      although neighbors $t'$ within $x$ and $y$ are only counted if $z$ has a neighbor at the same time $t'$
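
To make the steps above concrete, here is a minimal brute-force NumPy sketch of the behaviour as I currently understand it. It is not the pyunicorn implementation: the function name `count_neighbors_sketch`, the `(dim, T)` array layout, the `dim_x`/`dim_y` arguments (standing in for $d_{xyz}$), the maximum norm, the strict `<` comparison and the exclusion of the point itself are all assumptions made for illustration.

```python
import numpy as np


def count_neighbors_sketch(array, dim_x, dim_y, k):
    """Hypothetical illustration only, not pyunicorn's implementation.

    array has shape (dim, T): the first dim_x rows are x, the next dim_y
    rows are y, and all remaining rows (possibly none) are z.
    Requires k < T.
    """
    dim, T = array.shape
    rows = np.arange(dim)
    is_x = rows < dim_x
    is_y = (rows >= dim_x) & (rows < dim_x + dim_y)
    is_z = rows >= dim_x + dim_y

    # Per-row absolute differences between all pairs of times: (dim, T, T).
    diff = np.abs(array[:, :, None] - array[:, None, :])

    # Maximum-norm distance in the joint (x, y, z) space.
    dist_joint = diff.max(axis=0)

    # Distance to the k-th nearest neighbor of each time t; after sorting,
    # index 0 is the point itself at distance 0.
    eps_max = np.sort(dist_joint, axis=1)[:, k]

    def count_within(mask):
        # Max-norm distance restricted to the selected rows. With no rows
        # selected (empty z), every time counts as a neighbor.
        d = diff[mask].max(axis=0) if mask.any() else np.zeros((T, T))
        within = d < eps_max[:, None]      # assumption: strict inequality
        np.fill_diagonal(within, False)    # assumption: exclude t itself
        return within.sum(axis=1)

    # x and y neighbors only count where z is a neighbor as well, so the
    # counts are taken in the (x, z), (y, z) and z subspaces.
    k_xz = count_within(is_x | is_z)
    k_yz = count_within(is_y | is_z)
    k_z = count_within(is_z)
    return k_xz, k_yz, k_z


rng = np.random.default_rng(0)
data = rng.normal(size=(4, 50))   # 1 row x, 1 row y, 2 rows z
k_xz, k_yz, k_z = count_neighbors_sketch(data, dim_x=1, dim_y=1, k=5)
```

Under these assumptions the subspace counts can be both larger and smaller than $k$: on tie-free data each is at least $k - 1$, and `k_z` is never smaller than `k_xz` or `k_yz`. A direct test of the real method could compare against a brute-force reference of this kind once the intended conventions are confirmed.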

Still not sure if that's what it is supposed to be doing though. Probably the referenced papers Kraskov (2004) and Runge (2012b) should be consulted.
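
For orientation, these are the estimators I believe those papers build on, written from memory and still to be checked against the references. The symbols $n_x$, $n_y$, $k_{xz}$, $k_{yz}$, $k_z$ are my own labels for the per-time neighbor counts, assumed to count points strictly within $\epsilon_{max}$ and to exclude the point itself; $\psi$ is the digamma function and $\langle \cdot \rangle$ the average over all times $t$.

$$
I(X;Y) \approx \psi(k) + \psi(T) - \left\langle \psi(n_x + 1) + \psi(n_y + 1) \right\rangle
$$

$$
I(X;Y \mid Z) \approx \psi(k) + \left\langle \psi(k_z + 1) - \psi(k_{xz} + 1) - \psi(k_{yz} + 1) \right\rangle
$$

If this is right, it would explain the special role of $z$: it is the conditioning variable, so the $x$ and $y$ counts are taken jointly with $z$. With an empty $z$, as in mutual_information(), every time is a $z$-neighbor, $k_z = T - 1$, and the second expression reduces to the first.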