BUG: dtype mismatch error in vg.retarded_local_clustering()

pik-copan / pyunicorn

Unified Complex Network and Recurrence Analysis Toolbox

http://pik-potsdam.de/~donges/pyunicorn/

Other

195 stars 87 forks source link

BUG: dtype mismatch error in vg.retarded_local_clustering() #163

Closed pucicu closed 1 year ago

pucicu commented 2 years ago

When calling vg.retarded_local_clustering() a dtype mismatch error occurs:

ValueError: Buffer dtype mismatch, expected 'ADJ_t' but got 'short'

(the same for vg.advanced_local_clustering())

lenas95 commented 2 years ago

Hi @pucicu ,

could you please elaborate on which data you used and how your code snippet looks like, so that we can reproduce it?

Greetings

pucicu commented 2 years ago

The analysis here https://zenodo.org/record/5578298 worked in the past, but now pyunicorn now throws the error ValueError: Buffer dtype mismatch, expected 'ADJ_t' but got ‘short’

Simply use the data and analysis from this repository and let me know if it works for you.

Thanks and best wishes Norbert

Am 24.06.2022 um 13:36 schrieb lschmidt @.***>:

Hi @pucicu https://github.com/pucicu ,

could you please elaborate on which data you used and how your code snippet looks like, so that we can reproduce it?

Greetings

— Reply to this email directly, view it on GitHub https://github.com/pik-copan/pyunicorn/issues/163#issuecomment-1165487732, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHNNEGPBTXJN2N4VKD24BTVQWMTPANCNFSM5WB77YAQ. You are receiving this because you were mentioned.

fkuehlein commented 1 year ago

Update:

Problem appears to be within cython modules

error message:

src/pyunicorn/timeseries/visibility_graph.py in retarded_local_clustering(self)
    254         norm = retarded_degree * (retarded_degree - 1) / 2.
    255 
--> 256         _retarded_local_clustering(N, A, norm, retarded_clustering)
    257         return retarded_clustering
    258 

src/pyunicorn/timeseries/_ext/numerics.pyx in pyunicorn.timeseries._ext.numerics._retarded_local_clustering()

ValueError: Buffer dtype mismatch, expected 'ADJ_t' but got 'short'

fkuehlein commented 1 year ago

problem:

pyunicorn/timeseries/visibility_graph.py in self.retarded_local_clustering()

(line 256) error when calling _retarded_local_clustering(N, A, norm, retarded_clustering):
- A has dtype int16 (short in C) instead of expected ADJ_t (~int8 as defined in pyunicorn.core._ext.types)
- norm and retarded_clustering have dtype float64 (double in C) instead of expected FIELD_t (~float 32 as defined in pyunicorn.core._ext.types)
same problem with self.advanced_local_clustering()

quick fix/workaround

pyunicorn/timeseries/visibility_graph.py in self.retarded_local_clustering()

(line 256) instead of _retarded_local_clustering(N, A, norm, retarded_clustering) do _retarded_local_clustering(N, A.astype("int8"), norm.astype("float32"), retarded_clustering.astype("float32"))
(line 277) same for self.advanced_local_clustering()

thorough fix:

_[resolved] dtype of norm and retarded_clustering:_

initialize as np.zeros(self.N, dtype="float32") within VisibilityGraph (lines 219 and 244)

[not resolved] dtype of A:

pyunicorn/timeseries/visibility_graph.py in self.__init__()

(line 99) vg.visibility_relations() returns adjacency matrix A of dtype int8
(line 104) A is then passed to InteractingNetworks.__init__(self, A), which returns self.adjacency of dtype int16

pyunicorn/core/network.py in self.set_adjacency()

(line 482) dtype of adjacency matrix is set to int16 (or int32 if N is bigger)
why is that, when adjacency matrix should be binary?

ntfrgl commented 1 year ago

Thank you for the detailed analysis, @fkuehlein! You have identified two oversights in 3dab5bf89d2e224fc319ddd64aeeecc480f27fba.

The easy thing first: The original types for the arguments norm and retarded_clustering were the deprecated FLOATTYPE=np.float, which should have been translated as DFIELD=np.float64 instead of FIELD=np.float32 in the Cython function signature. I have fixed that now for all such cases in 2b489da9e504dfbdb621848bb82388f510c046ca.

For the argument A, the appropriate solution is to downcast to ADJ_t at the call site, using the helper core._ext.types.to_cy() that exists for this purpose. I will leave that to you, and kindly ask you to add a test that triggers all possible scalar types for the adjacency matrix, in a similar fashion to test_generic.simple_instances(). Please look out for other places where a similar pattern might occur.

There are several layers of causation to this bug, which I document here in hopes that it will inform the resolution and mitigation of other similar bugs:

The offending commit included a large number of type conversions in an attempt to make the Python/Cython interface more coherent. This involved avoiding officially deprecated type aliases, making explicit many implicit type casts, and unifying previously inconsistent type uses throughout the codebase via semantic type aliases.
Because of a significant lack of unit tests, in many cases these changes could not be automatically tested. This requires a systematic effort to extend the code coverage of unit tests to at least trigger all methods published in tutorials and in papers by the core authors.
Furthermore, these changes have so far not been carefully reviewed due to a lack of resources. @fkuehlein, given the code analysis you have already done for this issue, it would be wonderful if you could extend your review to the entire associated change set.
As you explained, the dynamically changing numeric type of sparse adjacency matrices is the specific issue underlying this bug. As far as I can tell, this is a choice from early on in the library's development and is motivated by some operations on sparse matrices, e.g., row-wise or column-wise sums, which would otherwise result in overflow w.r.t. the scalar type holding the binary values of the sparse matrix. Of course, circumventing the overflow problem by upcasting the binary values is a very resource-inefficient and not a failsafe solution. Ideally, the adjacency should always be represented using the smallest possible type and potentially overflowing sparse matrix operations should be rewritten, or at least implemented using only local and temporary upcasts. But this would constitute a significant change to the core of the library and would require extensive testing, reinforcing the point above.
More generally, this issue is a prime example for the general need to either test much more rigorously or to rewrite the library in a strongly typed way. The offending commit attempted to find a compromise with very limited resources, by forcing type casts to fail only at the Python/Cython interface when they don't conform to some "standard" behaviour that was encoded explicitly in the semantic type aliases, under the assumption that no dynamic type changes of significance will occur inside the Cython functions.

fkuehlein commented 1 year ago

Thanks for your help and the detailed additional info, @ntfrgl!

The original issue should be resolved as of 1651a2b. For now, that commit remains on my fork and I will create a pull request as soon as I have cleared at least the rest of VisibilityGraph of similar possible occurrences.

I have two questions remaining in order to proceed on this:

Appropriate solution for these cases

As I have understood, the minimal solution (which you have pursued so far) would be to generally pass arguments arg of a cython module _ext.numerics._mod(arg) with help of _ext.numerics._mod(to_cy(arg, ARGTYPE)) to make sure they are cast to the expected type ARGTYPE (as is done throughout core.Network).

All remaining variables that might be affected by this within VisibilityGraph, however, are already explicitly set to the required dtype on initialization. Namely:
- The empty placeholder for the adjacency matrix A, which is correctly initialized to be of dtype=MASK. I guess I keep that as it is.
- VisibilityGraph.time_series and VisibilityGraph.timings, which are initialized to be of dtype=np.float32. Keep that, or put dtype=FIELD there instead?
Implementation of a respective test

I'm not sure what you mean by "trigger all possible scalar types for the adjacency matrix". self.adjacency is a property of the Network class. So maybe the idea is to minimally initialize all classes that inherit from Network and check what scalar types they return for self.adjacency. Does that make sense? Or else, could you elaborate on that?

ntfrgl commented 1 year ago

Short-term answer

Solution:
- Apart from mistakes like the ones above, the original commit introducing semantic type aliases followed a manual, defensive approach. The goal was to use typed initialisation where it is apparently valid (i.e., the variable is initialised within the current scope and no type-altering methods are applied before the Cython call), and to type cast directly at the call site otherwise.
- This means that most of the work should already be done, but it is expected that a few mistakes not triggered in the current test suite need to be fixed, and that some unnecessary copy/cast operations might be avoidable with a more precise type inspection.
- It would be preferable to replace the remaining non-alias types with the semantic aliases where possible.
Testing:
- In the current bug's case, the problematic type mutation occurs depending on the node count, which means that in order to guard against future instances of similar bugs, the test for this method should be parametrised with several different network sizes to ensure that all possible cases of Cython calls are covered.
- The reliability and reusability of the test suite would be improved if this was done using parametric test fixtures, which exist for such scenarios.
- And btw, if in the process of looking for other similar situations you find it helpful to search for ADJ_t, then the semantic type aliases are fulfilling part of their purpose.

Long-term answer (worth discussing and creating a new issue for)

Solution:
- The above is really an incomplete substitute for a proper type analysis, which has traditionally not been available in the dynamically typed Python language, and is held together by non-mechanised reasoning and incomplete testing. Given recent developments in the Python language and its library ecosystem, a much more reliable and efficient approach would be to use the new NumPy API type annotations. These could be introduced incrementally throughout the codebase, and would allow static type checking of the numerical types. I expect that this will be straightforward for the most part with few exceptions, that it might expose some well hidden bugs, and that it would make the library a lot more sustainable.
Testing:
- In addition to the above extensions of the unit tests, Mypy would execute the static type checking in combination with the plugin for NumPy API types.

fkuehlein commented 1 year ago

Thanks a lot for your help and elaborations.

I fixed the above case accordingly in VisibilityGraph. I also looked through all of pyunicorn.timeseries and found no more conflicting types of the adjacency matrix, they're all safely cast to ADJ already.

I will therefore make a pull request now so the original issue can be fixed.

Then I will look into creating an appropriate test for these cases. Thanks for the hint to the existence of parametric test fixtures! Maybe we can briefly discuss your 'long-term anser' next week with @jdonges' help.

fkuehlein commented 1 year ago

Ok, actually found another dtype=np.uint8 in timeseries._ext.numerics._visibility(). The function in question is only used within test_timeseries.testVisibility() though, which is currently commented out.

Fixed both the test and the cython function with 91836d4 for a start, on a new branch. Still, the test only covers a specific testing method, so test coverage is lacking (let alone parametrisation). I will therefore further look into enhancements.

Putting this here just for documentation.

ntfrgl commented 1 year ago

Let me add to the critical observations in your latest commit message, since it doesn't have an associated issue yet:

_visibility(): The core of this function is a loop with a 1-bit accumulator, and should not be allocating an array and calling several NumPy methods on it.
testVisibility(): A test which essentially replicates the code under test is almost useless.

ntfrgl commented 1 year ago

Resolved by #183.