RQA Problems regarding white_vertline_dist

madarax64 commented 2 years ago

Hello, I recently installed PyUnicorn from Git (i.e using pip3 install git+https://.....). Afterwards, I was able to import the package as expected. It also worked for some datasets I was using. However on particular datasets, I get the following error:

"""
Traceback (most recent call last):
  File "/home/madarax64/.conda/envs/pyrqa/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "comp.py", line 49, in extract_feature
    features.append(rp.average_white_vertlength())
  File "/home/madarax64/.conda/envs/pyrqa/lib/python3.7/site-packages/pyunicorn/timeseries/recurrence_plot.py", line 1420, in average_white_vertlength
    white_vertline = self.white_vertline_dist()
  File "/home/madarax64/.conda/envs/pyrqa/lib/python3.7/site-packages/pyunicorn/timeseries/recurrence_plot.py", line 1381, in white_vertline_dist
    _white_vertline_dist(n_time, white_vertline, R)
  File "src/pyunicorn/timeseries/_ext/numerics.pyx", line 785, in pyunicorn.timeseries._ext.numerics._white_vertline_dist
IndexError: Out of bounds on buffer access (axis 0)
"""

Notably, RQA works on the same dataset when I run it on a Windows machine using PyUnicorn 0.6.1 (there's a prebuilt wheel available from here that works for me), so it seems to be an issue in PyUnicorn 0.7 specifically. Can you kindly let me know how to fix this?

Thanks!

zugnachpankow commented 2 years ago

Hey @madarax64,

I recently joined the pyunicorn team and will try my best to help you. I checked what changed in ._white_vertline_dist since 0.6.1 and the only thing i came across was the ctypedef that before directly happened in timeseries/_ext/numerics and is now imported from core/_ext/types.

Could you maybe create a minimally complete example that reproduces the error and post it (tagging me)?
What do you mean by particular datasets? Maybe it is helpful if you can provide some more info.

madarax64 commented 2 years ago

Hello @MaxBechthold , Thanks for the response. I have attached such an example at the bottom of this post. You'll need to install pyts to replicate it, however. For reference, I install PyUnicorn directly from this Github repo (i.e. by pip3 install git+https://github.com/pik-copan/pyunicorn) and I am running in a Linux environment.

I am working on a classification problem using the RQA measures available from the PyUnicorn library as features to a machine learning classifier. I am using 128 publicly-available datasets (the UCR 2018 dataset archive), which I am sourcing from the PyTs library mentioned above. As you'll see from the example code, it is very easy to load the data for any of these datasets by name. I encounter this issue when I use the ACSF1 dataset from that repository, but not when I use others e.g the Ham dataset, which is what I meant by "particular datasets".

As mentioned previously, I have a PyUnicorn 0.6.1 installation on a Windows machine (using the wheel I linked to it in the first post above), and I do not have this issue in that environment, suggesting that it is a problem either with PyUnicorn 0.7 or the Linux environment.

Thanks

----BEGIN Example----

import numpy as np
from pyts.datasets import fetch_ucr_dataset
from pyunicorn.timeseries.recurrence_plot import RecurrencePlot

def loadData(dataset):
    dataDict = fetch_ucr_dataset(dataset)
    return dataDict['data_train'], dataDict['data_test']

def extract_feature(x):
    x_ = x[~np.isnan(x)]
    features = []
    rp = RecurrencePlot(x_, recurrence_rate=0.1, dim=1, tau=1, metric='supremum',silence_level=2)
    features.append(rp.average_diaglength())
    features.append(rp.average_vertlength())
    features.append(rp.average_white_vertlength()) #this is the offending feature
    features.append(rp.determinism())
    features.append(rp.diag_entropy())
    return features

if __name__ == "__main__":
    srcDatasetName = 'ACSF1' #ACSF1 fails, Ham works
    x_train, x_test  = loadData(srcDatasetName) #load the data from the 2018 UCR TS archive
    extracted_features = []
    for sample in x_train:
        extracted_features.append(extract_feature(sample))`

----End Example----

zugnachpankow commented 2 years ago

Hey @madarax64, thanks for getting back to me. I will try to replicate the problem in the next days and let you know.

madarax64 commented 2 years ago

Sure. Thanks alot!

zugnachpankow commented 2 years ago

Hey @madarax64, so I was actually able to reproduce your error. Unfortunately I was not able to fix it or find a workaround... The only thing that changed for said method was the type conversion between C and python. Restoring the state from version 0.6.1. did not fix the problem though, which leaves me thinking that this might be some interaction problem with cython. Sadly I have not expertise in C and with cython yet to solve this problem.

If 0.6.1 works with those datasets for you I think the best idea is to actually keep using it for this particular sets. Would that work for you for now? I hope we can solve this with further commits in the future.

madarax64 commented 2 years ago

Hi @zugnachpankow , Thanks for the speedy response! I'm not familiar with Cython myself, but I feel you're probably right.

I will try to workaround the issue as you've suggested. Hopefully a future commit will fix it. In the meantime, is it ok if I leave the issue open i.e. so it remains visible? Thanks again!

zugnachpankow commented 2 years ago

Agreed, let's leave this issue open. And thanks for pointing this problem out!

madarax64 commented 2 years ago

Thanks @zugnachpankow !

ntfrgl commented 1 year ago

This appears to be a logical bug of the off-by-one type in the implementation of RecurrencePlot.white_vertline_dist(): The array white_vertline for the return value is initialised with length equal to the size of the recurrence matrix, which does not cover the edge case in timeseries._ext.numerics._white_vertline_dist‎() that an entire column might be white, i.e., k = n_time. In other words, the allocated histogram does not cover the full event space.

This bug has gone unnoticed until the Cython compiler directive boundscheck was activated in 79ded90816f49274ab6ac91febe50e4dad7b710a.

@fkuehlein, could you please write a test exposing this edge case and investigate whether downstream uses of this method, and related ones, would have any issues if the return value had a length increased by one?

fkuehlein commented 8 months ago

Thank you @ntfrgl for specifying this.

You're right about the off-by-one type of the bug, although I found that an index shift would do the job while sparing us any downstream complications of changing the length of white_vertline:

Certainly, the edge case of an all-white column in the RP should be accounted for. Instead, I see no sense in counting lines of length 0 and indeed there are none ever counted, which is why the first histogram entry is practically unused. Therefore, I'd propose to just shift the indexing from vertline[k] to vertline[k-1]. At my first sight, this should induce no downstream complications, and it makes more sense mathematically. Would you agree, or should we ask some RQA pro here first?
I also found that the same problem occurs in other places, which would have to be changed accordingly in order to be consistent:

The same applies to RecurrencePlot.vertline_dist() and its helper methods, producing the same error. The same also applies to RecurrencePlot.diagline_dist(), although no error occurs here as the index will only exceed bounds on the last loop (the main diagonal), where it is not used anymore, making that last loop pretty useless.

Note that the current behavior appears to be in line with the corresponding legacy Weave code (see 4644229, 4eb2478 and 5e54a36), so the problem was not in the Cython port here.

The edge cases can easily be triggered in a parametrized test including thresholds of 0 and a rather high value, producing an all-white or all-black RP respectively. Adding tests for all above methods is overdue anyway, RecurrencePlot's coverage is currently only at about 36%.

fkuehlein commented 8 months ago

... of course, the quicker fix would be to just ensure that the index k will not exceed the bounds of the indexed array.

Still, I think shifting the index as proposed above would be more sensible. But as this issue affects large parts of RecurrencePlot and the index shift would change the behavior of the respective methods a bit, I'll check back with Reik and Jonathan before deciding this.

In the meantime I implemented some tests against this issue, trying a parametrized test fixture this time. Although parametrized fixtures are cool for being reusable over multiple tests, I haven't found a proper way to vary multiple parameters as subsequent pytest.mark.parametrize()-statements would do. If you know one, let me know!

ntfrgl commented 8 months ago

Yes, changing the output semantics of an essential library method requires a discussion with stakeholders, and an update to the documentation where appropriate. Since the currently documented output is supposed to be a "frequency distribution of line lengths", I think that it would be prudent to either include the actual length-0 and length-N frequencies or to note that some of them are not computed. I would personally prefer the former, but it is up to the active users to make the call.

I have added a commit to the linked branch which suggests a possible factorisation of the fixture parametrisation.

fkuehlein commented 8 months ago

Awesome, parametrising fixtures that way is even neater than what I was hoping for. A good start for further improving test coverage, thanks.

Also agreed on the question of line-length frequencies, although I still don't quite understand how a length-0 line would be defined. In my eyes, starting to count at length 1 makes sense, right? Anyway, Reik has already said to look into it, so I'll see his response first.

fkuehlein commented 7 months ago

As noticed in #188, line distributions will also throw an IndexError with non-quadratic CrossRecurrencePlots. While basic RecurrencePlots are always quadratic, CrossRecurrencePlots don't need to be.

Hence, when fixing line distribution measures here, the non-quadratic case should be accounted for in the respective base class methods, or overriding methods must be implemented in the child class. Then, the testing of line distributions and use of test fixtures should be extended to CrossRecurrencePlot and all of test_timeseries.py.

jdonges commented 7 months ago

In my view, lines of length 0 are an ill-defined concept and cannot be counted (because a line of length is not a line). This is why I think an index shift may be a viable solution, when documenting it thoroughly. Lines of length N, while an edge case, can occur and should be accounted for.

@fkuehlein please make sure that statistics that rely on weighted line distributions such as Determinism are then still computed correctly.

On #188: here the non-quadtratic case for CrossRecurrencePlot can be accounted for by overriding the vertline_dist() and diagline_dist() methods.

fkuehlein commented 6 months ago

In the above commits 740f6ca and 00d8773, I implemented the agreed-upon index shift attempting not to corrupt or change the behavior of any downstream methods. I checked for correct results manually, but will try adding quantative assertions to the respective tests before opening a PR.

Among other things, in 14f822d I renamed the Cython line distribution helpers as I found current names to be misleading. Please let me know if I missed something there, or in the other commits.

ntfrgl commented 6 months ago

I obviously agree with the above approach, but let me briefly clarify how I would make sense of length 0.

A black line within an underlying linear grid might be mathematically defined in two ways, which correspond to slightly different automata for recognising them:

Filter out white cells, then form equivalence classes among the remaining black cells (reflexive and transitive closure of the relation neighbours), and finally count the sizes of the equivalence classes.
Form equivalence classes among all cells (reflexive and transitive closure of the relation neighbours which are both black), and then count the number of black cells within each equivalence class.

In the approach (1), length 0 cannot exist, whereas in (2), each white cell is an equivalence class of its own.

fkuehlein commented 6 months ago

Ah ok, I think I get the idea now!

pik-copan / pyunicorn

RQA Problems regarding white_vertline_dist #166