pwollstadt / IDTxl

The Information Dynamics Toolkit xl (IDTxl) is a comprehensive software package for efficient inference of networks and their node dynamics from multivariate time series data using information theory.
http://pwollstadt.github.io/IDTxl/
GNU General Public License v3.0

Stuck with "source must be an integer numpy array" bug in PID #70

Closed aleksejs-fomins closed 3 years ago

aleksejs-fomins commented 3 years ago

When running the MultivariatePID estimator, I get the following error:

File "/home/alyosha/Downloads/IDTxl/idtxl/estimators_multivariate_pid.py", line 167, in _check_input
    raise TypeError('Input s{0} (source {0}) must be an integer numpy '
TypeError: Input s1 (source 1) must be an integer numpy array.

Sadly, I am unable to replicate the bug with a minimal example. Here is a minimal example of what I am trying to do:

# Import classes
import numpy as np
import pandas as pd
from idtxl.multivariate_pid import MultivariatePID
from idtxl.data import Data

dataNumpy = np.random.randint(0, 4, (3967, 27, 1))
data = Data(dataNumpy, 'rps', normalise=False)

pid = MultivariatePID()
settings_SxPID = {'pid_estimator': 'SxPID', 'lags_pid': [0, 0, 0]}
results_SxPID = pid.analyse_single_target(settings=settings_SxPID, data=data, target=3, sources=(0, 1, 2))
results_SxPID.get_single_target(3)['avg']

I have so far:

What am I missing?

Abzinger commented 3 years ago

Just to check that we are on the same page:

dataNumpy = np.random.randint(0, 4, (3967, 27, 1))
data = Data(dataNumpy, 'rps', normalise=False)

'rps' means that you have 3967 replications, 27 processes, and 1 sample. I'm not sure whether you meant to have one sample per replication.
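To make the axis mapping concrete, here is a small NumPy-only sketch. The helper `describe_dims` is hypothetical (it is not part of IDTxl); it simply pairs each letter of the dimension-order string with the matching axis length, assuming 'r' stands for replications, 'p' for processes, and 's' for samples, as in the 'rps' order used in the script.

```python
import numpy as np

# Hypothetical helper for illustration only; not an IDTxl function.
def describe_dims(array, dim_order):
    names = {'r': 'replications', 'p': 'processes', 's': 'samples'}
    if len(dim_order) != array.ndim:
        raise ValueError('dim_order must have one letter per array axis')
    # Pair each axis length with the name its letter stands for
    return {names[letter]: size for letter, size in zip(dim_order, array.shape)}

data_numpy = np.random.randint(0, 4, (3967, 27, 1))
dims = describe_dims(data_numpy, 'rps')
print(dims)
```

With this shape and order, the last axis (samples) has length 1, which is what the "1 samples" message in the printout reflects.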

Could it be that _check_input is complaining because there is only a single point in the samples? Not sure, just spitballing ;)

P.S. I tried your script and it worked. See the printout below (I replaced the last line with print(results_SxPID.get_single_target(3)['avg'])).

Adding data with properties: 27 processes, 1 samples, 3967 replications
overwriting existing data
{((1,),): (0.2476656862520551, 0.24607925544106224, 0.0015864308109929143), ((2,),): (0.24793261350356333, 0.24779678052787935, 0.00013583297568396408), ((3,),): (0.247689225543444, 0.2462293130099496, 0.001459912533494354), ((1, 2),): (0.29220754752122313, 0.28964287621002227, 0.002564671311200752), ((1, 3),): (0.29209258144036937, 0.2891251698786944, 0.0029674115616751106), ((2, 3),): (0.2917557182351069, 0.2888069078560238, 0.0029488103790827687), ((1, 2, 3),): (0.8932421388524286, 0.874174272951998, 0.01906786590043029), ((1,), (2,)): (0.4022225644074161, 0.40230874270205663, -8.617829464024637e-05), ((1,), (3,)): (0.4022572799304044, 0.4028430589511877, -0.0005857790207830953), ((1,), (2, 3)): (0.1572956243376338, 0.15752912133024555, -0.00023349699261174833), ((2,), (3,)): (0.40155467460702093, 0.4010003414904268, 0.0005543331165942527), ((2,), (1, 3)): (0.15720685212840368, 0.15707541183717216, 0.00013144029123148703), ((3,), (1, 2)): (0.15724848955315024, 0.15740901803317575, -0.00016052848002539878), ((1, 2), (1, 3)): (0.26591816740297625, 0.2640435177112699, 0.001874649691706341), ((1, 2), (2, 3)): (0.26663439000495925, 0.2670846887533893, -0.0004502987484300561), ((1, 3), (2, 3)): (0.2663380259184582, 0.26493835426047396, 0.0013996716579842122), ((1,), (2,), (3,)): (0.7905066073476456, 0.7904319828757621, 7.462447188370683e-05), ((1, 2), (1, 3), (2, 3)): (0.20905590444451516, 0.20953758316873955, -0.00048167872422443805)}

aleksejs-fomins commented 3 years ago

Yes, 1 sample is intended. Perhaps it's clumsy; I should just delete that dimension and work with 'rp' instead of 'rps'.

Yes, the minimal example is working, that is what I wrote. It is my best effort to construct a minimal example at the moment. I would be happy to provide you with a minimal example that does not work, but to the best of my understanding this is the minimal example and I don't understand why it works and my code does not.

If possible, could you please explain what exactly line 167 in estimators_multivariate_pid.py is checking for, and how that test can fail if the input is a numpy array of type int64?

mwibral commented 3 years ago

Hi Aleksejs,

1 sample would mean that per experiment you only get a single time point, but you have run 3000+ experiments or epochs. Is that really correct?

Michael

On 08.07.21 11:06, Aleksejs Fomins wrote:

1 sample is intended

Abzinger commented 3 years ago

I honestly need more context. Line 167 simply checks (as you understood correctly) whether the dtype of each source (i.e. process) is a numpy integer type.

Your verification might not be enough, since it is too far from the if statement and something in between probably went wrong. I would debug step by step (maybe with print debugging too). To be crystal clear: right before the if statement, print issubclass(s[i].dtype.type, np.integer) and s[i].dtype.type separately.

Otherwise you need to give more info so that I can help you. The only thing I can think of is that the data is not appropriately assigned, but this is just speculation.
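For reference, the kind of check being described can be reproduced outside IDTxl with plain NumPy. This is a minimal sketch, not IDTxl's own code: `is_integer_array` is a hypothetical name, and only the `issubclass(..., np.integer)` expression is taken from the discussion above.

```python
import numpy as np

# Hypothetical wrapper around the expression quoted above; not IDTxl code.
def is_integer_array(array):
    # True only if the array's scalar type derives from np.integer
    return issubclass(array.dtype.type, np.integer)

ints = np.random.randint(0, 4, 100)   # integer dtype
floats = ints.astype(float)           # same values, float64 dtype

for name, arr in [('ints', ints), ('floats', floats)]:
    print(name, arr.dtype.type, is_integer_array(arr))
```

Printing both the scalar type and the result of the check, as suggested, shows at a glance whether an array has been converted to a float type somewhere along the way.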

aleksejs-fomins commented 3 years ago

Michael: Yes, that is exactly what I mean. For example: I have a short reward phase (part of the trial where a mouse gets or does not get a water reward), e.g. 10 timesteps. During this phase, calcium signal is highly autocorrelated. So, instead of using 10 timesteps, I average them out, and now I have only 1 timestep, but well-behaved. Now, the mouse does the trial 3000+ times, hence the shape of the input.
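The averaging step described here can be sketched with NumPy as follows. The raw data below is random and purely illustrative, and the quantile binning into four levels is an assumption added for the example: the thread does not say how the averaged signal was discretised, only that the array passed to IDTxl was integer-valued.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for the recordings: 3967 trials, 27 channels, 10 timesteps
raw = rng.normal(size=(3967, 27, 10))

# Average over the time axis but keep it, so the result still has an 'rps' layout
averaged = raw.mean(axis=2, keepdims=True)
print(averaged.shape)  # (3967, 27, 1)

# ASSUMPTION: discretise into 4 bins at the quartiles to obtain integer data
edges = np.quantile(averaged, [0.25, 0.5, 0.75])
binned = np.digitize(averaged, edges)
print(binned.dtype, binned.min(), binned.max())
```

Note that the averaged array is floating point, so some discretisation step of this kind is needed before a discrete estimator will accept it.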

aleksejs-fomins commented 3 years ago

@Abzinger Here is a part of my actual code

    print(settings)
    print("Check1", dataEff.shape, dataEff.dtype, src, trg)
    print('Check2', issubclass(dataEff.dtype.type, np.integer))
    print('Check3', [issubclass(dataEff[:, i].dtype.type, np.integer) for i in src])
    print('Check4', issubclass(dataEff[:, trg].dtype.type, np.integer))

    dataIDTxl = Data(dataEff, dim_order='rps')
    pid = MultivariatePID()

    rez = pid.analyse_single_target(settings=settings, data=dataIDTxl, target=trg, sources=src)

And the output below. To the best of my understanding, I have checked that the data is of type numpy integer as you have suggested, and it indeed is. I am a bit puzzled about how to continue debugging this.

{'pid_estimator': 'SxPID', 'lags_pid': [0, 0, 0]}
Check1 (3967, 27, 1) int64 [0, 1, 2] 3
Check2 True
Check3 [True, True, True]
Check4 True
Adding data with properties: 27 processes, 1 samples, 3967 replications
overwriting existing data

Traceback (most recent call last):
  File "/home/alyosha/work/git/pub-2020-exploratory-analysis/analysis-gallerosalas-raw/extern/multiscale-pid-joint4D.py", line 38, in <module>
    pid_multiprocess_mouse(dataDB, mc, h5outname, argSweepDict, exclQueryLst, metric='MultivariatePID',
  File "/home/alyosha/work/git/pub-2020-exploratory-analysis/lib/analysis/pid_multiprocess.py", line 105, in pid_multiprocess_mouse
    rezIdxs, rezVals = pid.pid(dataLst, mc, metric=metric, dim=dim, nBin=nBin,
  File "/home/alyosha/work/git/pub-2020-exploratory-analysis/lib/analysis/pid_common.py", line 88, in pid
    rez = mc.metric3D(metricName, '',
  File "/home/alyosha/work/git/mesostat-dev/mesostat/metric/metric.py", line 265, in metric3D
    rez = sweepGen.unpack(self.mapper.mapMultiArg(wrappedFunc, sweepGen.iterator()))
  File "/home/alyosha/work/git/mesostat-dev/mesostat/utils/parallel.py", line 43, in mapMultiArg
    return self.map(f_proxy, x)
  File "/home/alyosha/work/git/mesostat-dev/mesostat/utils/parallel.py", line 33, in map
    rez = self.map_func(f, x)
  File "/home/alyosha/work/git/mesostat-dev/mesostat/utils/parallel.py", line 15, in <lambda>
    self.map_func = lambda f,x: list(map(f, x))
  File "/home/alyosha/work/git/mesostat-dev/mesostat/utils/parallel.py", line 42, in <lambda>
    f_proxy = lambda task: f(*task)
  File "/home/alyosha/work/git/mesostat-dev/mesostat/metric/metric.py", line 262, in <lambda>
    wrappedFunc = lambda data, settings: metricFunc(data, {**settings, **metricSettings})
  File "/home/alyosha/work/git/mesostat-dev/mesostat/utils/decorators.py", line 63, in inner
    rez = func(*args, **kwargs)
  File "/home/alyosha/work/git/mesostat-dev/mesostat/metric/idtxl_pid.py", line 116, in multivariate_pid_4D
    rez = pid.analyse_single_target(settings=settings['settings_estimator'], data=dataIDTxl, target=trg, sources=src)
  File "/home/alyosha/Downloads/IDTxl/idtxl/multivariate_pid.py", line 200, in analyse_single_target
    self._calculate_pid(data)
  File "/home/alyosha/Downloads/IDTxl/idtxl/multivariate_pid.py", line 281, in _calculate_pid
    orig_pid = self._pid_estimator.estimate(
  File "/home/alyosha/Downloads/IDTxl/idtxl/estimators_multivariate_pid.py", line 78, in estimate
    s, t, self.settings = _check_input(s, t, self.settings)
  File "/home/alyosha/Downloads/IDTxl/idtxl/estimators_multivariate_pid.py", line 167, in _check_input
    raise TypeError('Input s{0} (source {0}) must be an integer numpy '
TypeError: Input s1 (source 1) must be an integer numpy array.

Process finished with exit code 1

aleksejs-fomins commented 3 years ago

OK, I have saved the variable dataEff into a file, then tried running the minimal example by loading this variable, and it works. So the code with exactly the same inputs works separately from my code, but does not work inside it. While I still don't know what the cause is, I suspect that it has nothing to do with IDTxl. It's probably one of those glitches where Python somehow reports the wrong error. I am sorry to trouble you.

Abzinger commented 3 years ago

Good that it works now. No worries, I think we will close this issue now :)

pwollstadt commented 3 years ago

Thanks everyone for clearing this up! I will close the issue.

aleksejs-fomins commented 3 years ago

Dear all,

After further testing, I have been able to construct a minimal example of a bug, and now I suspect it is indeed a bug of IDTxl.

Here is a minimal example that WORKS:

import numpy as np
from idtxl.multivariate_pid import MultivariatePID
from idtxl.data import Data

dataOrig = np.random.randint(0, 4, (1209, 27, 1))

data = Data(dataOrig, 'rps', normalise=False)

pid = MultivariatePID()
settings_SxPID = {'pid_estimator': 'SxPID', 'lags_pid': [0, 0, 0]}
results_SxPID = pid.analyse_single_target(settings=settings_SxPID, data=data, target=3, sources=(0, 1, 2))
print(results_SxPID.get_single_target(3)['avg'])

Here is a minimal example that DOES NOT WORK

import numpy as np
from idtxl.multivariate_pid import MultivariatePID
from idtxl.data import Data

def myfunction(data, settings):
    dataIDTxl = Data(data, dim_order='rps')
    pid = MultivariatePID()
    rez = pid.analyse_single_target(settings=settings,
                                    data=dataIDTxl, target=3, sources=(0,1,2))

    return rez.get_single_target(3)['avg']

dataOrig = np.random.randint(0, 4, (1209, 27, 1))
settings = {'pid_estimator': 'SxPID', 'lags_pid': [0, 0, 0]}

print(myfunction(dataOrig, settings))

For whatever reason, the MultivariatePID class seems to misbehave when wrapped in a function. Any advice is appreciated.

Abzinger commented 3 years ago

Hi Aleksejs,

This is a problem with the data within myfunction, and not because it is wrapped in a function.

I tried the example that doesn't work and indeed it didn't work. However, I changed the first line in myfunction from dataIDTxl = Data(data, dim_order='rps') to dataIDTxl = Data(data, 'rps', normalise=False). Then it worked. The problem stems from not setting the normalise option to False. The normalise option is True by default, i.e. the data is z-transformed and its type is converted to float.

Tbh I don't recall why the sources and targets are chosen to be integers but maybe @pwollstadt remembers why :grimacing:

Cheers, Abed
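The dtype change can be seen with NumPy alone. This is only a sketch: the z-score below is written by hand as a stand-in for normalisation and is not IDTxl's implementation (in particular, the axis it normalises over is an assumption), but any z-transform of integer data produces floats in the same way.

```python
import numpy as np

data = np.random.randint(0, 4, (1209, 27, 1))

# Hand-written z-score over the replication axis (stand-in, not IDTxl's code)
z = (data - data.mean(axis=0)) / data.std(axis=0)

# The raw array passes the integer check; the z-scored one does not
print(data.dtype, issubclass(data.dtype.type, np.integer))
print(z.dtype, issubclass(z.dtype.type, np.integer))
```

This matches the symptom in the thread: the array is an integer array when it is checked before Data(...) is called, but no longer an integer array by the time the estimator's input check runs.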

aleksejs-fomins commented 3 years ago

Uff, I'm blind, I totally missed that there was a difference in normalization parameter. Thanks so much for checking. I'll try to run again and get back to you

mwibral commented 3 years ago

Hi all,

maybe we should make the normalise parameter mandatory when creating the Data object, since normalising does not make sense at all if one wants to analyse discrete data. I also totally missed that.

Michael
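One way such a requirement could look is sketched below. Everything here is hypothetical: `check_normalise_choice` is not an IDTxl function, and the thread does not say how (or whether) the suggestion was implemented. The sketch uses a keyword-only argument without a default, so the caller has to state the choice, and it rejects normalisation of integer data.

```python
import numpy as np

# Hypothetical guard for illustration only; not part of IDTxl.
def check_normalise_choice(array, *, normalise):
    """Force an explicit normalise choice and reject z-scoring of integer data."""
    if normalise and issubclass(array.dtype.type, np.integer):
        raise ValueError(
            'normalise=True would turn integer (discrete) data into floats; '
            'pass normalise=False for discrete estimators')
    return normalise

discrete = np.random.randint(0, 4, (1209, 27, 1))

print(check_normalise_choice(discrete, normalise=False))  # explicit choice is accepted

try:
    check_normalise_choice(discrete, normalise=True)
except ValueError as err:
    print('rejected:', err)
```

Calling the guard without the normalise keyword raises a TypeError, which is the "mandatory" part of the idea.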
