pwollstadt / IDTxl

The Information Dynamics Toolkit xl (IDTxl) is a comprehensive software package for efficient inference of networks and their node dynamics from multivariate time series data using information theory.
http://pwollstadt.github.io/IDTxl/
GNU General Public License v3.0

Binary transfer entropy overflow error #65

Closed thosvarley closed 3 years ago

thosvarley commented 3 years ago

I am attempting to use the BivariateTE().analyse_single_target() function and it keeps throwing the following error. The dataset is large, but it works fine with the MultivariateTE().analyse_single_target() function. The data are binary neuron recordings with approximately 1,849,957 time bins and 100-150 processes.
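For context, a minimal sketch of the kind of setup involved (the estimator name and settings keys are the standard IDTxl ones, but the values here are illustrative, not the exact script):

```python
# Minimal sketch of a bivariate TE analysis on binary data (illustrative
# sizes; the real data set had ~1,849,957 bins and 100-150 processes).
import numpy as np
from idtxl.data import Data
from idtxl.bivariate_te import BivariateTE

raster = np.random.randint(0, 2, size=(10, 5000))     # processes x samples
data = Data(raster, dim_order='ps', normalise=False)  # keep binary values as-is

settings = {
    'cmi_estimator': 'JidtDiscreteCMI',  # discrete estimator for binary data
    'min_lag_sources': 1,
    'max_lag_sources': 5,
}

inference = BivariateTE()
results = inference.analyse_single_target(settings, data, target=0)
```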

The full error:

```
Traceback (most recent call last):
  File "bte.py", line 55, in <module>
    results = inference.analyse_single_target(settings, data, target=target)
  File "/N/u/tvarley/Carbonate/.conda/envs/python3.7/lib/python3.8/site-packages/idtxl/bivariate_te.py", line 281, in analyse_single_target
    self._test_final_conditional(data)
  File "/N/u/tvarley/Carbonate/.conda/envs/python3.7/lib/python3.8/site-packages/idtxl/network_inference.py", line 753, in _test_final_conditional
    [s, p, stat] = stats.omnibus_test(self, data)
  File "/N/u/tvarley/Carbonate/.conda/envs/python3.7/lib/python3.8/site-packages/idtxl/stats.py", line 355, in omnibus_test
    statistic = analysis_setup._cmi_estimator.estimate(
  File "/N/u/tvarley/Carbonate/.conda/envs/python3.7/lib/python3.8/site-packages/idtxl/estimators_jidt.py", line 588, in estimate
    var1 = utils.combine_discrete_dimensions(var1, self.settings['alph1'])
  File "/N/u/tvarley/Carbonate/.conda/envs/python3.7/lib/python3.8/site-packages/idtxl/idtxl_utils.py", line 283, in combine_discrete_dimensions
    combined_values[t] = int(combined_value)
OverflowError: Python int too large to convert to C long
```
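The mechanism behind the overflow is worth spelling out: combine_discrete_dimensions collapses the joint state of all selected (here binary) variables into a single integer symbol, so symbol values grow as 2^d in the number of combined dimensions d and eventually exceed what a C long can hold. A standalone illustration (not IDTxl code):

```python
# Combining d binary dimensions into one symbol yields values up to
# 2**d - 1; these stop fitting in a 64-bit C long once d exceeds 63.
import numpy as np

for d in (28, 63, 64, 100):
    max_symbol = 2**d - 1
    fits = max_symbol <= np.iinfo(np.int64).max
    print(f'd={d:3d}: max combined symbol {max_symbol} fits in int64: {fits}')
```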

jlizier commented 3 years ago

Hi Thomas,

So far as I can tell here, what's happening is that the bivariate TE algorithm is selecting many more sources than multivariate, and so then when the final omnibus test runs (which checks statistical significance of the TE from the whole set of sources to the target), it's dealing with an overwhelmingly large number of sources and freaks out. This wouldn't occur with the multivariate TE for two reasons. First, by conditioning on other selected sources, it's applying an automatic statistical brake, so it just will not select enough sources to get to such a large set. Second, even if it did, it would realise in the selection process already that it was overwhelmed, and just stop selecting sources at that point, so the omnibus test could still run ok for it.

In terms of what to do here, I want to raise this with @pwollstadt @mwibral @LNov -- to be honest I don't remember our logic for running an omnibus test for the bivariate TE. When the point of the algorithm is returning pairwise results, what is the reason we're adding a multivariate test here? I'm assuming it's simply for consistency with the mTE algorithm, but (a) the user probably isn't too interested in the collective result if they wanted pairwise anyway, and (b) if it fails it looks like it then returns no sources, which is certainly not what we want for the bivariate TE. Have I missed something here?

In the interim, my gut feel, Thomas, is that you can get the bivariate_te to skip the call to _test_final_conditional() (and set self.statistic_omnibus, self.pvalue_omnibus and self.sign_omnibus to null instead), but I'd like to see what Patricia/Michael/Leo think here.

--joe

mwibral commented 3 years ago

Thinking about it, it's very likely in general that a bivariate algorithm would select many more sources than a multivariate one. So this could very well be what happened, but importantly it also shows how a multivariate algorithm is superior in terms of focused results. Just like it was shown in the latest Novelli & Lizier paper :-).

Michael


thosvarley commented 3 years ago

Thanks @jlizier and @mwibral, what you say makes a lot of sense. In terms of avoiding the call to the omnibus test, would it just be a matter of setting certain keys in the settings dictionary to None or 0, or do I need to go into the guts of my local IDTxl installation?

@mwibral - wouldn't the bTE only infer more edges if the system was dominated by redundant rather than synergistic information dynamics? (This is actually the question we're trying to answer in this project.)

mwibral commented 3 years ago

Conceptually, the omnibus test is necessary for proper statistical control of false positives across the nodes in your network (each of which becomes, in turn, a target of the info transfer). What we need from this test is the exact p-value, while the test itself practically almost always passes. To switch it off permanently you would have to go into the guts of the installation (the easiest hack might be to rewrite the omnibus test function such that it just returns p = eps, i.e. the smallest value your machine can represent). But be warned: the resulting network then has no correction for multiple comparisons across the various targets in your network, so your results can no longer be interpreted as an analysis of a network.
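A hypothetical sketch of that hack as a monkey-patch, so the installed files don't need editing (untested; the return order significance, p-value, statistic is assumed from the unpacking [s, p, stat] = stats.omnibus_test(self, data) in the traceback above):

```python
# Monkey-patch sketch of the 'p = eps' hack described above (untested).
# WARNING: as noted, this removes the multiple-comparison correction
# across targets, so the result is no longer a corrected network analysis.
import numpy as np
import idtxl.stats

def _omnibus_always_passes(analysis_setup, data):
    # Report 'significant' with the smallest representable p-value,
    # without estimating the joint TE from all selected sources.
    return True, np.finfo(float).eps, np.nan

idtxl.stats.omnibus_test = _omnibus_always_passes
```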

A better idea might be to impose much stricter settings for including sources by changing the critical p-value for the greedy algorithm that searches for sources, e.g. to p < 0.01 or even p < 0.001 or so. This will focus your analysis on the most reliable links in the network and thereby reduce the troublesome number of sources in the omnibus test. It would also make the analysis much faster in many cases.
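In settings terms, that would look something like the following (key names assumed to be the standard IDTxl statistics settings, which default to 0.05; verify against the docs for your version):

```python
# Stricter inclusion thresholds for the greedy candidate selection
# (assumed standard IDTxl settings keys; defaults are 0.05).
settings['alpha_max_stat'] = 0.001  # candidate inclusion (maximum statistic)
settings['alpha_min_stat'] = 0.001  # pruning of selected candidates
settings['alpha_max_seq'] = 0.001   # sequential max stats on the final set
```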

Michael


thosvarley commented 3 years ago

> A better idea might be to impose much stricter settings for including sources by changing the critical p-value for the greedy algorithm that searches for sources, e.g. to p < 0.01 or even p < 0.001 or so. This will focus your analysis on the most reliable links in the network and thereby reduce the troublesome number of sources in the omnibus test. It would also make the analysis much faster in many cases.

Sorry @mwibral I'm a little confused - for the binary transfer entropy, I didn't think there was any greedy algorithm?

mwibral commented 3 years ago

Ah, sorry, I might have mistaken binary (i.e. two-valued variables) for bivariate (analysis in pairs). If it's the latter, my recommendation holds.

Best,

Michael


mwibral commented 3 years ago

Looking at the original post, I see that it is indeed bivariate TE (i.e. a pairwise analysis, possibly on binary variables, but that's not the crucial point). In that case, my proposed solution still applies.

jlizier commented 3 years ago

Thanks for adding the clarifications @mwibral.

I don't agree that the (latter) suggestion to turn down the p-value for selecting sources is the right approach though. It will avoid the crash @thosvarley is seeing if the p-value is turned down low enough to only select a few sources as you say, but that means we're forcing the user to change the question they want to ask. They should in principle be able to ask for the set of sources selected in a bivariate manner at whatever p-value they're interested in; it's certainly achievable, and the only reason it crashes here is because of the omnibus test, which, as above, I don't even think should be applied in the bivariate case. To restate my argument, whilst it's true that the omnibus test almost always passes for the bivariate measure, the point I was making is that a failure (or a crash for that matter) is meaningless, because a failure is about the multivariate set when the user has only asked about bivariate relationships - by running and changing the return values on the basis of the omnibus test we are answering a different question to what the user wants to ask in the bivariate case (see footnote below). (Don't get me wrong, the omnibus test is great for the reasons you state above, but I think that's only applicable when we are making a multivariate greedy selection.)

On these grounds, I would like to suggest that we turn off the omnibus test permanently for the bivariate measures. To achieve this, I think we would simply skip the call to _test_final_conditional() in the bivariate algorithms and set self.statistic_omnibus, self.pvalue_omnibus and self.sign_omnibus to null, as I suggested above.

Have I missed any good reasons to keep the omnibus test for bivariate? I see that maybe the user might still be interested in the result; if we want to keep the result for bivariate, then we should return the omnibus stats but not remove all candidate sources if the omnibus test fails or crashes (instead of removing them as we do now). I would support that alternative to killing the omnibus permanently if you prefer it.

One more answer to @thosvarley -

@mwibral - wouldn't the bTE only infer more edges if the system was dominated by redundant rather than synergistic information dynamics (this is actually the question we're trying to answer for this project).

Yes-ish: you'll only get extra sources compared to multivariate due to redundant information being there, but it wouldn't have to be a dominant effect necessarily (though it likely is).

Footnote -- I'm not sure I realised beforehand that we were running an omnibus test for the bivariate cases. Though given that we know, as above, that it almost always passes for these, the only real problem with it is when it crashes like this. Indeed, @LNov has just re-run analyses from our recent paper (https://doi.org/10.1162/netn_a_00178) that you referred to above, and confirmed that excluding the omnibus test for the bivariate TE doesn't change the results.

mwibral commented 3 years ago

Good point @jlizier, I had not thought about different users having different questions.

However, I have some worries about dropping the omnibus test: in the multivariate setting the omnibus test is necessary for the nested hierarchical testing, our way of mitigating the multiple comparison problem that plagues network inference in general. Given that for the bivariate analysis we still do multiple comparison correction with a max-stats analysis per target node, but do not provide multiple comparison correction at the level of targets, I still feel you need some correction in order to select between the targets you will include in your reports and those you will not. [Someone correct me if I am wrong here and we indeed do max-stats over all pairs of nodes in one go.]

So at present I don't see how the network-level statistical reasoning would be different when you choose to do a bivariate investigation at the target-node level. You would still need some multiple comparison correction across targets and IMHO that would imply the need to give a p-value to the information flow into a target.

@thosvarley: You'll get more links in the bivariate analysis because of many nodes having similar information (i.e. redundant information) - but there may be many more ways for this to happen than you may think of at present (e.g. cascade effects, common drivers and similar artifacts all lead to there being redundant information across nodes, and thereby more apparent information flows). Also, if most of the true information flows are indeed synergistic, you will by definition not see them in a bivariate analysis.

Best,

Michael


pwollstadt commented 3 years ago

Hi everyone, sorry for the delay and thanks for the input @jlizier and @mwibral.

Answering the questions raised in your posts: @jlizier the Omnibus test was introduced to provide a multiple comparison correction at the level of targets (as @mwibral wrote in the last post). I would also say that in the logic of the hierarchical testing such a correction is required regardless of whether we do multivariate or bivariate inference of links. This correction would be missing without the omnibus test (@mwibral we do one max-stats analysis per target node). So, I would agree with @mwibral here and not generally switch off the omnibus test for the bivariate inference algorithms.

Regarding the original problem reported by @thosvarley, we found another bug that was introduced in the last release, which leads to the erroneous inclusion of irrelevant variables. This affects the multivariate TE as well as the selection of target past variables in bivariate TE estimation. We have uploaded the fix to the development branch; could you try whether that helps with your problem? I will further look into the statistics for the bivariate TE algorithm to see if everything else works as it should.

However, the problem of too-large source sets for the omnibus test may still occur. I also see that the omnibus test typically passes, such that the network inferred in earlier steps is not altered. Should we add an option to switch off the test if needed? This is currently not possible via the settings (@thosvarley, unfortunately, setting the number of permutations will not work and you would indeed have to go into the guts of the toolbox, following @jlizier's pointers).

thosvarley commented 3 years ago

Thanks for all your help. @pwollstadt - I tried using the development branch, but it seems like the omnibus test now throws out everything? There will be 10+ selected variables, and the omnibus test will throw out all of them. The reason this is odd is that I've used the multivariate transfer entropy inference on the exact same data with the exact same settings and found it to be rich with connections. It seems unlikely that all the edges would only be visible under mTE conditions.

Based on what you guys said (that the single links should almost always pass the omnibus test), I'm not entirely sure what's going on here.

I keep seeing the error: "AlgorithmExhaustedError encountered in estimations: Cannot instantiate JIDT CMI discrete estimator with alp1_base = 268435456, alph2_base=2, cond_base = 9. Try re-running increasing Java heap size. Halting omnibus test and setting to not-significant."

It seems like the overflow error is persisting in another form: the reported alph1_base of 268435456 is 2^28, i.e. 28 binary source variables combined into a single joint alphabet. I will just go ahead and kill the omnibus test for now, following @jlizier's suggestion. EDIT: @jlizier I tried following your instructions, but the issue is that the "te" variable defined on line 319 is set by self._test_final_conditional(data), so when I make the changes we talked about, it errors out because it looks for the self.statistic_single_link attribute, which doesn't exist.

pwollstadt commented 3 years ago

Hi Thomas, that still doesn't look right. From the error it seems like the algorithm still selects an extremely high number of source past variables (which you don't get when running multivariate TE). Just to confirm: this happens on the development branch when using BivariateTE() for network inference, but not for MultivariateTE(), and the data are binary?

Could you post the analysis output for the crashing script here? Would it be possible to share a minimal script and data set that reproduces the error?

Thanks, Patricia

thosvarley commented 3 years ago

Thanks @pwollstadt, here is a script and a dataset that should reproduce the bugs. Just run python bte.py and it should start running out of the box. I was using the development-branch code when I ran this.

min_script.tar.gz

pwollstadt commented 3 years ago

Hi Thomas, I had a look at your data and it seems like the problem is really the relatively large network of 134 nodes with a lot of connections that are iteratively detected when running the algorithm. When you use multivariate TE, the inference of sources just stops at some point because the conditioning set becomes too large to find further significant information contributions given the number of samples. This does not happen for bivariate TE, because here each source process is considered independently. Only when the omnibus test is performed does the algorithm crash, because there we try to calculate the full information flow from all sources into the target.

I have added the possibility to skip the omnibus test by setting settings['n_perm_omnibus'] = 0 in a separate feature branch. Could you give this a try and see how it goes? In the meantime, we will discuss whether to make this a permanent feature for bivariate network inference.
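For reference, using the new option would then look like this (a sketch assuming the settings, data and inference objects from the setup earlier in the thread):

```python
# Skip the omnibus test via the new feature-branch setting.
settings['n_perm_omnibus'] = 0
results = inference.analyse_single_target(settings, data, target=target)
```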

thosvarley commented 3 years ago

I uninstalled the old version of IDTxl and downloaded the feature-branch version. When running it, I got the following error:

```
Traceback (most recent call last):
  File "bte_time_resolved.py", line 55, in <module>
    results = inference.analyse_single_target(settings, data, target=target)
  File "/home/thosvarley/.miniconda3/lib/python3.8/site-packages/idtxl/bivariate_te.py", line 292, in analyse_single_target
    self._test_final_conditional(data)
  File "/home/thosvarley/.miniconda3/lib/python3.8/site-packages/idtxl/network_inference.py", line 790, in _test_final_conditional
    [s, p, stat] = stats.omnibus_test(self, data)
  File "/home/thosvarley/.miniconda3/lib/python3.8/site-packages/idtxl/stats.py", line 387, in omnibus_test
    [significance, pvalue] = _find_pvalue(statistic, surr_distribution,
  File "/home/thosvarley/.miniconda3/lib/python3.8/site-packages/idtxl/stats.py", line 1409, in _find_pvalue
    check_n_perm(distribution.shape[0], alpha)
  File "/home/thosvarley/.miniconda3/lib/python3.8/site-packages/idtxl/stats.py", line 1277, in check_n_perm
    if not 1.0 / n_perm < alpha:
ZeroDivisionError: float division by zero
```

thosvarley commented 3 years ago

In the meantime, I've decided to do all the pairwise TEs "manually" by using the sources and target arguments of the analyse_single_target() function. It takes a bit longer, since you have to re-compute the embedding of the source a large number of times, but it works (and happens to be a better replication of the way TEs have been calculated in previous studies anyway).
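A rough sketch of that manual pairwise approach (assuming settings, data and n_processes as set up earlier in the thread):

```python
# One bivariate TE analysis per (source, target) pair, restricting the
# candidate set to a single source each time, so no omnibus test over a
# large joint source set can be triggered.
from idtxl.bivariate_te import BivariateTE

inference = BivariateTE()
pairwise_results = {}
for target in range(n_processes):
    for source in range(n_processes):
        if source == target:
            continue
        pairwise_results[(source, target)] = inference.analyse_single_target(
            settings, data, target=target, sources=[source])
```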

jlizier commented 3 years ago

Hi all - thanks for adding the option for Thomas, Patricia.

Just wanted to add that I am firmly in favour of permanently adding an option to skip the omnibus test for bivariate investigations. I understand the reasons you and Michael have suggested for why one might want to use an omnibus test for bivariate, but as per my previous posts imo they're normally orthogonal to what I would want a bivariate study to do and I suspect for others also (at the very least for replicating previous studies). Having the option to specify it either way looks to be a good solution.

Re the bug triggering Thomas' error two posts above, it's not clear to me what's triggering that; I would have thought Patricia's new code on line 343 of stats.py would have returned from the omnibus_test() method before the error at line 387. I'll leave that to Patricia to comment on the new code ...
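For what it's worth, the early exit I would expect looks something like the sketch below (a guess at the intent, not the actual feature-branch code; _run_permutation_test is a hypothetical stand-in for the existing test body, and the return order matches the unpacking [s, p, stat] seen in the tracebacks above):

```python
# Sketch of the presumed early exit in stats.omnibus_test (hypothetical).
def omnibus_test(analysis_setup, data):
    n_perm = analysis_setup.settings['n_perm_omnibus']
    if n_perm == 0:
        # Skip the test entirely, before any surrogates are generated, so
        # check_n_perm's `1.0 / n_perm` is never evaluated. Treating the
        # skipped test as passed keeps the previously selected sources
        # (actual feature-branch behaviour may differ).
        return True, None, None  # significance, p-value, statistic
    # _run_permutation_test: hypothetical stand-in for the existing body.
    return _run_permutation_test(analysis_setup, data, n_perm)
```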