mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License

Infomax yields different results depending on Python distribution #3049

Closed cbrnr closed 8 years ago

cbrnr commented 8 years ago

In https://github.com/scot-dev/scot/issues/160, I noticed that I get different ICs depending on whether I run my script on Anaconda Python or Python installed via Homebrew. In addition, Homebrew Python gives different values when run from a script as compared to running interactively.

Since this problem occurred with real EEG data, it was a bit difficult to create a toy example that reproduces this issue. Here's what I came up with:

import numpy as np
from mne.preprocessing.infomax_ import infomax

np.random.seed(1)
data = np.random.randn(10000, 64)
data[1000:1010, 10] *= 10000
w = infomax(data, random_state=1)

Just like in #3047, the problem appears only with standard Infomax (extended=False), not with the extended version (extended=True) - although these two issues might or might not be related.

When I run the example in Anaconda Python 3.5.1 and then in Homebrew Python 3.5.1, I get a maximum difference in w of around 0.006, which is way too large to dismiss as numerical inaccuracy. When I run the same script interactively from Homebrew IPython via %run, I also get a difference of 0.006 compared to running it as a script with Homebrew Python. This script-vs.-interactive difference does not exist in Anaconda Python (there the difference is exactly 0.0).
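
For these comparisons I simply looked at the maximum absolute difference between the unmixing matrices returned in each environment; a minimal sketch (the .npy file names are hypothetical, with w saved via np.save() in each run):

import numpy as np

# w saved beforehand in each environment
w_conda = np.load('w_anaconda.npy')
w_brew = np.load('w_homebrew.npy')

# maximum absolute elementwise difference between the unmixing matrices
print(np.max(np.abs(w_conda - w_brew)))  # ~0.006 here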

In my real EEG data, components not only change in their order, but also in their values (as seen on topoplots for example).

When I set extended=True, the problem disappears and the maximum difference is 4.34e-18 (so essentially zero).

larsoner commented 8 years ago

I noticed that I get different ICs depending on whether I run my script on Anaconda Python or Python installed via Homebrew. In addition, Homebrew Python gives different values when run from a script as compared to running interactively.

I was going to ask if you were setting the random seed, but it seems you are. Moving on...

My guess is that the differences are driven by small differences in numerical error between different implementations. Anaconda should be using MKL, other builds might be using OpenBLAS or even a naive implementation of the linear algebra functions, and these could all have slightly different numerical errors. If the Infomax algorithm does not deal with these errors properly, then it would give rise to different behaviors.
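
As a quick sanity check, you can inspect which backend a given NumPy build actually links against; a minimal sketch:

import numpy as np

# prints the BLAS/LAPACK configuration NumPy was built with
# (MKL, OpenBLAS, Accelerate, or a reference implementation)
np.show_config()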

To test, what I would do is use debugging to dig into individual steps in two installations in parallel. It would be a pain to set up, but it would allow you to definitively say where the differences arise.

larsoner commented 8 years ago

And extended=True behaving consistently suggests to me that the extended algorithm is more robust or better implemented than the non-extended version. But I know very little about ICA, so that's about all I can suggest :(

cbrnr commented 8 years ago

My guess is that the differences are driven by small differences in numerical error between different implementations.

This is certainly possible. However, why there's a difference within the same Python distribution (installed via Homebrew) depending on whether I run a script or I run the very same code interactively completely eludes me at this point.

To test, what I would do is use debugging to dig into individual steps in two installations in parallel. It would be a pain to set up, but it would allow you to definitively say where the differences arise.

Right, it was already a pain to come up with the toy example. I guess I'll wait and see how this discussion develops before I do something that time-consuming.

Point is, the current standard Infomax implementation seems to be problematic, and I'd be more than happy to help solve this issue (but I can't do it on my own).

larsoner commented 8 years ago

depending on whether I run a script or I run the very same code interactively

I missed this bit originally. You mean the only difference is python vs python -i, or by "run interactively" you mean e.g. launch from Spyder?

cbrnr commented 8 years ago

I missed this bit originally. You mean the only difference is python vs python -i, or by "run interactively" you mean e.g. launch from Spyder?

I run the script in four different ways:

  1. Anaconda 3.5.1 via python script.py
  2. Homebrew 3.5.1 via python script.py
  3. Anaconda 3.5.1 interactively (%run script.py from IPython)
  4. Homebrew 3.5.1 interactively (%run script.py from IPython)

1 and 3 yield identical results. 2 differs from 1 and interestingly also from 4.

larsoner commented 8 years ago

One thing you could test is whether Anaconda with MKL (conda install numpy scipy mkl) versus without (conda install numpy scipy nomkl) gives equivalent results. You could do each in a virtual environment (conda create -n py27nomkl python=2.7; conda install numpy scipy nomkl) to keep them isolated. If the results differ, it also gives you two nice environments to work with.

agramfort commented 8 years ago

numerical changes during the optimization can create big differences in the end.

@jmontoyam will comment soon as he reported this to me in the past.

cbrnr commented 8 years ago

@Eric89GXL yes, maybe MKL is an issue here. I created two Python 3.5 envs, one with mkl, the other with nomkl. The maximum absolute difference between the weights as returned by Infomax is as large as 0.006. Anaconda's nomkl env returns exactly the same weights as Homebrew Python (so it seems both distributions are using OpenBLAS or something).

agramfort commented 8 years ago

if you use the same linear algebra libs (Accelerate on Mac, or MKL) you can expect to get the same numbers. Otherwise there is no way to guarantee this.

cbrnr commented 8 years ago

Yes, but from a user perspective the effect is pretty huge: in real data, components look different and appear in a different order. With the exact same data and Python code, the only difference is the numerical libs used in the background. Don't you think this is an issue here? Since it only occurs for standard Infomax, I guess this algorithm could probably be made more stable. Alternatively, since everyone is using extended Infomax, maybe standard Infomax should be deprecated or at least a warning message should be generated.

agramfort commented 8 years ago

@jmontoyam can you pitch in?

jona-sassenhagen commented 8 years ago

since everyone is using extended Infomax, maybe standard Infomax should be deprecated or at least a warning message should be generated.

This is my preference. "infomax" and "extended-infomax" should become the same thing (with a deprecation warning for "infomax"). If you want, you can still turn off extended via the fit params.

agramfort commented 8 years ago

did you empirically observe gains with the extended version? any performance drawback?

jmontoyam commented 8 years ago

Hi @cbrnr, as @Eric89GXL and @agramfort have mentioned, if we use two different Python envs, one with mkl (anaconda) and one with nomkl (e.g. homebrew), they will use different numerical linear algebra libraries, so it is expected to get "slightly" different solutions. How "slightly" the solutions differ depends strongly on the algorithm and the data used. For instance, when I ran your example I noticed something interesting, namely the following warning message:

RuntimeWarning: overflow encountered in exp
  y = 1.0 / (1.0 + np.exp(-u))

Therefore, I decided to turn on NumPy's ability to print all warning messages (np.seterr(all='print')) and ran your example again. I noticed that infomax raised a lot of "Warning: overflow encountered in exp" messages with this example data. (I would also recommend enabling the verbose=True parameter when you call infomax; this way you will see the logging info across iterations, which makes it easier to compare mkl vs. nomkl performance.)

Under normal circumstances, two different numerical linear algebra libraries behave only "slightly" differently, but when they encounter under/overflow errors they respond to them differently, and these differences start accumulating over the iterations, so the final result differs. To test this assertion, comment out the line "data[1000:1010, 10] *= 10000", run the example again, and compare both matrices: np.max(np.abs(w_nomkl - w_MKL)) = 1.9706458687096529e-15
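
A minimal sketch combining both suggestions (warnings on, verbose logging) on the toy data from above:

import numpy as np
from mne.preprocessing.infomax_ import infomax

np.seterr(all='print')  # print a message on any floating point error

np.random.seed(1)
data = np.random.randn(10000, 64)
data[1000:1010, 10] *= 10000

# verbose=True prints the per-iteration log, which makes it easier to
# see where an mkl run and a nomkl run start to diverge
w = infomax(data, random_state=1, verbose=True)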

The next question is: with this example, why does this behaviour happen with infomax but not with extended-infomax? When I ran your example (without commenting out the line mentioned above) using extended-infomax, there were no overflow warnings, so similar solutions from the mkl and nomkl envs are expected: np.max(np.abs(w - w_MKL)) = 2.2204460492503131e-16

Why did extended-infomax not raise any overflow error? If we take a look at the infomax source code, we can see that the non-linearity of extended-infomax involves the hyperbolic tangent function, which is bounded, whereas the non-linearity of infomax involves the exponential function, which is unbounded and may therefore raise under/overflow errors, which in turn can affect the comparison of two different numerical linear algebra libraries.
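
The difference is easy to demonstrate in isolation (a minimal sketch, plain NumPy rather than the infomax code itself):

import numpy as np

u = np.array([-1000.0, 0.0, 1000.0])

# logistic sigmoid (standard infomax non-linearity): np.exp(-u)
# overflows float64 for large negative u and emits a RuntimeWarning
y_logistic = 1.0 / (1.0 + np.exp(-u))

# hyperbolic tangent (extended infomax non-linearity): bounded in
# [-1, 1], it simply saturates and raises no warning
y_tanh = np.tanh(u)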

In summary, we should be very careful when comparing the same algorithm/data using different numerical linear algebra libraries.

cbrnr commented 8 years ago

@jmontoyam I know that the discrepancy only occurs when some samples are large in comparison to the rest. You're right, this is due to the exponential non-linearity in standard Infomax - thanks for the nice explanation!

However, the problem is not the example - these differences between MKL and noMKL appear in real EEG data. I've only constructed this toy example to show what's going on.

agramfort commented 8 years ago

@jmontoyam I know that the discrepancy only occurs when some samples are large in comparison to the rest. You're right, this is due to the exponential non-linearity in standard Infomax - thanks for the nice explanation!

However, the problem is not the example - these differences between MKL and noMKL appear in real EEG data. I've only constructed this toy example to show what's going on.

yes it will happen. There is no way around it. It's not a bug. What is it you want to fix?

cbrnr commented 8 years ago

@agramfort making extended Infomax the default would fix the problem. In addition, if people manually choose standard Infomax, maybe a warning message could be generated saying that standard Infomax is sensitive to numerical errors. I don't know if deprecating standard Infomax is necessary - if people want to use it, they should be able to do so (but they should be aware of the potential pitfalls).

jmontoyam commented 8 years ago

@cbrnr, as @agramfort mentioned in the previous comment, this is not a bug. In general, if we compare the same algorithm using different numerical linear algebra libraries, we can get "slightly" different solutions; how different depends heavily on the algorithm and the data used in the comparison. Indeed, this behaviour can also happen in extended-infomax with real EEG data. To test this assertion, I ran two experiments: using real EEG data from one subject, I ran the extended-infomax version from eeglab in matlab, once with the standard mkl version shipped with matlab (11.1.1) and once with the newer mkl version shipped with anaconda (11.3.1). In both experiments I used exactly the same parameters, and extended-infomax finished after exactly the same number of iterations. When I compared the unmixing matrices, I got the following absolute difference:

max(max(abs(w_mkl_new - w_mkl_old))) = 0.0033

It is exactly the same matlab code, using the same numerical linear algebra library but a different version of it - why does this happen? Floating point arithmetic is really tricky and curious; in particular, floating point operations are not in general associative. Different numerical linear algebra libraries (or different versions of the same library) can generate slightly different assembly/machine code - for instance, they may reorder operations to optimize cache or register usage. Because floating point operations are not associative, for some datasets this introduces slight differences in the result of each iteration, and these differences accumulate over the whole execution, so the final result differs.
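
The classic demonstration of this (a minimal sketch, independent of any particular library):

import numpy as np

# floating point addition is commutative but not associative, so two
# libraries that merely group the same sums differently can diverge
a, b, c = np.float64(1e16), np.float64(-1e16), np.float64(1.0)
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0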

larsoner commented 8 years ago

Thanks for the explanation @jmontoyam. WDYT about @cbrnr's suggestion earlier:

since everyone is using extended Infomax, maybe standard Infomax should be deprecated or at least a warning message should be generated.

@jona-sassenhagen likes it, and I agree it might be a good way to go. I'm happy with either deprecation or a clear warning message like "infomax can be sensitive to differences in floating point arithmetic, we recommend using extended infomax for enhanced reproducibility and stability"?

agramfort commented 8 years ago

Thanks for the explanation @jmontoyam. WDYT about @cbrnr's suggestion earlier:

since everyone is using extended Infomax, maybe standard Infomax should be deprecated or at least a warning message should be generated.

infomax is still the default in eeglab ... let's not deprecate it.

@jona-sassenhagen likes it, and I agree it might be a good way to go. I'm happy with either deprecation or a clear warning message like "infomax can be sensitive to differences in floating point arithmetic, we recommend using extended infomax for enhanced reproducibility and stability"?

what @jmontoyam is telling you is that it is so far not clear that extended-infomax is more stable. extended-infomax does suffer from the same numerical issues.

larsoner commented 8 years ago

If we take a look at the infomax source code, we can see that the non-linearity of extended-infomax involves the hyperbolic tangent function, which is bounded, whereas the non-linearity of infomax involves the exponential function, which is unbounded and may therefore raise under/overflow errors, which in turn can affect the comparison of two different numerical linear algebra libraries.

@agramfort I took this to mean that extended-infomax would be more stable due to its use of a bounded nonlinearity, no?

jona-sassenhagen commented 8 years ago

infomax is still the default in eeglab ... let's not deprecate it.

Nope, extended is standard:

[screenshot: EEGLAB defaulting to extended Infomax]

We should consider what it would mean to default to extended in MNE. In EEGLAB, the difference between extended and regular Infomax is a keyword. In MNE it is too, but we also allow a string method, so we have two different ways of accessing extended Infomax: either method='extended-infomax', or method='infomax' with fit_params={'extended': True} (or something like that).
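
For reference, a sketch of the two entry points (parameter names as discussed in this thread; treat it as illustrative, since details may differ across MNE versions):

from mne.preprocessing import ICA

# via the dedicated method string
ica_a = ICA(n_components=20, method='extended-infomax')

# via plain infomax plus a fit parameter
ica_b = ICA(n_components=20, method='infomax', fit_params=dict(extended=True))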

jona-sassenhagen commented 8 years ago

Actually, method= overwrites the 'extended' part of fit_params. This is less than optimal IMO.

In my opinion, infomax and extended-infomax should default to the same thing, with fit_params defaulting to extended=True for both. You could still turn extended off manually by setting extended=False in fit_params. Same API, no breakage, arguably more what you'd expect it to do, and more like EEGLAB.

jona-sassenhagen commented 8 years ago

@agramfort I'm currently looking a bit at extended vs. old infomax in the literature and my own data btw. Maybe normal infomax is not as bad as I've been told ...

jona-sassenhagen commented 8 years ago

In a quick experiment I did, extended infomax is indeed better at isolating line noise (the primary benefit claimed by Lee et al. 1999).

[figure: line-noise isolation, extended vs. standard infomax]

Informally speaking, it's also, if anything, faster than non-extended infomax.

jona-sassenhagen commented 8 years ago

And here is fastica ... It's actually the best!

[figure: FastICA result on the same experiment]

(@kingjr)

Edit: wait, I might have done something wrong with the experiment ... running again.

agramfort commented 8 years ago

see https://github.com/mne-tools/mne-python/pull/3108

basically ICA defaults to fastica and if you want infomax or extended-infomax you need to change the method param explicitly

ok?

jona-sassenhagen commented 8 years ago

Ran the results again (I had a bug) and while not as striking, the overall picture is the same, with Fastica > extended > old.

[figure: re-run comparison, FastICA > extended > old infomax]

agramfort commented 8 years ago

ok so happy with my PR?

jona-sassenhagen commented 8 years ago

I'd rather go the other direction - deprecate the extended-infomax method keyword, and allow setting extended via fit params (with both 'infomax' and 'extended-infomax' defaulting to extended=True). That would be more like EEGLAB does it, and it would also make extended the default for infomax, as 1. it arguably should be, 2. EEGLAB does it. Do you understand what I'm trying to say?

agramfort commented 8 years ago

I'd rather go the other direction - deprecate the extended-infomax method keyword, and allow setting extended via fit params (with both 'infomax' and 'extended-infomax' defaulting to extended=True). That would be more like EEGLAB does it, and it would also make extended the default for infomax, as 1. it arguably should be, 2. EEGLAB does it. Do you understand what I'm trying to say?

this requires an API change. Also, since infomax is not the default, people who want to change the method have to do so explicitly, and if they want the extended version they do it on purpose and give credit to Lee et al. rather than Bell et al.

cbrnr commented 8 years ago

@jona-sassenhagen I don't think comparing the performance of ICA algorithms on line noise is necessarily representative of their performance on EEG/MEG data. That said, Infomax doesn't seem to do well at isolating line noise, but it excels at producing dipolar sources - much better than e.g. FastICA. Here's a nice comparison of some 20 ICA algorithms regarding their performance on EEG data: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030135

Figure 4 is particularly interesting - it shows that Infomax-based algorithms (standard Infomax, extended Infomax, and AMICA) perform best in terms of the two criteria used in this comparison.

Regarding the use of Infomax in MNE: I think standard and extended Infomax are pretty much comparable in performance. However, almost all EEG people I know use extended Infomax, because it is an improvement upon the original algorithm. It is also the default choice in EEGLAB.

Given that plus the fact that standard Infomax is obviously more sensitive to numerical errors, I'm in favor of extended Infomax over standard Infomax.

Since I use the Infomax function directly, I cannot comment on defaults and API changes pertaining to MNE as a whole (but an API change is always something that shouldn't happen too lightly).

jona-sassenhagen commented 8 years ago

Clemens: the paper introducing the extended algorithm (Lee et al. 1999) presented the ability to isolate line noise (as a sub-Gaussian component) as the case where extended has benefits over the old style.

cbrnr commented 8 years ago

Right, in theory extended Infomax can isolate sub-Gaussian components such as line noise, whereas standard Infomax should completely fail in this case (as it only separates super-Gaussian components). In my experience, extended Infomax greatly reduces line noise, but sometimes it fails to completely isolate it (maybe due to non-stationarity of the noise). In any case, this is another reason why I would choose the extended over the standard implementation.

larsoner commented 8 years ago

Given that plus the fact that standard Infomax is obviously more sensitive to numerical errors, I'm in favor of extended Infomax over standard Infomax.

@agramfort this is what I referenced earlier -- can you add this to your PR? My original suggestion was "infomax can be sensitive to differences in floating point arithmetic, we recommend using extended infomax for enhanced reproducibility and stability" but perhaps that is too strong.

agramfort commented 8 years ago

better?