mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License

ICA results depend on numerical library? #4922

Closed: cbrnr closed this issue 5 years ago

cbrnr commented 6 years ago

We've discussed this in #3049 already, but I think we haven't really solved the problem. We found out that Infomax is less stable than Extended Infomax, but the problem also occurs in the latter.

Recently I've had a discussion with @dengemann, where he mentioned that he gets different ICA results across different machines, even though the initial conditions (random seeds) are identical.

I have done a very preliminary test where I decomposed EEG data with Extended Infomax ICA using the following three BLAS implementations on Arch Linux:

  1. Standard BLAS
  2. OpenBLAS
  3. MKL (from Anaconda)

Not surprisingly, standard BLAS is slow because it is single-threaded. OpenBLAS was 3.1x faster, and MKL was 3.2x faster.

Then I compared the resulting unmixing matrices. Surprisingly, they are identical for BLAS and OpenBLAS (np.allclose returns True), but MKL differs. Not only can the sign of each component change (which we could solve by enforcing e.g. a positive sign for the first element of the largest component), but in general elements of the matrix differ in the third place after the decimal point. Here are the first 8 elements of the BLAS/OpenBLAS result:

array([ 0.12361416, -0.00420886,  0.12972655, -0.00492072,  0.01694731,
        0.16542082, -0.055074  , -0.05557361])

And here are the elements from the MKL result:

array([-0.12356685,  0.00327656, -0.13013124,  0.00669525, -0.01721278,
       -0.16554963,  0.05511959,  0.05629636])

What do you think? Should we accept that different numerical libraries compute things differently and that numerical errors therefore blow up? I'm not entirely happy with this situation, to the point where it could become a problem for using ICA at all (e.g. people might conclude that ICA is too unreliable and avoid it). Why are the results of BLAS and OpenBLAS identical? These are two independent libraries. And why is MKL different? Couldn't this be an MKL issue?

larsoner commented 6 years ago

Should we accept that different numerical libraries compute things differently

This is a fact of life in floating point computations across libraries, architectures, etc., so we have to accept it :) Software needs to be designed to robustly mitigate rather than exacerbate this issue.

therefore numerical errors blow up

This depends on the algorithm being used. By way of example, FIR filtering in general does not suffer from these numerical imprecision issues because it's robust to small errors (they do not accumulate or get amplified) -- it's why we can use FFTs instead of direct computation for speed, despite the two being identical only to numerical precision. Something like high-order IIR filtering on the other hand can suffer, as having lots of poles (high order) can lead to an unstable filter -- their proximity to the unit circle can "break" filtering -- so appropriate mitigation strategies must be used, such as second-order sections representations that reduce rather than amplify numerical precision problems.
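
To make the second-order-sections point concrete, here is a minimal sketch (my own illustration, assuming SciPy is available): a high-order low-pass filter applied via its transfer-function coefficients can blow up from accumulated rounding error, while the same filter applied as second-order sections stays well behaved.

```python
import numpy as np
from scipy import signal

rng = np.random.RandomState(0)
x = rng.randn(10000)

# 16th-order Butterworth low-pass with a very low cutoff (normalized frequency)
b, a = signal.butter(16, 0.01, output='ba')   # transfer-function form
sos = signal.butter(16, 0.01, output='sos')   # second-order sections form

y_ba = signal.lfilter(b, a, x)    # typically unstable: output grows without bound
y_sos = signal.sosfilt(sos, x)    # numerically robust: output stays bounded

print(np.abs(y_ba).max(), np.abs(y_sos).max())
```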

As far as why BLAS / OpenBLAS are equivalent, it might be a matter of luck (they are probably not always numerically identical), or it might be that they happen to do the same set of operations (e.g., summation order, use of accumulators, use of SIMD instructions in OpenBLAS in ways that don't affect the output). I doubt that MKL is doing something incorrect with their computations, though it's always possible.
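
For what it's worth, the order-of-operations point can be seen even without any BLAS: summing the same numbers left to right, pairwise, or in chunks (as a threaded library might) gives answers that agree only to within rounding error. A tiny illustration:

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.randn(100000)

left_to_right = sum(x.tolist())                               # strict sequential sum
pairwise = float(np.sum(x))                                   # NumPy's pairwise summation
chunked = sum(float(c.sum()) for c in np.array_split(x, 4))   # four partial sums, then combined

print(left_to_right == pairwise == chunked)       # usually False
print(left_to_right - pairwise, pairwise - chunked)
```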

I think in general we should be wary of using algorithms that give appreciably different results using different libraries. We should improve them or find strategies to mitigate these effects. I'm not sure if we're there with ICA or not.

agramfort commented 6 years ago

yes, ICA is sensitive to the BLAS implementation. There is not much to do about it, as @larsoner said.

cbrnr commented 6 years ago

This is a fact of life in floating point computations across libraries, architectures, etc., so we have to accept it :) Software needs to be designed to robustly mitigate rather than exacerbate this issue.

Yes, this is clear, I was referring to the (Extended) Infomax ICA implementation. We're definitely not there with ICA (at least the MNE Infomax implementation is affected by numerical issues, I haven't tested other implementations). I will try to see if this is also an issue with FastICA, and then it will be interesting to see if other Infomax implementations suffer from the same problem.

yes ICA is sensitive to blas implementation. There is not much to do about it

I'm not sure if ICA is in general sensitive to BLAS implementations, right now I only know this is the case for the MNE implementation. Or do you know of studies that looked at this issue for ICA in general? If this is really the case for ICA in general, then I couldn't recommend using it if results are determined by chance.

cbrnr commented 6 years ago

The results computed on a Mac using Accelerate are identical to the BLAS/OpenBLAS results. I don't know, I'll do some more tests, but if only MKL produces different results this could be a problem.

cbrnr commented 6 years ago

MKL on Mac also yields results identical to BLAS/OpenBLAS/Accelerate. So currently only MKL on Linux produces different results. I'm starting to think this is not an ICA issue at all.

mmagnuski commented 6 years ago

I don't think this should be a big issue - after all, you get slightly different ICA results with different initializations, and the sign arbitrariness of components is a fact of life too. If I want to reproduce the exact same components, I just save them to disk. But I admit that it's a bit weird that MKL gives different results only on Linux.

agramfort commented 6 years ago

honestly do tiny tests but don't spend too much time on this.

it's not a bug.

my 2c

cbrnr commented 6 years ago

@agramfort yeah, I can't spend much more time on it anyway, but ask @dengemann, this is a big issue for him. If we find out that it's just a particular MKL implementation that misbehaves, then why not investigate further. With identical initial conditions, I expect to get identical results. This seems to be the case everywhere except for MKL on Linux (I'll do a Windows test soon).

agramfort commented 6 years ago

you won't be able to guarantee the MKL version people will use. So you have to live with it.

cbrnr commented 6 years ago

If all results are identical (even MKL results) except for Linux MKL, I would consider this a bug.

agramfort commented 6 years ago

even Matlab ships with this MKL version and does not guarantee numerical matches between systems.

see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3365894/

Freesurfer has this problem too, and there is not much to do about it.

cbrnr commented 6 years ago

Thanks for the article link. I'll still try and see how MKL results on Windows compare. If nothing else, maybe we could at least consistently deal with the sign - do you think this is possible/useful (e.g. sort ICs according to explained variance, set the first weight of the largest component to be positive)?

mmagnuski commented 6 years ago

The ICs are already sorted by explained variance. I would prefer to make a function/method available (ica.invert_signs(picks=None)?) rather than to enforce the sign by default.
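
A hypothetical sketch of that idea (the helper name and the convention of making each row's largest-magnitude element positive are illustrative, not existing MNE API):

```python
import numpy as np

def normalize_component_signs(unmixing):
    """Return a copy of the unmixing matrix with a deterministic sign per row."""
    unmixing = np.array(unmixing, copy=True)
    for row in unmixing:
        if row[np.argmax(np.abs(row))] < 0:
            row *= -1   # flip this component's sign in place
    return unmixing

# Usage: compare two decompositions up to the sign ambiguity
# np.allclose(normalize_component_signs(W1), normalize_component_signs(W2))
```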

cbrnr commented 6 years ago

MKL on Windows also gives exactly the same results as all other OS/BLAS combinations. So the only different result is Linux/MKL. I'm sure there's an explanation, but I wouldn't dismiss it as "that's how numeric floating point life is" given the circumstances. This might well be a serious problem @dengemann has encountered that led to non-reproducible results when using Linux/MKL. Not sure how to proceed though.

dengemann commented 6 years ago

The experience we've had with @jasmainak @larsoner and @agramfort while working on the latest MNE group analysis demo paper was a bit traumatizing to me. I am starting to think that we may want to have alternatives to ICA in our pockets when strong reproducibility is a concern.

cbrnr commented 6 years ago

@dengemann yes, but this is something I would want to triple-check before declaring ICA as too unreliable. I still don't think that this is the case, because I get numerically identical (and therefore reproducible) results everywhere except for one platform/BLAS combination.

jasmainak commented 6 years ago

@cbrnr if you figure out what is different for the Linux MKL version, do not hesitate to post it. I have a feeling it might simply be a matter of setting some environment variables (like number of threads = 1).

See: https://software.intel.com/en-us/articles/getting-reproducible-results-with-intel-mkl/
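
In case it helps, a rough sketch of that environment-variable approach (variable names per Intel's Conditional Numerical Reproducibility documentation linked above; they only take effect if set before NumPy, and hence MKL, is first imported):

```python
import os

# Must happen before the first NumPy/MKL import, e.g. at the very top of the
# script or in the shell that launches it.
os.environ["MKL_NUM_THREADS"] = "1"      # single thread => fixed reduction order
os.environ["MKL_CBWR"] = "COMPATIBLE"    # Conditional Numerical Reproducibility mode

import numpy as np  # noqa: E402
import mne          # noqa: E402
```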

That said, I do agree that ICA seems like a sensitive algorithm. Even if we do manage to find out what makes it reproducible, I fear this would be only a monkey patch hiding the real problem.

cbrnr commented 6 years ago

Yes, maybe it's the number of threads. Small variations in the output are OK, but if they are really that large ICA would be a nightmare for reproducibility. I'm not aware that anyone has ever studied this, but it would be very useful. So far we are only speculating, and I couldn't say whether this problem is inherent in ICA or if it is due to a particular implementation.

cbrnr commented 6 years ago

I'm still not convinced that this can be blamed on ICA being a sensitive algorithm. I did some more tests with Linux MKL, changing MKL_NUM_THREADS. The results are weird. With 1 and 3 threads, I get numerically identical results to all other results I mentioned before. I only get different results with 2 and 4 threads, and the differences are too large to be ignored.

If I got different results on other platforms using MKL, or different results using other libraries, I wouldn't bother any further. But it is really weird that results are reproducible for all libs and all platforms except for MKL Linux 2 and 4 threads...

larsoner commented 6 years ago

There is a long-standing MKL bug on Linux having to do with an interaction between number of threads and LAPACK GESDD at least:

https://software.intel.com/en-us/comment/1869912

For me this has usually manifested as a LinAlgError, but it's possible that this error is incidental and that results can also be wrong without any error being raised (which is actually much worse!).

@cbrnr can you try the sample files on that Intel thread and see which pass/fail on your Linux system?
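
In case it's useful, a check along these lines is presumably what's needed (the file name here is just a placeholder for one of the arrays attached to the Intel thread; whether the SVD converges may depend on MKL version, CPU, and thread count):

```python
import numpy as np

a = np.load('bad.npy')   # placeholder: one of the arrays from the Intel thread
try:
    np.linalg.svd(a)
    print('SVD converged')
except np.linalg.LinAlgError as err:
    print('SVD failed:', err)
```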

cbrnr commented 6 years ago

Interesting. Unfortunately, all three .npy files work on my Linux system with any number of threads (=1, 2, 3, and 4).

larsoner commented 6 years ago

Actually there are 6 files (I added more as I found new failures with CPU updates):

https://staff.washington.edu/larsoner/bad.npy
https://staff.washington.edu/larsoner/bad_avx.npy
https://staff.washington.edu/larsoner/bad_sse42_2.npy
https://staff.washington.edu/larsoner/bad_sse42_3.npy
https://staff.washington.edu/larsoner/fail.npy
https://staff.washington.edu/larsoner/bad_kabylake.npy

BTW what is your CPU model (cat /proc/cpuinfo | grep "model name")?

If you are interested in pursuing this a bit, I'd recommend trying to isolate the problem to a specific failing NumPy or SciPy function. But this is quite time consuming...

In the meantime maybe we should do some speed tests for OpenBLAS vs MKL. My understanding is that they are pretty close with the exception of a few isolated LAPACK functions that we don't use extensively here (as opposed to e.g. OpenMEEG which does use them). We might actually consider recommending nomkl and putting this in our environment, but we'll have to do benchmarking legwork.

larsoner commented 6 years ago

... and by "pursuing this", I mean "figuring out why MKL breaks ICA on Linux" (not my SVD problem)

cbrnr commented 6 years ago

Indeed, bad_kabylake.npy results in LinAlgError: SVD did not converge only for MKL_NUM_THREADS >= 4 (that is, I do not get this error for 1, 2 or 3 threads). My CPU model name is "Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz" (4 cores).

What does this mean for the ICA issue? Do you think this could be related? How could I pursue this given that I neither get an error nor a warning?

Regarding OpenBLAS vs. MKL, I think Anaconda's decision to ship MKL by default is rather controversial because MKL is not open source. To be clear, this is not about the quality of MKL, but a general concern about licensing. In my very limited speed comparisons, OpenBLAS was almost as fast as MKL. Does MNE have any tests dedicated to benchmarking?

larsoner commented 6 years ago

What does this mean for the ICA issue? Do you think this could be related?

It's possibly related. The LinAlgError might just be the most visible manifestation of an incorrect computation. I suspect that even when it's not raised, some computations could still be wrong (just silently so).

How could I pursue this given that I neither get an error nor a warning?

The tedious but perhaps necessary approach would be to drill down to some command (e.g., np.dot, linalg.svd, etc.) that actually wraps to an MKL call (likely LAPACK) that gives different results depending on the number of threads.

I would do it by working backward. Let's say you know the output with N_THREADS=1 is correct, i.e., matches all other implementations / systems (if it's not, then swap in "using OpenBLAS" for all "N_THREADS=1" in what follows). We know that the input data to ICA when running with N_THREADS=1 and N_THREADS=4 is np.allclose (right?), but that the output is np.allclose with N_THREADS=1 but not N_THREADS=4. I would work to pin this down, probably from the back end (though you could go from the front if you want):

  1. Set the number of iterations of the algorithm to 1 (if possible) to see if the output still differs. This will simplify things. If not, things can still proceed; it's just more tedious.
  2. Run with N_THREADS=1, put in a np.save-like call to save any intermediate arrays in the second-to-last step.
  3. Add in assert_allclose(...) statements for the given computation and the np.load of data from N_THREADS=1 for that step. These should pass for N_THREADS=1 but fail for N_THREADS=4.
  4. Repeat steps 2 and 3 (moving backward in algorithm processing steps) n times until both N_THREADS=1 and N_THREADS=4 pass. This tells you that the problem is at step -n (negative indexing :) ).
  5. To verify that it really is step n (which I now think might be a linalg.svd call), swap in the outputs for N_THREADS=1 rather than computing this step. If all assert statements now pass with N_THREADS=1 and N_THREADS=4 after making this change, then it's definitely that step causing the problem (or at least a problem), and we now have the data files necessary to make a minimal failing example for Intel. A rough sketch of this save-and-compare pattern is shown below.
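
A minimal sketch of the save-and-compare pattern described in steps 2-3 (the helper name, file naming, and the example SVD step are illustrative, not MNE internals):

```python
import os
import numpy as np
from numpy.testing import assert_allclose

def check_step(name, array):
    """Save the array on the reference run; compare against it on later runs."""
    fname = '{}_ref.npy'.format(name)
    if os.environ.get('MKL_NUM_THREADS') == '1':   # reference configuration
        np.save(fname, array)
    else:
        # Fails (with a useful diff) at the first step whose output diverges
        assert_allclose(array, np.load(fname), err_msg=name)

# Example: sprinkle calls like this through the algorithm's intermediate steps
rng = np.random.RandomState(0)
a = rng.randn(50, 50)
u, s, v = np.linalg.svd(a)
check_step('svd_singular_values', s)
```
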
larsoner commented 6 years ago

Does MNE have any tests dedicated to benchmarking?

No, not officially. I would run make test a few times locally on MKL, noting the slowest times and total time, then do the same thing on OpenBLAS. This would be a good start.
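
As a rough starting point, an informal micro-benchmark of BLAS/LAPACK-heavy operations (my own sketch, not an existing MNE test) could be run once in an MKL environment and once in a nomkl/OpenBLAS environment:

```python
import time
import numpy as np

rng = np.random.RandomState(0)
a = rng.randn(2000, 2000)
b = rng.randn(2000, 2000)

for name, func in [('matmul', lambda: a @ b),
                   ('svd', lambda: np.linalg.svd(a)),
                   ('eigh', lambda: np.linalg.eigh(a @ a.T))]:
    t0 = time.time()
    func()
    print('{}: {:.2f} s'.format(name, time.time() - t0))
```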

cbrnr commented 6 years ago

Thanks, this sounds like a good plan and like a lot of work :smile:. Now the main problem is to find the time to work on it.

FWIW, I found that bad_kabylake.npy also results in an error on macOS using Anaconda/MKL with MKL_NUM_THREADS=4. On Windows (in a VM), I don't get this error.

larsoner commented 6 years ago

@cbrnr any time to look into this? I suspect that finding such a "silent bug" might be something that would get Intel to look into things, as it could be quite important.

larsoner commented 6 years ago

Or @cbrnr alternatively, if you can write or point me to the code (hopefully minimal) that you have seen be different on two platforms, I can see if I can get my different machines to also give me different results, and pursue it myself.

Is it just the top comment from https://github.com/mne-tools/mne-python/issues/3049#issue-142660958 ?

cbrnr commented 6 years ago

Unfortunately, the little toy example was related to a different issue.

I will try to prepare a minimal code example to demonstrate the different behavior - this will likely take some time, since currently I'm using a full-fledged EEG analysis pipeline with a large data set. I'll see what I can do.

cbrnr commented 6 years ago

@larsoner I've created a toy example that could expose the problem with the MNE sample data:

import os
import time
from itertools import combinations
import numpy as np
import mne
from mne.preprocessing import ICA
from mne.datasets import sample

data_path = sample.data_path()
raw_fname = data_path + '/MEG/sample/sample_audvis_filt-0-40_raw.fif'
raw = mne.io.read_raw_fif(raw_fname, preload=True)
raw.pick_types(meg=False, eeg=True)
raw.filter(1, None, fir_design='firwin')

icas = []
for threads in range(1, 5):
    os.environ["MKL_NUM_THREADS"] = str(threads)
    t0 = time.time()
    ica = ICA(method='extended-infomax', random_state=1).fit(raw)
    dur = time.time() - t0
    print("### ICA using MKL_NUM_THREADS={} took {}s ###".format(threads, dur))
    ica.save('mkl-{}-ica.fif.gz'.format(threads))
    icas.append(ica.unmixing_matrix_)

for pair in combinations(icas, 2):
    assert np.allclose(*pair)

Unfortunately, I don't have access to my Linux machine until mid next week, so I'm not sure if we are going to see this problem with this particular data set. If you want, you could go ahead and run this on your Linux machine to see if it works (I hope that setting the environment variable via os.environ is doing the right thing). Otherwise, I'll try that next week.

larsoner commented 6 years ago

IIRC setting threads only works at the beginning before anything is called. But I can still try this with np.save etc. to see if I can find differences.
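
If that's the case, one way around it would be to give each thread count its own process (sketch below; run_ica_once.py is a hypothetical script that fits the ICA once and saves the result):

```python
import os
import subprocess
import sys

# Hypothetical helper script that fits the ICA once and saves the unmixing matrix
script = 'run_ica_once.py'

for threads in (1, 2, 3, 4):
    env = dict(os.environ, MKL_NUM_THREADS=str(threads))   # set before the interpreter starts
    subprocess.check_call([sys.executable, script], env=env)
```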

cbrnr commented 6 years ago

Unfortunately I cannot reproduce the issue with the sample data.

I can still reproduce it with my BDF data set though, but now only MKL_NUM_THREADS=2 yields a slightly different result (previously, this was the case for 2 and 4 threads). Maybe because conda updated the MKL package recently. It doesn't matter, because the issue still persists.

Setting max_iter=1 results in identical outputs for any number of threads, so unfortunately it's not that easy. I'll try to isolate it further when I have more time.

cbrnr commented 6 years ago

Furthermore, even though the input data is identical for any number of threads (np.allclose), when I save the raw data via raw.save I get different results. Specifically, ICA results with 1 thread are different from the rest (which are identical). The difference is not as pronounced as when starting with the BDF data (see first post in this thread), but still large enough that np.allclose is False. This makes it a bit difficult to create a nice little example.

I won't have much time for this in the near future, so I'll leave this as is for now. One last note: FastICA doesn't have this problem; it seems to be specific to Extended Infomax.

larsoner commented 6 years ago

I can still reproduce it with my BDF data set though, but now only MKL_NUM_THREADS=2 yields a slightly different result

Can you share the script and (privately with me if you want) the file to replicate?

cbrnr commented 6 years ago

Sure, we'd need to go the private way though, because I can't share the data publicly. How can I contact you?

larsoner commented 5 years ago

Closing since whatever is left to do here is an upstream numerical lib bug (and I think we updated docs at some point?)