octopize / saiph

A projection package
https://saiph.readthedocs.io
Apache License 2.0

Fix inverse transform changing the distribution a lot for MCA #72

Open tcrasset opened 2 years ago

tcrasset commented 2 years ago

The current state of inverse_transform for MCA yields, in the test test_inverse_from_coord_mca, a (big) discrepancy between MacOS M1 (the right result below) and Linux Intel (the left result). Indeed, the test passes on MacOS but fails on Linux with the following values.

saiph/inverse_transform_test.py:136: in test_inverse_from_coord_mca
    assert_series_equal(
pandas/_libs/testing.pyx:52: in pandas._libs.testing.assert_almost_equal
    ???
pandas/_libs/testing.pyx:167: in pandas._libs.testing.assert_almost_equal
    ???
E   AssertionError: Series are different
E   
E   Series values are different (75.0 %)
E   [index]: [count, mean, std, min, 25%, 50%, 75%, max]
E   [left]:  [683.0, 1.16398243045388, 0.5648994509823707, 1.0, 1.0, 1.0, 1.0, 4.0]
E   [right]: [683.0, 4.430453879941435, 2.74087303323353, 1.0, 2.0, 4.0, 5.5, 10.0]

Notice how the mean and the std are quite different from the intended results. Note that this is with use_max_modalities=False, which means we are using some randomness to pick the modalities for the categories.
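
For reference, here is a minimal sketch of the difference between the two strategies as we understand them (an illustration with made-up labels and weights, not saiph's actual code):

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical reconstructed weights for one categorical variable of one
    # individual, and the corresponding modality labels (made-up values).
    modalities = np.array(["1", "2", "4", "6"])
    weights = np.array([0.55, 0.25, 0.15, 0.05])

    # use_max_modalities=True: deterministic, keep the most likely modality.
    chosen_max = modalities[np.argmax(weights)]

    # use_max_modalities=False: sample a modality proportionally to the
    # weights; this preserves the marginal distribution better but depends
    # on the seed and on the exact floating-point weights from the SVD.
    chosen_random = rng.choice(modalities, p=weights / weights.sum())

    print(chosen_max, chosen_random)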

What's interesting is that this test has values that are quite OK when we go back to a previous commit, 091ffdf. This is no coincidence: 18afa43, the commit after 091ffdf, introduced the randomization.

We can reproduce the discrepancy by setting use_max_modalities=True, which uses no randomization at all. With that setting we get values that are closer to the MacOS M1 values, but not identical, even though a seed is set and they should therefore be the same. And it's even stranger that two Linux machines with Intel processors do not get the same values either.

@jpetot and I traced it back to the output of the SVD being (quite a bit) different on our respective architectures (we did not compare the two Linux machines).

We looked around and couldn't really find a reason why the SVD computes something different on the two architectures, given that they receive the same inputs (we checked). This numpy issue seems to say that a matrix that is close to singular can lead to different eigenvectors being produced, but there are not many other opinions out there (we haven't checked thoroughly).
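
To make this concrete, here is a small self-contained example (random data, not ours) showing that for a near-singular matrix the trailing singular vectors are essentially arbitrary, so two builds can legitimately return different ones:

    import numpy as np
    from scipy.linalg import svd

    rng = np.random.default_rng(0)

    # A rank-deficient matrix (8 columns, rank 5), similar to a one-hot
    # encoded table where the modality columns of each variable sum to 1.
    A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 8))

    _, s1, Vt1 = svd(A, full_matrices=False)
    # A tiny perturbation stands in for a different BLAS kernel or
    # summation order on another architecture.
    _, s2, Vt2 = svd(A + 1e-12 * rng.standard_normal(A.shape), full_matrices=False)

    print(s1.round(12))         # the last 3 singular values are ~0
    print(np.allclose(s1, s2))  # the singular values still agree...
    # ...but the rows of Vt attached to the ~0 singular values form an
    # arbitrary basis of the null space and need not agree at all:
    print(np.abs(Vt1[-3:] @ Vt2[-3:].T).round(3))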

Below is the comparison of the distributions for every variable, using use_max_modalities=True: combined.txt

tcrasset commented 2 years ago

Here is a preview of the V matrix after the SVD.

It seems that mainly the last rows, i.e. the last dimensions, are affected. For MacOS:

  [[-1.26283112e-01 -6.40445458e-02 -7.18225448e-02 ...  8.57881145e-02
  -2.05762202e-01  2.80451737e-01]
 [ 4.99837195e-02  7.15111891e-03 -2.83560531e-02 ...  2.25232007e-01
   1.46423089e-02 -1.99573144e-02]
 [ 5.11562931e-02  2.19225451e-02  1.70491081e-03 ... -7.87385353e-02
   9.80054658e-03 -1.33580428e-02]
 ...
 [ 0.00000000e+00  4.52631048e-16 -2.80867795e-17 ...  2.58691621e-02
  -1.25065947e-01 -9.17585501e-02]
 [-0.00000000e+00 -3.46285389e-17 -2.97547552e-17 ... -1.01248535e-01
   4.77328637e-01  3.50207107e-01]
 [-0.00000000e+00 -1.48078167e-17  4.07243270e-17 ...  9.40522484e-02
   5.76828004e-01  4.23207935e-01]]

For Linux:

[[-1.26283112e-01 -6.40445458e-02 -7.18225448e-02 ...  8.57881145e-02
  -2.05762202e-01  2.80451737e-01]
 [ 4.99837195e-02  7.15111891e-03 -2.83560531e-02 ...  2.25232007e-01
   1.46423089e-02 -1.99573144e-02]
 [ 5.11562931e-02  2.19225451e-02  1.70491081e-03 ... -7.87385353e-02
   9.80054658e-03 -1.33580428e-02]
 ...
 [-0.00000000e+00  2.22765427e-16 -3.82654851e-16 ...  2.53331473e-02
  -3.47037214e-02 -2.54614723e-02]
 [-0.00000000e+00  1.49831051e-16  2.00377516e-16 ...  7.66084898e-03
  -1.37313641e-01 -1.00744454e-01]
 [-0.00000000e+00 -8.15852388e-17 -1.56691942e-16 ... -4.88311550e-02
  -4.30929081e-01 -3.16164620e-01]]
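
One way to check whether the two V matrices differ only in the rows tied to the (near-)zero singular values is to dump them on each machine and compare the subspaces they span. A sketch, assuming hypothetical Vt_darwin.npy, Vt_linux.npy and s_darwin.npy dumps produced with np.save:

    import numpy as np
    from scipy.linalg import subspace_angles

    # Hypothetical dumps copied over from the two machines.
    Vt_darwin = np.load("Vt_darwin.npy")
    Vt_linux = np.load("Vt_linux.npy")
    s = np.load("s_darwin.npy")  # singular values from one machine (assumed close on both)

    # Keep only the rows attached to non-negligible singular values.
    keep = s > 1e-10 * s[0]

    # Rows may simply flip sign between builds, so compare |cos| row by row...
    row_cos = np.abs(np.sum(Vt_darwin[keep] * Vt_linux[keep], axis=1))
    print(row_cos.round(6))  # ~1.0 wherever a row agrees up to sign

    # ...and check that the retained rows span the same subspace overall.
    angles = subspace_angles(Vt_darwin[keep].T, Vt_linux[keep].T)
    print(np.degrees(angles).max())  # ~0 if only the discarded rows differ
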
tcrasset commented 2 years ago

This stackoverflow question seems to indicate that the FORTRAN/C code called by the SVD is an iterative solver that seeds its randomness at the very first call and then never again. This may be the source of our non-determinism in the SVD. Note however that the question is about svds, which is the sparse version of svd.
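
If we ever do use svds, that source of randomness can at least be pinned down with an explicit start vector; a quick sketch (our code currently calls the dense svd, so this only illustrates the point made in the linked question):

    import numpy as np
    from scipy.sparse.linalg import svds

    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 50))

    # svds (ARPACK) starts from a random vector unless v0 is given, so runs
    # can differ between processes.  A fixed start vector removes that
    # randomness (it does not remove BLAS rounding differences, though).
    v0 = np.ones(min(A.shape))
    u1, s1, vt1 = svds(A, k=10, v0=v0)
    u2, s2, vt2 = svds(A, k=10, v0=v0)

    print(np.allclose(s1, s2), np.allclose(np.abs(vt1), np.abs(vt2)))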

tcrasset commented 2 years ago

I don't know how valuable this is, but one solution might be to use numpy.linalg.svd instead of scipy.linalg.svd and build numpy from source. This could potentially allow us to pin the BLAS libraries so that they are the same across architectures at compile time. One (BIG) disadvantage is the compile time.
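
The swap itself would be mechanical; a sketch of the drop-in change (on a single machine the two typically agree to floating-point precision, since both default to LAPACK's gesdd):

    import numpy as np
    import scipy.linalg

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 20))

    # The practical difference is only which BLAS/LAPACK build each package
    # is linked against, which is what pinning at compile time would control.
    U_np, s_np, Vt_np = np.linalg.svd(A, full_matrices=False)
    U_sp, s_sp, Vt_sp = scipy.linalg.svd(A, full_matrices=False)

    print(np.allclose(s_np, s_sp), np.allclose(np.abs(Vt_np), np.abs(Vt_sp)))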

tcrasset commented 2 years ago

It seems like we are all running the OpenBLAS library for the low-level routines, so the suggestion above about compiling numpy from source is not needed.

Example (from Linux x86_64): poetry run python -c "import numpy as np; np.__config__.show()" (Although we are using scipy.linalg.svd, the command above gives the same results if we replace numpy with scipy.)

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
    not found = AVX512_KNL,AVX512_KNM

And the same for the MacOS M1 chip:

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/opt/arm64-builds/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/opt/arm64-builds/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/opt/arm64-builds/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/opt/arm64-builds/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/opt/arm64-builds/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/opt/arm64-builds/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/opt/arm64-builds/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/opt/arm64-builds/lib']
Supported SIMD extensions in this NumPy install:
    baseline = NEON,NEON_FP16,NEON_VFPV4,ASIMD
    found = ASIMDHP
    not found = ASIMDDP
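
That said, both machines reporting OpenBLAS does not guarantee the same OpenBLAS version or the same CPU-specific kernels. If the threadpoolctl package is available (it is not a saiph dependency), it can show what is actually loaded at runtime:

    from pprint import pprint

    import numpy as np  # importing numpy loads its bundled OpenBLAS
    from threadpoolctl import threadpool_info

    # Each entry reports the BLAS implementation, its version and, for
    # OpenBLAS, the CPU kernel set selected at runtime ("architecture");
    # these can differ between the x86_64 and arm64 wheels even though
    # both report openblas64_ in np.__config__.show().
    pprint(threadpool_info())
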
tas17 commented 2 years ago

On M1: distributions_Darwin

tcrasset commented 2 years ago

To first get a high-level overview (we should have started with this), we plotted the distributions of the original data and of the inverse-transformed data on the different platforms. As shown above, the distributions @tas17 gets on M1 are closer to the original than the ones on Linux.

In particular, we see that something very strange is happening for the Mitoses category: labels 6 and 1 seem to have been swapped. More generally, on the Linux platform we remove many more outliers from the rare modalities than on MacOS M1 (e.g. Single_Epithelial_Cell_Size or Bland_Chromatin), or do the inverse (decrease the occurrence of the most prominent modality), e.g. Single_Epithelial_Cell_Size. distributions_Linux
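
For reference, a minimal sketch of how such per-variable comparison plots can be produced (original and inverse are hypothetical DataFrames holding the original and the inverse-transformed records; this is not the exact script used for the figures above):

    import matplotlib.pyplot as plt
    import pandas as pd

    def plot_distributions(original: pd.DataFrame, inverse: pd.DataFrame, path: str) -> None:
        """Plot original vs inverse-transformed modality frequencies, one panel per column."""
        fig, axes = plt.subplots(
            len(original.columns), 1, squeeze=False,
            figsize=(6, 3 * len(original.columns)),
        )
        for ax, col in zip(axes[:, 0], original.columns):
            counts = pd.DataFrame({
                "original": original[col].value_counts(normalize=True),
                "inverse": inverse[col].value_counts(normalize=True),
            }).fillna(0).sort_index()
            counts.plot.bar(ax=ax, title=col)
        fig.tight_layout()
        fig.savefig(path)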

tcrasset commented 2 years ago

Here is the comparison between the inverse-transformed individuals on both machines: comparison_of_distributions_Linux

tcrasset commented 2 years ago

So, it turns out that the problem is actually somewhere else. We get the results we expect only if the coordinates are computed on the same machine as the one they are evaluated on. There is something about the SVD implementation that is architecture specific.

I ran this on my Linux machine; Archive 1 contains the coordinates generated on my Linux machine, and Archive 2 the coordinates generated on a MacOS M1 machine. comparison_distributions_buld_and_generated_on_different_machines
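
For anyone who wants to quantify the gap between the two archives, a sketch assuming both sets of coordinates were exported as parquet files (hypothetical file names):

    import pandas as pd

    coord_linux = pd.read_parquet("coord_linux.parquet")    # Archive 1
    coord_darwin = pd.read_parquet("coord_darwin.parquet")  # Archive 2

    # Per-dimension maximum absolute difference; given the V matrices above,
    # we would expect the divergence to be concentrated in the trailing
    # dimensions, which correspond to the tiny singular values.
    print((coord_linux - coord_darwin).abs().max())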