lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

UMAP Segmentation Faults #747

Open rhysnewell opened 3 years ago

rhysnewell commented 3 years ago

Hi,

I thought it might be best to create a new issue for this, as it seems a little different from https://github.com/lmcinnes/umap/issues/421, where I originally reported this error.

I've started getting a segfault when playing around with the numba threading layers (setting the layer to tbb) in order to use UMAP with ProcessPoolExecutor. It happened very suddenly, and now it consistently happens whenever I run UMAP inside a script, regardless of the threading layer or whether it is running inside a process pool.

The weird thing is that the segfault does not occur if I just run UMAP inside a Python terminal; it only occurs when I run it from the command line through a script.
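
For reference, the shape of what I'm running is roughly the following; the data and parameters are placeholders, not my actual pipeline:

```python
from concurrent.futures import ProcessPoolExecutor

import numba
import numpy as np
import umap

numba.config.THREADING_LAYER = "tbb"  # the threading layer I was experimenting with

def embed(seed):
    # each worker process fits its own UMAP model on its own data
    data = np.random.RandomState(seed).rand(1000, 32)
    return umap.UMAP(n_neighbors=15).fit_transform(data)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        embeddings = list(pool.map(embed, range(4)))
```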

The error looks like this on one set of data:

Fatal Python error: Segmentation fault

Current thread 0x00007f5c2431c700 (most recent call first):
  File "/home/n10853499/.conda/envs/rosella-dev/lib/python3.8/site-packages/umap/umap_.py", line 580 in fuzzy_simplicial_set
  File "/home/n10853499/.conda/envs/rosella-dev/lib/python3.8/site-packages/umap/umap_.py", line 2373 in fit
  File "/home/n10853499/.conda/envs/rosella-dev/lib/python3.8/site-packages/flight/rosella/embedding.py", line 405 in fit_transform
  File "/home/n10853499/.conda/envs/rosella-dev/lib/python3.8/site-packages/flight/rosella/rosella.py", line 248 in perform_binning
  File "/home/n10853499/.conda/envs/rosella-dev/lib/python3.8/site-packages/flight/flight.py", line 442 in bin
  File "/home/n10853499/.conda/envs/rosella-dev/lib/python3.8/site-packages/flight/flight.py", line 365 in main
  File "/home/n10853499/.conda/envs/rosella-dev/bin/flight", line 8 in <module>

And there is a secondary error on another set of data that looks like this:

Fatal Python error: Segmentation fault

Thread 0x00007f22211a4700 (most recent call first):
  File "/home/n10853499/.conda/envs/rosella-dev/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 874 in __init__
  File "/home/n10853499/.conda/envs/rosella-dev/lib/python3.8/site-packages/umap/umap_.py", line 328 in nearest_neighbors
  File "/home/n10853499/.conda/envs/rosella-dev/lib/python3.8/site-packages/umap/umap_.py", line 2415 in fit
  File "/home/n10853499/.conda/envs/rosella-dev/lib/python3.8/site-packages/flight/rosella/embedding.py", line 405 in fit_transform
  File "/home/n10853499/
Segmentation fault (core dumped)

Downgrading numba doesn't help with this issue, nor does downgrading pynndescent or using the master branch from the pynndescent GitHub. Additionally, this is happening in a fresh conda environment, so something pretty odd seems to be going on.

My conda environment looks like this:

# packages in environment at /home/n10853499/.conda/envs/rosella-dev:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
attrs                     21.2.0             pyhd8ed1ab_0    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                        py_2    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
biopython                 1.79             py38h497a2fe_0    conda-forge
blis                      0.8.1                h7f98852_1    conda-forge
brotlipy                  0.7.0           py38h497a2fe_1001    conda-forge
bwa                       0.7.17               h5bf99c6_8    bioconda
bzip2                     1.0.8                h7f98852_4    conda-forge
ca-certificates           2021.5.30            ha878542_0    conda-forge
cachecontrol              0.12.6                     py_0    conda-forge
certifi                   2021.5.30        py38h578d9bd_0    conda-forge
cffi                      1.14.4           py38ha312104_0    conda-forge
chardet                   4.0.0            py38h578d9bd_1    conda-forge
charset-normalizer        2.0.0              pyhd8ed1ab_0    conda-forge
cryptography              3.4.7            py38ha5dfef3_0    conda-forge
curl                      7.71.1               he644dc0_3    conda-forge
cycler                    0.10.0                     py_2    conda-forge
cython                    0.29.24          py38h709712a_0    conda-forge
decorator                 5.0.9              pyhd8ed1ab_0    conda-forge
flight-genome             1.2.1              pyh5e36f6f_0    bioconda
freetype                  2.10.4               h0708190_1    conda-forge
gsl                       2.6                  he838d99_2    conda-forge
hdbscan                   0.8.27           py38h5c078b8_0    conda-forge
hdmedians                 0.14.2           py38hb5d20a5_0    conda-forge
htslib                    1.9                  h4da6232_3    bioconda
idna                      3.1                pyhd3deb0d_0    conda-forge
imageio                   2.9.0                      py_0    conda-forge
iniconfig                 1.1.1              pyh9f0ad1d_0    conda-forge
ipython                   7.26.0           py38he5a9106_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.18.0           py38h578d9bd_2    conda-forge
joblib                    0.17.0                     py_0    conda-forge
jpeg                      9d                   h36c2ea0_0    conda-forge
k8                        0.2.5                h9a82719_1    bioconda
kiwisolver                1.3.1            py38h1fd1430_1    conda-forge
krb5                      1.17.2               h926e7f8_0    conda-forge
lcms2                     2.12                 hddcbb42_0    conda-forge
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libcblas                  3.9.0               10_openblas    conda-forge
libcurl                   7.71.1               hcdd3856_3    conda-forge
libdeflate                1.6                  h516909a_0    conda-forge
libedit                   3.1.20191231         h46ee950_2    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 11.1.0               hc902ee8_8    conda-forge
libgfortran-ng            11.1.0               h69a702a_8    conda-forge
libgfortran5              11.1.0               h6c583b3_8    conda-forge
libgomp                   11.1.0               hc902ee8_8    conda-forge
liblapack                 3.9.0               10_openblas    conda-forge
libllvm10                 10.0.1               he513fc3_3    conda-forge
libopenblas               0.3.17          pthreads_h8fe5266_1    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libssh2                   1.9.0                ha56f1ee_6    conda-forge
libstdcxx-ng              11.1.0               h56837e0_8    conda-forge
libtiff                   4.3.0                hf544144_0    conda-forge
libwebp-base              1.2.0                h7f98852_2    conda-forge
llvmlite                  0.36.0           py38h4630a5e_0    conda-forge
lockfile                  0.12.2                     py_1    conda-forge
lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
matplotlib-base           3.4.2            py38hcc49a3a_0    conda-forge
matplotlib-inline         0.1.2              pyhd8ed1ab_2    conda-forge
minimap2                  2.21                 h5bf99c6_0    bioconda
more-itertools            8.8.0              pyhd8ed1ab_0    conda-forge
msgpack-python            1.0.2            py38h1fd1430_1    conda-forge
natsort                   7.1.1              pyhd8ed1ab_0    conda-forge
ncurses                   6.1               hf484d3e_1002    conda-forge
numba                     0.53.1           py38h8b71fd7_1    conda-forge
numpy                     1.21.1           py38h9894fe3_0    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openblas                  0.3.17          pthreads_h4748800_1    conda-forge
openjpeg                  2.4.0                hb52868f_1    conda-forge
openssl                   1.1.1k               h7f98852_0    conda-forge
packaging                 21.0               pyhd8ed1ab_0    conda-forge
pandas                    1.3.1            py38h1abd341_0    conda-forge
parallel                  20160622                      1    bioconda
parso                     0.8.2              pyhd8ed1ab_0    conda-forge
patsy                     0.5.1                      py_0    conda-forge
perl                      5.32.1          0_h7f98852_perl5    conda-forge
perl-threaded             5.26.0                        0    bioconda
pexpect                   4.8.0              pyh9f0ad1d_2    conda-forge
pickleshare               0.7.5                   py_1003    conda-forge
pillow                    8.3.1            py38h8e6f84c_0    conda-forge
pip                       21.2.2             pyhd8ed1ab_0    conda-forge
pkg-config                0.29.2            h36c2ea0_1008    conda-forge
pluggy                    0.13.1           py38h578d9bd_4    conda-forge
prompt-toolkit            3.0.19             pyha770c72_0    conda-forge
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
py                        1.10.0             pyhd3deb0d_0    conda-forge
pycparser                 2.20               pyh9f0ad1d_2    conda-forge
pygments                  2.9.0              pyhd8ed1ab_0    conda-forge
pynndescent               0.5.4              pyh6c4a22f_0    conda-forge
pyopenssl                 20.0.1             pyhd8ed1ab_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pysam                     0.16.0.1         py38hbdc2ae9_1    bioconda
pysocks                   1.7.1            py38h578d9bd_3    conda-forge
pytest                    6.2.4            py38h578d9bd_0    conda-forge
python                    3.8.5           h4d41432_2_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.8                      2_cp38    conda-forge
pytz                      2021.1             pyhd8ed1ab_0    conda-forge
readline                  8.0                  h46ee950_1    conda-forge
requests                  2.26.0             pyhd8ed1ab_0    conda-forge
rosella                   0.3.3                h443a992_0    bioconda
samtools                  1.9                 h10a08f8_12    bioconda
scikit-bio                0.5.6            py38h0b5ebd8_4    conda-forge
scikit-learn              0.24.2           py38hdc147b9_0    conda-forge
scipy                     1.7.1            py38h56a6a73_0    conda-forge
seaborn                   0.11.1               hd8ed1ab_1    conda-forge
seaborn-base              0.11.1             pyhd8ed1ab_1    conda-forge
setuptools                49.6.0           py38h578d9bd_3    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sqlite                    3.32.3               hcee41ef_1    conda-forge
starcode                  1.4                  h779adbc_1    bioconda
statsmodels               0.12.2           py38h5c078b8_0    conda-forge
tbb                       2020.2               h4bd325d_4    conda-forge
threadpoolctl             2.2.0              pyh8a188c0_0    conda-forge
tk                        8.6.10               h21135ba_1    conda-forge
toml                      0.10.2             pyhd8ed1ab_0    conda-forge
tornado                   6.1              py38h497a2fe_1    conda-forge
traitlets                 5.0.5                      py_0    conda-forge
umap-learn                0.5.1            py38h578d9bd_1    conda-forge
urllib3                   1.26.6             pyhd8ed1ab_0    conda-forge
vt                        2015.11.10           he941832_3    bioconda
wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.5.0                ha95c52a_0    conda-forge
rhysnewell commented 3 years ago

Just a bit more background: the reason I was trying to use tbb in the first place was that I was following the advice in https://github.com/lmcinnes/umap/issues/707. Since these segfaults occur only when the threading layer is set to tbb, this might not be much of a priority for you. Understandably, these can be difficult to track down, so any help would be appreciated :)

rhysnewell commented 3 years ago

So, good news: I managed to fix this, although I'm not entirely sure how. All I did was create another fresh conda environment and run everything in it without setting the tbb threading layer. All of the segfaults went away, and I could run UMAP inside a ProcessPoolExecutor, significantly speeding up my pipeline! Awesome stuff.

I'm sorry I don't have a more detailed solution for whoever comes across this in the future. The only thing that worked for me was creating a fresh environment and not touching the numba threading layer.

rhysnewell commented 3 years ago

Okay, sorry to re-open this, but it turns out this is a continuing issue. It seems to happen at random on certain datasets, with no consistency. The latest error message is pointing to pynndescent again:

Fatal Python error: Segmentation fault

Current thread 0x00007f3eb2827700 (most recent call first):
  File "/home/n10853499/.conda/envs/rosella-dev/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 876 in __init__

Additionally, I have found cases where it occurs inside a Python terminal and not inside a script. Back to the drawing board.

lmcinnes commented 3 years ago

Thanks for the effort to track this down -- intermittent issues are the hardest to resolve. Hopefully a small consistent reproducer can be found.

rhysnewell commented 3 years ago

Hi @lmcinnes,

I've not come across a reliable reproducer as of yet, but I think I have found the root cause. Numba issue https://github.com/numba/numba/issues/4807 outlines some as-yet-unresolved behaviour when using cached functions inside a process pool. This fits with the testing I've managed to do, as the segfaults seem to begin happening as my conda environment "ages" with use; they can then be temporarily stopped by reinstalling pynndescent from the main branch on GitHub.

I'm not sure what the best solution is here. I realise that the way I'm using UMAP is perhaps an edge case (multiple models being calculated in parallel), but it also seems like numba caching is a little unstable and brittle. It would be great if there were a way to turn off caching for all of pynndescent's njit functions, but that might be unfeasible. Perhaps a separate release of pynndescent that removes all caching?
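
To illustrate the pattern I mean (a made-up kernel, not actual pynndescent code), it's functions compiled with both parallel=True and cache=True that seem to trigger the problem:

```python
import numpy as np
from numba import njit, prange

# illustrative only: the decorator pattern pynndescent uses, not its actual code
@njit(parallel=True, cache=True)
def row_sq_norms(data):
    out = np.empty(data.shape[0], dtype=np.float64)
    for i in prange(data.shape[0]):  # parallel loop, compiled and cached to disk
        acc = 0.0
        for j in range(data.shape[1]):
            acc += data[i, j] * data[i, j]
        out[i] = acc
    return out
```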

Cheers, Rhys

rhysnewell commented 3 years ago

Just a quick follow-up: I don't have a new reproducible example, but I believe the reproducer provided in https://github.com/lmcinnes/umap/issues/707 would suffice, since I think this is exactly the same issue. The problem definitely lies within the numba caching that occurs in the pynndescent library. I've forked pynndescent and changed all occurrences of cache=True to cache=False, and I've not come across any more segfaults. This even fixed the segfaulting on datasets that would almost always segfault.

The caching does not seem to significantly speed up loading pynndescent or umap. I'd suggest removing the caching option, as it might be the root cause of a lot of the segfault issues people have been experiencing.

Would removing the caching be something you would consider for an upcoming release?

Cheers, Rhys

lmcinnes commented 3 years ago

The caching does not speed up runtime performance, but it does speed up import time -- there were complaints that pynndescent took too long to import, and adding caching was the solution. It seems caching is an imperfect solution, however.

rhysnewell commented 3 years ago

I may be biased, but I find the import time for non-cached pynndescent perfectly acceptable :P Importing the non-cached version doesn't seem any slower than the cached version to me; maybe numba has improved its compile times since those original complaints?
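
For what it's worth, this is all I'm doing to compare, run in a fresh interpreter each time so that the numba cache state is the only variable:

```python
import time

t0 = time.perf_counter()
import pynndescent  # first import triggers numba compilation (or cache replay)
print(f"pynndescent import took {time.perf_counter() - t0:.1f}s")
```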

lmcinnes commented 3 years ago

I also had no real issues with the import time myself, but I think there are a number of use cases where shaving 5 seconds off the import time is quite significant. I'll see if I can scale back the amount of caching, because I agree that it can apparently be problematic.

stuartarchibald commented 2 years ago

It looks like this might be Numba caching related. Is the number of threads in use changing between the first compilation and the cache replay?

stuartarchibald commented 2 years ago

xref: https://github.com/numba/numba/pull/7522

rhysnewell commented 2 years ago

Hey @lmcinnes,

Stuart is correct. I've just tested it on a fresh install of pynndescent: changing the available numba threads between sessions causes segfaults. I notice there is a new pynndescent version on the way; is there any plan to remove the caching for now, or will you be keeping it? Theoretically, only the numba functions that set both parallel=True and cache=True would need to change.

jasperhyp commented 2 years ago

Has this issue been fixed? We are also running into it when setting metric='cosine', but there is no problem when setting metric='euclidean'. Not sure how to probe into this issue, though.

stuartarchibald commented 2 years ago

The specific issue to do with changing the number of threads in use between first compilation and cache replay has been fixed in https://github.com/numba/numba/pull/7625 and is in the 0.56.x release series of Numba. Updating the Numba version in your environment to 0.56.x should mitigate this specific issue.
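
For anyone stuck on an older Numba in the meantime, pinning the thread count before anything imports Numba should also avoid the mismatch between compile time and cache replay. A rough sketch (the thread count itself is illustrative):

```python
import os

# must run before numba (or anything that imports it) is loaded
os.environ["NUMBA_NUM_THREADS"] = "8"

import umap  # pulls in pynndescent/numba with a fixed thread count
```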

I guess this issue could be marked as fixed/closed were UMAP to require 0.56.x onward as a dependency, or, were UMAP's use of @jit(cache=True) only enabled if Numba 0.56.x onward is present.

edmoman commented 1 year ago

I can confirm that this bug is still present and that it is linked to the cosine metric.

It does not happen with any other metric I have tested, only cosine (manhattan and euclidean work just fine).

I am able to reproduce it consistently on Windows and on Ubuntu within WSL. However, the dataset I am using is confidential.

It is one of the weirdest bugs I have ever come across. I import my data set from a CSV file as a pandas dataframe. In the middle of the dataframe there is a column containing a date. This column is converted to a scalar prior to running UMAP.

If the column is there, UMAP is happy. If I remove it, it segfaults.

It does not matter how I remove it: beforehand from the file, excluding it at read time, or dropping it from the dataframe.
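
To be concrete, all of these variants behave the same; the file and column names below are stand-ins for the real, confidential ones:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file name

# any one of these ways of removing the date column leads to the same segfault:
df = df.drop(columns=["date_col"])                               # drop after reading
df = pd.read_csv("data.csv", usecols=lambda c: c != "date_col")  # exclude at read time
# ...or delete the column from the CSV file itself beforehand
```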

If the column is missing UMAP crashes. This is the trace:

[New Thread 0x7fff800d1640 (LWP 1086)]
[New Thread 0x7fff7f8d0640 (LWP 1087)]
[New Thread 0x7fff7f0cf640 (LWP 1088)]
[New Thread 0x7fff7e8ce640 (LWP 1089)]
[New Thread 0x7fff7e0cd640 (LWP 1090)]
[New Thread 0x7fff7d8cc640 (LWP 1091)]
[New Thread 0x7fff7d0cb640 (LWP 1092)]
[New Thread 0x7fff7c8ca640 (LWP 1093)]
[New Thread 0x7fff5ffff640 (LWP 1094)]

Thread 16 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff7e0cd640 (LWP 1090)]
_int_malloc (av=av@entry=0x7fff6c000030, bytes=bytes@entry=116) at ./malloc/malloc.c:1362
1362    ./malloc/malloc.c: No such file or directory.

edmoman commented 1 year ago

Setting the environment variable NUMBA_DISABLE_JIT to 1 prevents the segfault. However, it also makes things very slow. Would there be a less radical workaround than disabling the JIT entirely?
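
For reference, this is all I'm setting (it has to happen before anything imports numba):

```python
import os

os.environ["NUMBA_DISABLE_JIT"] = "1"  # disable JIT compilation globally

import umap  # now runs the pure-Python code paths: slow, but no segfault
```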

stuartarchibald commented 1 year ago

@edmoman thanks for raising this. Unfortunately, without a code sample to reproduce the original issue or the issue you are encountering it's hard to work out if the cause of the problem is the same. Which version of Numba are you using? If it is 0.56.x then the problem above to do with cache replay and threads used at compile time vs. run time has been fixed and what you are seeing is potentially something different.

edmoman commented 1 year ago

I understand this is hard to debug. Unfortunately, I cannot share my dataset. It may or may not be the same issue, but all the symptoms are the same: for instance, it only happens with the cosine metric. What I find extremely weird is that I can remove any other column from my dataset with no problem, but if I touch that particular one it segfaults. For the time being I am going to switch to a different metric, even though cosine seems to provide the best results.

edmoman commented 1 year ago

Ah, correlation also segfaults. But euclidean, manhattan and canberra are fine.

stuartarchibald commented 1 year ago

@edmoman it's understandable and a common issue that a dataset cannot be shared. In such a case a mock dataset may be employed.
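
For example, something generated to match the real data's shape and dtype would let others attempt a reproduction; everything below is a placeholder:

```python
import numpy as np
import umap

rng = np.random.default_rng(0)
mock = rng.random((5000, 40)).astype(np.float32)  # match the real data's shape/dtype

umap.UMAP(metric="cosine").fit_transform(mock)
```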

Could you please confirm the Numba version you are using? Thanks.

edmoman commented 1 year ago

I am using the latest pip versions of all the packages:

Name: numba
Version: 0.56.4

stuartarchibald commented 1 year ago

@edmoman thanks for confirming. As this is on the Windows operating system, I wonder if this Numba issue may be involved https://github.com/numba/numba/issues/8705? Perhaps set the environment variable to disable the use of SVML as suggested here https://github.com/numba/numba/issues/8705#issuecomment-1380124101 and see if that helps.

edmoman commented 1 year ago

Interesting, thank you. Disabling SVML does not solve the issue. But I am going to try the code on a Linux server and see what happens.

edmoman commented 1 year ago

No, the error is not Windows or WSL-specific. It can be reproduced in native Linux.

stuartarchibald commented 1 year ago

@edmoman thanks for checking. On that Linux system (which I presume didn't have a Numba cache built already), did it run OK on the first attempt and then fail on subsequent attempts, or did it fail on the first attempt? I'm asking to see if the issue is cache related.

edmoman commented 1 year ago

It failed on the first and subsequent attempts. It also fails with correlation and cosine, but not with other metrics. On that machine numba and UMAP were not even installed, so there was no cache.

stuartarchibald commented 1 year ago

> It failed on the first and subsequent attempts. It also fails with correlation and cosine, but not with other metrics. On that machine numba and UMAP were not even installed, so there was no cache.

Ok, thanks for checking. I think this means you are seeing a problem that is likely different from the one reported by the OP (which was Numba doing cache replay with a changing number of threads present). So as to separate concerns, perhaps open a new issue with the problem you are experiencing, and some guidance can be offered there?

edmoman commented 1 year ago

Ok, I will do that. Thanks.