rhysnewell opened this issue 3 years ago
Just a bit more background: the reason I was trying to use tbb in the first place was that I was following the advice in this issue: https://github.com/lmcinnes/umap/issues/707. Since these segfaults only occur when the threading layer is set to tbb, this might not be much of a priority for you. Understandably, these can be difficult to track down, so any help would be appreciated :)
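For context, the setup I'm describing looks roughly like the sketch below. The data shapes, worker count and parameters are made up for illustration; this is not my actual pipeline.

```python
# Made-up data shapes, worker count and parameters, purely to illustrate the
# setup: several UMAP models fitted in parallel via ProcessPoolExecutor with
# numba's threading layer forced to tbb.
import os
os.environ["NUMBA_THREADING_LAYER"] = "tbb"  # set before numba/umap are imported

from concurrent.futures import ProcessPoolExecutor

import numpy as np
import umap


def embed(seed):
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(1000, 50))
    return umap.UMAP(random_state=seed).fit_transform(data)


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        embeddings = list(pool.map(embed, range(4)))
    print([e.shape for e in embeddings])
```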
So, good news: I managed to fix this, although I'm not entirely sure how. All I did was create another fresh conda environment and run everything in that environment without setting the tbb threading layer. All of the segfaults went away and I could run UMAP inside of a ProcessPoolExecutor, significantly speeding up my pipeline! Awesome stuff.
I'm sorry I don't have a more detailed solution for whoever comes across this in future. The only thing that worked for me was creating a fresh environment and not touching the numba threading layer.
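For anyone else in the same spot, this is how I've been checking which threading layer numba actually ends up with in the new environment (numba only reports it after a parallel function has run):

```python
# Check which threading layer numba actually selected in the fresh
# environment; numba only reports this after a parallel function has run.
import numba
import numpy as np


@numba.njit(parallel=True)
def parallel_sum(x):
    acc = 0.0
    for i in numba.prange(x.shape[0]):
        acc += x[i]
    return acc


parallel_sum(np.arange(10.0))   # trigger compilation and a parallel run
print(numba.threading_layer())  # e.g. 'tbb', 'omp' or 'workqueue'
```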
Okay, sorry to re-open this, but it turns out this is a continuing issue. It seems to happen randomly on certain datasets with no consistency. The latest error message is pointing to pynndescent again:
```
Python error: Segmentation fault

Current thread 0x00007f3eb2827700 (most recent call first):
  File "/home/n10853499/.conda/envs/rosella-dev/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 876 in __init__
```
Additionally, I have found cases where it occurs inside of python terminals and not inside of a script. Back to the drawing board.
Thanks for the effort to track this down -- intermittent issues are the hardest to resolve. Hopefully a small consistent reproducer can be found.
Hi @lmcinnes,
I've not come across a reliable reproducer as of yet, but I think I have found the root cause. Numba issue https://github.com/numba/numba/issues/4807 outlines some as-yet-unsolved behaviour when using cached functions inside of a ProcessPool. This fits with the testing I've managed to do: the segfaults seem to begin happening as my conda environment "ages" with use, and they can then be temporarily stopped by reinstalling pynndescent from the main branch on GitHub.
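For anyone hitting the same thing, deleting numba's on-disk cache files seems to have the same effect as reinstalling the package. Something along these lines should do it, assuming numba's default behaviour of writing the cache next to the module sources:

```python
# Deleting numba's on-disk cache for pynndescent by hand. By default numba
# writes .nbi/.nbc files into __pycache__ next to the cached module, so this
# should be roughly equivalent to a reinstall as far as the cache goes.
from pathlib import Path

import pynndescent

pkg_dir = Path(pynndescent.__file__).parent
for cache_file in pkg_dir.glob("__pycache__/*.nb[ic]"):
    print("removing", cache_file)
    cache_file.unlink()
```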
I'm not sure what the best solution here is. I do realise that the way I'm using UMAP is perhaps an edge case (multiple models being calculated in parallel), but it also seems like numba caching is a little unstable and brittle. It would be great to have a way to turn off caching for all of pynndescent's njit functions, but that might be unfeasible. Perhaps a separate release of pynndescent that removes all caching?
Cheers, Rhys
Just a quick follow up: I don't have a new reproducible example, but I believe the reproducer provided in https://github.com/lmcinnes/umap/issues/707 would suffice, since I think this is exactly the same issue. The problem definitely lies within the numba caching that occurs within the pynndescent library. I've forked pynndescent and changed all occurrences of cache=True to cache=False, and I've not come across any more segfaults. This even fixed the segfaulting issue on datasets that would almost consistently segfault.
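To be clear, the change was entirely mechanical: every jitted function kept its other options and only the caching flag was flipped. A toy example of the pattern (not actual pynndescent code):

```python
# Toy example of the pattern that was changed throughout the fork; this is
# not actual pynndescent code, just the shape of the edit.
import numba
import numpy as np


@numba.njit(parallel=True, cache=False)  # was cache=True upstream
def row_norms(x):
    out = np.empty(x.shape[0])
    for i in numba.prange(x.shape[0]):
        out[i] = np.sqrt(np.sum(x[i] * x[i]))
    return out


row_norms(np.random.default_rng(0).normal(size=(100, 8)))
```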
The caching does not seem to significantly speed up loading the pynndescent or umap libraries. I'd suggest maybe removing the caching option, as it might be the root cause of a lot of the segfault issues people have been experiencing.
Would removing the caching be something you would consider for an upcoming release?
Cheers, Rhys
The caching does not speed up the runtime performance, but it does speed up the import time performance -- there were complaints that pynndescent took too long to import, so adding caching was the solution. It seems like caching is an imperfect solution, however.
I may be biased, but I find the import time for non-cached pynndescent to be perfectly acceptable :P Importing the non-cached version of pynndescent doesn't seem any slower than the cached version in my experience; maybe numba has improved its compile times since those original complaints?
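For what it's worth, I timed it with something as simple as the snippet below, run as a fresh script so the import is cold. The numbers obviously vary by machine and by whether the numba cache is warm:

```python
# Cold-import timing, run as a standalone script (a warm interpreter or a
# warm numba cache will give very different numbers).
import time

start = time.perf_counter()
import pynndescent  # noqa: E402  (deliberately imported after the timer starts)
print(f"import pynndescent took {time.perf_counter() - start:.1f} s")
```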
I also had no real issues with the import time myself, but I think there are a number of use cases where shaving 5 seconds off the import time is quite significant. I'll see if I can scale back the amount of caching, because I agree that it can apparently be problematic.
It looks like this might be Numba caching related. Is the number of threads in use changing between the first compilation and the cache replay?
Hey @lmcinnes,
Stuart is correct. I've just tested it on a fresh install of pynndescent: changing the available numba threads between sessions causes segfaults. I notice that there is a new pynndescent version on the way; is there any plan to remove the caching for now, or will you be keeping it? Theoretically it should only be the numba functions that set both parallel=True and cache=True that would need to change.
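For reference, the pattern that seems to trigger it looks something like the sketch below, run twice with different NUMBA_NUM_THREADS values. This only illustrates the combination Stuart described; it is not a guaranteed reproducer.

```python
# repro_sketch.py -- run twice with different thread counts, e.g.
#   NUMBA_NUM_THREADS=8 python repro_sketch.py   # compiles and writes the cache
#   NUMBA_NUM_THREADS=2 python repro_sketch.py   # replays the cached binary
# On numba versions without the fix, the second run is the one that could
# crash. This only illustrates the parallel=True + cache=True combination;
# it is not a guaranteed reproducer.
import numba
import numpy as np


@numba.njit(parallel=True, cache=True)
def parallel_sum(x):
    acc = 0.0
    for i in numba.prange(x.shape[0]):
        acc += x[i]
    return acc


if __name__ == "__main__":
    print(parallel_sum(np.arange(1_000_000, dtype=np.float64)))
```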
Has this issue been fixed? We are also running into this issue when setting metric='cosine', but there is no problem when setting metric='euclidean'. Not sure how to probe into this issue, though.
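For concreteness, the two calls we are comparing look like this. The random data here is just for illustration; in our case the crash only shows up on our real dataset.

```python
import numpy as np
import umap

# Random data purely to show the two calls side by side; the real dataset
# that triggers the crash cannot be shared.
data = np.random.default_rng(0).normal(size=(2000, 100)).astype(np.float32)

umap.UMAP(metric="euclidean").fit_transform(data)  # fine for us
umap.UMAP(metric="cosine").fit_transform(data)     # this is the call that segfaults on our data
```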
The specific issue to do with changing the number of threads in use between first compilation and cache replay has been fixed in https://github.com/numba/numba/pull/7625 and is in the 0.56.x release series of Numba. Updating the Numba version in your environment to 0.56.x should mitigate this specific issue.
I guess this issue could be marked as fixed/closed were UMAP to require 0.56.x onward as a dependency, or were UMAP's use of @jit(cache=True) only enabled when Numba 0.56.x onward is present.
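A sketch of what the second option could look like in practice; the helper and flag names here are invented for illustration and are not part of UMAP or pynndescent:

```python
import numba
import numpy as np
from packaging.version import Version

# Assumption: 0.56.0 is the first release containing the cache-replay fix.
NUMBA_HAS_CACHE_FIX = Version(numba.__version__) >= Version("0.56.0")


def maybe_cached_njit(**options):
    """njit wrapper that only turns on cache=True when numba is new enough."""
    options.setdefault("cache", NUMBA_HAS_CACHE_FIX)
    return numba.njit(**options)


@maybe_cached_njit(parallel=True)
def double(x):
    return x * 2.0


double(np.arange(4.0))  # compiles (and caches, if enabled) on first call
```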
I can confirm that this bug is still present and that it is linked to the cosine metric.
It does not happen with any other metric I have tested, only cosine (manhattan and euclidean work just fine).
I am able to reproduce it consistently in Windows and Ubuntu within WSL. However, the database I am using is confidential.
It is one of the weirdest bugs I have ever come across. I import my data set from a CSV file as a pandas dataframe. In the middle of the dataframe there is a column containing a date. This column is converted to a scalar prior to running UMAP.
If the column is there, UMAP is happy. If I remove it, it segfaults.
It does not matter how I remove it: beforehand from the file, excluding it at reading time, or dropping it from the dataframe.
If the column is missing UMAP crashes. This is the trace:
```
[New Thread 0x7fff800d1640 (LWP 1086)]
[New Thread 0x7fff7f8d0640 (LWP 1087)]
[New Thread 0x7fff7f0cf640 (LWP 1088)]
[New Thread 0x7fff7e8ce640 (LWP 1089)]
[New Thread 0x7fff7e0cd640 (LWP 1090)]
[New Thread 0x7fff7d8cc640 (LWP 1091)]
[New Thread 0x7fff7d0cb640 (LWP 1092)]
[New Thread 0x7fff7c8ca640 (LWP 1093)]
[New Thread 0x7fff5ffff640 (LWP 1094)]

Thread 16 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff7e0cd640 (LWP 1090)]
_int_malloc (av=av@entry=0x7fff6c000030, bytes=bytes@entry=116) at ./malloc/malloc.c:1362
1362    ./malloc/malloc.c: No such file or directory.
```
Setting the environment variable NUMBA_DISABLE_JIT to 1 prevents the segfault. However, it also makes things very slow. Would there be a less radical workaround than disabling JIT entirely?
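For completeness, this is the shape of the workaround I'm using for now; the variable has to be set before anything imports numba:

```python
# Current workaround: disable numba's JIT entirely before anything imports
# numba, so all jitted code runs as plain Python (correct but very slow).
import os
os.environ["NUMBA_DISABLE_JIT"] = "1"

import numpy as np
import umap  # noqa: E402

umap.UMAP(metric="cosine").fit_transform(
    np.random.default_rng(0).normal(size=(500, 20))
)
```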
@edmoman thanks for raising this. Unfortunately, without a code sample to reproduce the original issue or the issue you are encountering it's hard to work out if the cause of the problem is the same. Which version of Numba are you using? If it is 0.56.x then the problem above to do with cache replay and threads used at compile time vs. run time has been fixed and what you are seeing is potentially something different.
I understand this is hard to debug. Unfortunately I cannot share my dataset. It may or may not be the same issue. But all the symptoms are the same. For instance, it only happens with the cosine metric. What I find extremely weird is that, in my dataset I can remove any other column, no problem, but if I touch that particular one it segfaults. For the time being I am going to switch to a different metric, even though cosine seems to provide the best results.
Ah, correlation also segfaults. But euclidean, manhattan and canberra are fine.
@edmoman it's understandable and a common issue that a dataset cannot be shared. In such a case a mock dataset may be employed.
Could you please confirm the Numba version you are using? Thanks.
I am using the latest pip versions of all the packages:
```
Name: numba
Version: 0.56.4
```
@edmoman thanks for confirming. As this is on the Windows operating system, I wonder if this Numba issue may be involved https://github.com/numba/numba/issues/8705? Perhaps set the environment variable to disable the use of SVML as suggested here https://github.com/numba/numba/issues/8705#issuecomment-1380124101 and see if that helps.
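For clarity, the suggested experiment amounts to something like this; the environment variable must be set before numba is first imported:

```python
# Suggested experiment: disable numba's use of Intel SVML, then re-run the
# failing UMAP call. The variable is read when numba is first imported.
import os
os.environ["NUMBA_DISABLE_INTEL_SVML"] = "1"

import umap  # noqa: E402  (imported after the env var so the setting is seen)
# ...then re-run the fit_transform call that segfaults.
```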
Interesting, thank you. Disabling SVML does not solve the issue. But I am going to try the code on a Linux server and see what happens.
No, the error is not Windows or WSL-specific. It can be reproduced in native Linux.
@edmoman thanks for checking. On that Linux system (which I presume didn't have a Numba cache built already), did it run OK on the first attempt and then fail on subsequent attempts, or fail on the first attempt? I'm asking to see if the issue is cache related.
It failed on the first and subsequent attempts. It also fails with correlation and cosine, but not with other metrics. On that machine numba and UMAP were not even installed, so there was no cache.
Ok, thanks for checking. I think this means that you are seeing a problem that is likely different to that which was reported by the OP (which was Numba doing cache replay with changing number of threads present). So as to separate concerns, perhaps open a new issue with the problem you are experiencing and some guidance can be offered there?
Ok, I will do that. Thanks.
This happens when running UMAP inside of Dagster, even with the euclidean metric. Disabling JIT in numba works, but it slows things down to the point that it's unusable. Tried on 0.58.x and 0.60.
Hi,
I thought it might be best to just create a new issue for this, as it seems a little different from https://github.com/lmcinnes/umap/issues/421, which I originally commented this error on.
I've started getting a segfault when trying to play around with the numba threading layers (setting it to tbb) in order to use UMAP with a ProcessPoolExecutor. It happened very suddenly, and now consistently happens whenever I try to run UMAP inside a script, regardless of the threading layer or whether it is running inside a process pool. The weird thing is that the segfault does not occur if I just run UMAP inside of a Python terminal; it only occurs when I run it via the command line through a script.
The error looks like this on one set of data:
And there is a secondary error on another set of data that looks like this:
Downgrading numba doesn't help with this issue, nor does downgrading pynndescent or using the master branch from the pynndescent GitHub. Additionally, this is happening in a fresh conda environment, so something pretty odd seems to be going on.
and my conda environment looks like this: