sadit / SimilaritySearch.jl

A nearest neighbor search library with exact and approximate algorithms
https://sadit.github.io/SimilaritySearch.jl/
MIT License

Segfault observed #31

Open nlw0 opened 1 year ago

nlw0 commented 1 year ago

Got a segfault today, and it seems related to this package. Does anyone know how this could even be possible? Are there any external binary dependencies in this package?

[52504] signal (11.1): Segmentation fault
in expression starting at none:1
size at ./array.jl:150 [inlined]
axes at ./abstractarray.jl:98 [inlined]
to_indices at ./indices.jl:344 [inlined]
view at ./subarray.jl:176 [inlined]
getindex at /home/user/.julia/packages/SimilaritySearch/g8sVc/src/db/matrixdatabase.jl:22 [inlined]
solve_single_query at /home/user/.julia/packages/SimilaritySearch/g8sVc/src/SimilaritySearch.jl:166
macro expansion at /home/user/.julia/packages/SimilaritySearch/g8sVc/src/SimilaritySearch.jl:157 [inlined]
#116 at /home/user/.julia/packages/Polyester/qUpZt/src/closure.jl:279 [inlined]
BatchClosure at /home/user/.julia/packages/Polyester/qUpZt/src/batch.jl:12
unknown function (ip: 0x7f012d01e37f)
_call at /home/user/.julia/packages/ThreadingUtilities/BEAGi/src/threadtasks.jl:11 [inlined]
ThreadTask at /home/user/.julia/packages/ThreadingUtilities/BEAGi/src/threadtasks.jl:29
unknown function (ip: 0x7f012d02c4cf)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
start_task at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 104882200 (Pool: 104491267; Big: 390933); GC: 188
/home/user/.local/bin/julia: line 2: 52504 Segmentation fault      (core dumped) /home/user/src/julia-1.9.0/bin/julia --threads=auto "$@"
sadit commented 1 year ago

No, but it has many @inbounds annotations, i.e., if you access a database out of bounds, it will fail. Try running Julia with --check-bounds=yes to surface the error as a BoundsError instead of a segmentation fault. On the other hand, we use Polyester.jl and StrideArraysCore.jl, which rely on several low-level tricks to achieve high performance. In particular, the @batch macro converts arrays into PtrArrays, and the latter uses tricks that can produce segmentation faults.

If you think the failure is related to StrideArraysCore or Polyester, you can define StrideArraysCore.boundscheck() = true (inside the running Julia process), as documented in https://github.com/JuliaSIMD/StrideArraysCore.jl
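For example, in the running session (a minimal snippet applying that setting):

using StrideArraysCore

StrideArraysCore.boundscheck() = true  # bounds-check PtrArray accesses from now on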

Based on your trace, I guess you are accessing object identifiers beyond the bounds of the dataset... but it is hard to know without a minimal working example.

nlw0 commented 1 year ago

Thanks. From what I have seen now, it does look like an out-of-bounds access, but I'm merely running

searchbatch(ExhaustiveSearch(BinaryHammingDistance(), MatrixDatabase(db)), MatrixDatabase(da), 2)

and later I use the returned indices. It seems the out-of-bounds access would happen at that moment, which would indicate invalid indices in the answer. On the other hand, that should only cause a regular out-of-bounds error from accessing a plain Julia array; there should be no relation to Polyester or the MatrixDatabase object anymore. I tried creating a MWE but it runs fine; it only happens now and then in my application.
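For reference, the overall shape of the workflow is roughly the following sketch (hypothetical random bit-packed codes stand in for my real data):

using SimilaritySearch

db = rand(UInt64, 4, 100_000)   # reference codes, one column per object
da = rand(UInt64, 4, 10_000)    # query codes

knns, dists = searchbatch(ExhaustiveSearch(BinaryHammingDistance(), MatrixDatabase(db)),
                          MatrixDatabase(da), 2)

neighbors = db[:, knns[1, :]]   # later: index the original data with the returned identifiers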

sadit commented 1 year ago

Since the failure is outside of the Polyester domain, you can find the failing index by disabling the effect of the @inbounds macro (run Julia with --check-bounds=yes).

You can also check the output ranges with the extrema function on the knn indices returned by searchbatch.

If you see an unexpected 0, you may have run into a documented, but perhaps questionable, implementation decision: queries that fetch fewer than $k$ neighbors pad the end of their column with zeros. In other words, searchbatch pads the column with zeros instead of failing or filling it with arbitrary results.

https://sadit.github.io/SimilaritySearch.jl/dev/api/#SimilaritySearch.searchbatch
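Concretely, something along these lines (index, queries, and k are your own objects):

knns, dists = searchbatch(index, queries, k)
minid, maxid = extrema(knns)
# minid == 0 means some column was padded because fewer than k neighbors were found;
# maxid should never exceed the length of the indexed dataset.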

This should not happen in most workloads if you use ExhaustiveSearch, only if you ask for $k \geq n$, where $n$ is the length of the dataset. However, SearchGraph has more cases where it fails to retrieve $k$ elements. For instance, this can happen when $k$ is large and the query is very easy to solve. Another possible cause (again with SearchGraph) is when a discrete distance function doesn't allow routing and the algorithm cannot find suitable candidates to explore the underlying graph.

There are some possible workarounds you can apply, depending on your application.

You can see a working example with binary Hamming distance at https://github.com/sisap-challenges/sisap23-laion-challenge-julia-example/blob/c89913792c6fe8aaa27cc3766a42ba34dccaf387/src/run.jl#L100

nlw0 commented 1 year ago

I'm investigating this again. Unfortunately it's not easy to reproduce; I imagine it might be related to threads. I upgraded to the latest version but it still happens. Here's the latest stack trace I could capture:

[79918] signal (11.1): Segmentation fault
in expression starting at none:1
size at ./array.jl:150 [inlined]
axes at ./abstractarray.jl:98 [inlined]
to_indices at ./indices.jl:344 [inlined]
view at ./subarray.jl:176 [inlined]
getindex at /home/user/.julia/packages/SimilaritySearch/h5gkj/src/db/matrixdatabase.jl:22 [inlined]
solve_single_query at /home/user/.julia/packages/SimilaritySearch/h5gkj/src/SimilaritySearch.jl:185
macro expansion at /home/user/.julia/packages/SimilaritySearch/h5gkj/src/SimilaritySearch.jl:176 [inlined]
#124 at /home/user/.julia/packages/Polyester/qUpZt/src/closure.jl:279 [inlined]
BatchClosure at /home/user/.julia/packages/Polyester/qUpZt/src/batch.jl:12
unknown function (ip: 0x7f4f6534d2df)
_call at /home/user/.julia/packages/ThreadingUtilities/3z3g0/src/threadtasks.jl:11 [inlined]
ThreadTask at /home/user/.julia/packages/ThreadingUtilities/3z3g0/src/threadtasks.jl:29
unknown function (ip: 0x7f4f6536d5df)
_jl_invoke at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
start_task at /cache/build/default-amdci4-0/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 360068756 (Pool: 357671085; Big: 2397671); GC: 1486
/home/user/.local/bin/julia: line 2: 79918 Segmentation fault      (core dumped) /home/user/src/julia-1.9.0/bin/julia --threads=auto "$@"

My impression is that some value in the returned matrix does not get initialized, so I get junk values; I've even seen negative values. There are two situations: either searchbatch works but returns an invalid output containing an index that is some weird number (probably read from uninitialized memory), or the invalid index shows up inside SimilaritySearch itself. So sometimes I get an error in my code and sometimes in SimilaritySearch, more specifically at https://github.com/sadit/SimilaritySearch.jl/blob/aebe1ff6751df1c089ddfd8acab74c6ba86691e7/src/db/matrixdatabase.jl#L22 , called from https://github.com/sadit/SimilaritySearch.jl/blob/aebe1ff6751df1c089ddfd8acab74c6ba86691e7/src/SimilaritySearch.jl#L185 and https://github.com/sadit/SimilaritySearch.jl/blob/aebe1ff6751df1c089ddfd8acab74c6ba86691e7/src/SimilaritySearch.jl#L176

All I can imagine is that @batch minbatch=minbatch per=thread for i in eachindex(Q) could somehow be producing invalid values for i, but I'm not familiar enough with Polyester to determine whether that is possible.
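A check along these lines should at least catch a bad batch before the indices are used (db and da as above; repeated runs because the failure is intermittent):

for trial in 1:100
    knns, _ = searchbatch(ExhaustiveSearch(BinaryHammingDistance(), MatrixDatabase(db)),
                          MatrixDatabase(da), 2)
    lo, hi = extrema(knns)
    if lo < 1 || hi > size(db, 2)
        error("invalid neighbor identifier in trial $trial: extrema = ($lo, $hi)")
    end
end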

chriselrod commented 1 year ago

Can you try starting julia with --check-bounds=yes?

exAClior commented 2 weeks ago

I ran into a similar problem. I am trying to use @threads to run multiple approximate nearest neighbor search tasks on distinct databases; within each database, @batch is used to speed up the work. I observed segmentation faults in this setup. Using @batch instead of @threads for the outer loop, the segmentation fault goes away. I am curious why doing the latter resolves the issue.
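The pattern looks roughly like this (simplified; indexes, queries, and k are hypothetical names, and I assume searchbatch returns Int32 identifier matrices):

using SimilaritySearch, Polyester

allknns = Vector{Matrix{Int32}}(undef, length(indexes))

@batch for i in eachindex(indexes)          # with Threads.@threads here I got the segfault
    knns, dists = searchbatch(indexes[i], queries[i], k)   # uses @batch internally
    allknns[i] = knns
end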