rapidsai / raft

RAFT contains fundamental, widely used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for writing high-performance applications more easily.
https://docs.rapids.ai/api/raft/stable/
Apache License 2.0

[FEA] Use RBC to compute Epsilon neighborhoods #517

Open cjnolet opened 2 years ago

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

mfoerste4 commented 10 months ago

@cjnolet, I would like to revive this issue to start a discussion on how the C++ API should be designed. I already built a prototype on top of the rbc k-nn code that allows neighborhood construction with an eps-radius, but there are open design questions:

  1. Dense/CSR: The current eps-nn API, epsUnexpL2SqNeighborhood, uses brute force and produces a dense adjacency matrix. Applications with sparse connectivity patterns would benefit from a sparse adjacency matrix representation, so CSR would be preferable for the RBC API (and maybe also for the brute-force approach).

  2. Memory management (CSR): There is large uncertainty in the memory requirements of the resulting adjacency matrix. If the user chooses a large eps, we end up with a dense matrix that might not fit into memory. Accurately predicting a maximum size here would also allow optimizing query batch sizes. The options I see:

     a) Always require two passes: the first counts non-zeros, the second persists them.
     b) Pre-allocate memory and pass the maximum allowed size; if the result does not fit, return an error status.
     c) Pre-allocate memory and pass a k parameter that limits the non-zeros per resulting row, keeping at most the first k neighbors found (not necessarily the nearest ones).

     With a) being very costly w.r.t. distance computations and b) being a design pattern not used in RAFT (to my knowledge), I suppose c) is the best candidate here, although there might be other formulations for limiting the overall number of connections collected in the result. A standalone sketch of the semantics of option c) follows below.
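To make option c) concrete, here is a minimal standalone CPU reference of its semantics; all names (eps_nn_csr_ref, max_k, ia/ja) are illustrative placeholders, not a proposed RAFT signature:

```cpp
// CPU reference for option c): collect at most max_k neighbors within eps per
// query row (the first ones found, not necessarily the nearest), and record
// per row whether the cap was hit. eps is treated as a radius; distances are
// compared in squared form, as in epsUnexpL2SqNeighborhood.
#include <cstddef>
#include <vector>

struct EpsNnCsr {
  std::vector<std::size_t> ia;  // row offsets, size n_queries + 1
  std::vector<std::size_t> ja;  // column indices, size ia.back()
  std::vector<bool> truncated;  // true if more than max_k neighbors exist
};

EpsNnCsr eps_nn_csr_ref(const std::vector<std::vector<float>>& index,
                        const std::vector<std::vector<float>>& queries,
                        float eps, std::size_t max_k) {
  EpsNnCsr out;
  out.ia.push_back(0);
  for (const auto& q : queries) {
    std::size_t count = 0;
    bool trunc = false;
    for (std::size_t j = 0; j < index.size(); ++j) {
      float d2 = 0.f;
      for (std::size_t d = 0; d < q.size(); ++d) {
        float diff = q[d] - index[j][d];
        d2 += diff * diff;
      }
      if (d2 > eps * eps) continue;
      if (count < max_k) { out.ja.push_back(j); ++count; }
      else { trunc = true; break; }  // cap reached, stop scanning this row
    }
    out.ia.push_back(out.ja.size());
    out.truncated.push_back(trunc);
  }
  return out;
}
```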

This is also related to #1984 which covers exposure in pylibraft for these APIs.

CC @tfeher

cjnolet commented 10 months ago

Hey @mfoerste4, RBC functions similarly to IVF methods, but with the added benefit that it knows how many probes need to be selected in order to find the exact nearest neighbors. The other (more subtle) benefit to the algorithm is that, since we know the closest IVF clusters we need to scan through, we can also issue a warning or raise an appropriate error up front.

I don't see your first item above as being prohibitive to constructing a CSR: the adjacency list / mask is boolean, so a naive implementation takes up 1/8th the space of a single-precision pairwise matrix, and we can also build the adjacency matrix in batches if memory becomes an issue. Another option would be to use a bitmap instead of a boolean matrix (this requires atomics, but each value only needs to be written once).
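A minimal CUDA sketch of the bitmap idea, assuming a row-major bitmap with 32 adjacency bits packed per word (this is just an illustration of the storage scheme, not RAFT's implementation):

```cpp
#include <cstddef>

// Sets the adjacency bit for (query_id, neighbor_id). One bit per pair means
// 1/32 the footprint of a single-precision entry; atomicOr makes concurrent
// updates to the same word safe, and each bit is written at most once.
__device__ inline void set_adj_bit(unsigned int* bitmap,
                                   std::size_t words_per_row,  // ceil(n_index / 32)
                                   std::size_t query_id,
                                   std::size_t neighbor_id) {
  std::size_t word = query_id * words_per_row + (neighbor_id >> 5);
  unsigned int mask = 1u << (neighbor_id & 31);
  atomicOr(&bitmap[word], mask);
}
```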

The current epsilon neighbors primitive constructs a boolean mask and a vertex degree array, which can essentially be used to construct the indptr and column arrays of the CSR. Only if a user needs the distances would a second pass be required, and in that case I don't think we would want to do a second pass with the RBC algorithm; instead, we would want to expose (and use) a masked matrix multiplication primitive like cuSPARSE's SDDMM to compute the final distances. This detail is important: while in DBSCAN we want the distances, not all users do. The good news is that we have primitives optimized for computing selective distances (at least for dot-product-based distances), so we don't need to run a full GEMM to get them.
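For the mask-to-CSR step, a minimal sketch assuming the vertex degree output is available as a device array vd (Thrust used for brevity): a prefix sum over the per-row degrees yields the CSR indptr, after which a second kernel can scatter column indices into the row segments.

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// indptr[0] = 0 and indptr[i] = vd[0] + ... + vd[i-1]; indptr[n_rows] is the
// total number of non-zeros, i.e. the size to allocate for the column array.
thrust::device_vector<int> degrees_to_indptr(const thrust::device_vector<int>& vd) {
  thrust::device_vector<int> indptr(vd.size() + 1, 0);
  thrust::inclusive_scan(vd.begin(), vd.end(), indptr.begin() + 1);
  return indptr;
}
```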

If users set their epsilon poorly and end up with a full pairwise distance matrix, we can let them know up front by testing how many RBC clusters are within eps and the total number of points belonging to those closest balls. We could even go a step further and use the variance of the balls to make the heuristic more accurate with confidence ranges. For example, this would allow us to issue a message like "WARNING: The requested eps will produce an adjacency list that is XXX dense. This is going to put all points into the same cluster." The benefit here is that we can implement this message in DBSCAN itself, based solely on the adjacency matrix.
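A rough host-side sketch of that up-front check, assuming the RBC index exposes per-ball statistics (centroid_dists for one query, ball_radii, and ball_sizes are hypothetical names): by the triangle inequality, a point in ball b can lie within eps of the query only if the distance to its centroid is at most eps + radius_b, so summing the sizes of the balls that pass this test upper-bounds the neighborhood size.

```cpp
#include <cstddef>
#include <vector>

// Upper-bounds the number of eps-neighbors a single query can have, using only
// ball-level statistics. If bound / n_index approaches 1, warn the user that
// the adjacency matrix will be nearly dense.
std::size_t upper_bound_neighbors(const std::vector<float>& centroid_dists,
                                  const std::vector<float>& ball_radii,
                                  const std::vector<std::size_t>& ball_sizes,
                                  float eps) {
  std::size_t bound = 0;
  for (std::size_t b = 0; b < centroid_dists.size(); ++b) {
    if (centroid_dists[b] <= eps + ball_radii[b]) bound += ball_sizes[b];
  }
  return bound;
}
```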

BTW, for further proof that not all users care about the distances of an epsilon neighborhood, https://github.com/rapidsai/raft/issues/1984 is a direct request from a user who only wants the neighbors (i.e. the adjacency matrix) returned and not the actual distances.

mfoerste4 commented 10 months ago

Hi @cjnolet, thanks for your detailed comments.

> I don't see your first item above as being prohibitive to constructing a CSR: the adjacency list / mask is boolean, so a naive implementation takes up 1/8th the space of a single-precision pairwise matrix, and we can also build the adjacency matrix in batches if memory becomes an issue.

The driving use case for my prototype is a set of experiments with DBSCAN, where the resulting adjacency matrices are very sparse (~0.1% non-zeros). Once I moved to RBC, having an intermediate boolean mask, as is implemented right now for the brute-force approach, turned out to be a bottleneck due to the bandwidth needed for subsequent matrix ops. I went for CSR column indices instead (there was no need to explicitly store the actual values; I forgot to mention that before), which is already beneficial below 1/4 density, since a 32-bit column index costs four bytes versus one byte per boolean mask entry.

In addition, a reduction in matrix size (per row) can directly be used to increase query batch sizes, which is why it would be beneficial to have an option in the API to limit the resulting connections (e.g. by an additional k) and to retrieve some kind of information on whether that limit was reached. This could be a good-enough result for many applications, but when the algorithm requires an exact eps-nn (like DBSCAN), one can still decrease the batch size and retry based on that information, as sketched below.
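A sketch of that retry pattern against a hypothetical batched eps-nn call (eps_nn_batch is a placeholder, assumed to derive its per-row cap from a fixed memory budget divided by the batch size and to report whether any row hit the cap):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical primitive: processes queries [start, start + n), capping the
// non-zeros per row at (memory budget / n); returns true if any row hit the
// cap, i.e. the result is incomplete for that row.
bool eps_nn_batch(std::size_t start, std::size_t n, float eps);

// Exact consumers like DBSCAN shrink the batch (raising the per-row cap) and
// retry until no row overflows.
bool run_batched_eps_nn(std::size_t n_queries, std::size_t batch, float eps) {
  for (std::size_t start = 0; start < n_queries;) {
    std::size_t cur = std::min(batch, n_queries - start);
    if (eps_nn_batch(start, cur, eps)) {
      if (cur == 1) return false;  // a single row exceeds the whole budget
      batch = std::max<std::size_t>(1, batch / 2);  // retry range, smaller batch
      continue;
    }
    start += cur;
  }
  return true;
}
```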

Without the intermediate dense representation I need two passes to fill the CSR: one to count the columns for each row and a second to fill the global column_indices. My current prototype uses some additional scratch space to store a fixed number of columns per row in dense form, which is copied into the resulting column indices afterwards. In case that space is too small, the code needs to recompute distances in the second pass. I guess COO format with global atomics could also be an option here.

> I don't think we would want to do a second pass with the RBC algorithm; instead, we would want to expose (and use) a masked matrix multiplication primitive like cuSPARSE's SDDMM to compute the final distances.

While we don't need the actual distances in DBSCAN, it is still good that you point out that requirement. Would it be necessary to have a dense mask here, or would a CSR-like mask be sufficient? There is also still the option of providing two separate APIs for dense and sparse.

> If users set their epsilon poorly and end up with a full pairwise distance matrix, we can let them know up front by testing how many RBC clusters are within eps and the total number of points belonging to those closest balls.

IIRC, the distances from each query point to all RBC cluster representatives are required as a preparatory step before we can prune out any random ball clusters via the triangle inequality. As this would be done as part of the epsUnexpL2SqNeighborhood query, I don't understand how it could give the application information up front. Are you thinking about splitting the API into two subsequent calls?

> We could even go a step further and use the variance of the balls to make the heuristic more accurate with confidence ranges.

Yes, we could provide some query-independent statistics on the RBC index that would help identify unreasonable eps values.

mfoerste4 commented 10 months ago

@cjnolet, please have a look at the proposed API in #2028. It supports both dense and CSR output. The CSR version can be run in an exact and deterministic 2-pass way (the first pass computes row sizes, the second fills column indices) or in a single pass with a maximum k (as for k-nn); a hedged usage sketch follows below.
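To illustrate the two modes with a placeholder signature (see #2028 for the actual proposal; none of these names are confirmed):

```cpp
#include <cstddef>
#include <vector>

// Placeholder declaration: if adj_ja == nullptr, only per-row sizes / offsets
// are written into adj_ia (pass 1 of the exact mode); otherwise the column
// indices are filled (pass 2). A non-negative max_k instead selects the
// single-pass mode, capping each row at max_k neighbors.
void eps_nn(const float* index, std::size_t n_index, const float* queries,
            std::size_t n_queries, std::size_t dim, float eps,
            std::size_t* adj_ia, std::size_t* adj_ja, long max_k = -1);

void example(const float* idx, std::size_t ni, const float* q, std::size_t nq,
             std::size_t dim, float eps) {
  std::vector<std::size_t> ia(nq + 1);
  eps_nn(idx, ni, q, nq, dim, eps, ia.data(), nullptr);    // pass 1: row sizes
  std::vector<std::size_t> ja(ia[nq]);
  eps_nn(idx, ni, q, nq, dim, eps, ia.data(), ja.data());  // pass 2: fill columns
}
```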