[ENH] IVF-* ANN post-integration TODOs

achirkin commented 2 years ago

A few issues and potential points for improvement emerged while integrating ivf-flat approximate kNN (https://github.com/rapidsai/raft/pull/652):

[ ] 1. Padding of the data dimensions. We copy the data in ivf_flat while building the index, so we can pad the data dimensionality dim to any vector size. This would improve search performance compared to the current approach of adapting the vector length veclen to dim: https://github.com/rapidsai/raft/blob/e9c0d49943a8c010d19e78a87bb70b1dadfc85ff/cpp/include/raft/spatial/knn/detail/ann_ivf_flat.cuh#L130-L132

_Originally posted by @tfeher in https://github.com/rapidsai/raft/pull/652#discussion_r890970098_
[ ] 2. Consider refactoring away managed allocations in balanced k-means. At the moment, predict accesses the pointers only on the device, whereas adjust_centers accesses them only on the host. It would make sense to change the latter to work on the device as well, or to switch to explicitly copying the data.
[x] 3. Consider improving raft::linalg::rowNorm. At the moment, RAFT's version is slower than the helpers in the PR: https://github.com/rapidsai/raft/blob/e9c0d49943a8c010d19e78a87bb70b1dadfc85ff/cpp/include/raft/spatial/knn/detail/ann_utils.cuh#L255-L262

In progress: https://github.com/rapidsai/raft/pull/1011
[x] 4. Make more flexible versions of matrix primitives. At the moment, some of the helper functions in ann_utils.cuh cannot be replaced with their counterparts in raft, because they require different input and output types.
[ ] 5. Use proper sampling in build_optimized_kmeans. At the moment, we use a simple cudaMemcpy2DAsync; proper sampling may be a more robust solution.
[ ] 6. Python wrapper. The current version needs to be updated. Shall we move it from cuml to raft along the way?
[ ] 7. MetricProcessor. MetricProcessor seems to aim at two things: (1) improve performance (speed and quality?) by normalizing the data for some metrics and (2) extend support of metrics without modifying the main kernels (e.g. cosine and dot product similarity are the same for normalized data). However, it modifies the input data in place, which can sometimes be avoided. I think we should investigate this: (a) try to avoid modifying the data, (b) check where it is really needed for performance (1).
[ ] 8. Processing NaN/missing entries. At the moment, we don't do anything special about NaN values. Some potential downstream projects (e.g. faiss), as well as end users, may need this. During search, we could impute missing entries in the data using the center vectors of the corresponding clusters. For building, we'd need something more complicated to correctly calculate cluster centers.
[x] 9. Investigate possible rare issues with recall values when n_probes == n_lists. The recall should always be 1.0 in this case; currently worked around: https://github.com/rapidsai/raft/pull/766
[ ] 10. Reduce the overhead of the launch configuration logic for IVF-PQ. The logic for selecting the launch parameters in IVF-PQ has become rather complicated and may incur a measurable overhead for a small enough work size. In particular, repeated calls to cudaGetDeviceProperties are taxing. Consider reorganizing the batching logic to perform the configuration at most once and caching cudaDeviceProp in the raft handle.

_Source: https://github.com/rapidsai/raft/pull/926#discussion_r1019580406_
[x] 11. Migrate to mdspan-based API. Use mdspan instead of raw pointers to handle input/output data.

_Source: https://github.com/rapidsai/raft/pull/926#discussion_r1023049567_
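
To make item 1 (padding of the data dimensions) concrete, here is a minimal host-side sketch of the idea: round dim up to a multiple of veclen and zero-fill the extra columns while copying. The helper names padded_dim and pad_rows are hypothetical and not part of RAFT; the real index build copies data on the device, so this only illustrates the layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: round dim up to the nearest multiple of veclen.
inline std::size_t padded_dim(std::size_t dim, std::size_t veclen)
{
  assert(veclen > 0);
  return (dim + veclen - 1) / veclen * veclen;
}

// Hypothetical helper: copy a row-major [n_rows x dim] matrix into a
// zero-padded row-major [n_rows x padded_dim(dim, veclen)] buffer.
inline std::vector<float> pad_rows(const std::vector<float>& data,
                                   std::size_t n_rows,
                                   std::size_t dim,
                                   std::size_t veclen)
{
  assert(data.size() == n_rows * dim);
  const std::size_t pdim = padded_dim(dim, veclen);
  std::vector<float> out(n_rows * pdim, 0.0f);  // padding columns stay zero
  for (std::size_t r = 0; r < n_rows; ++r) {
    for (std::size_t c = 0; c < dim; ++c) {
      out[r * pdim + c] = data[r * dim + c];
    }
  }
  return out;
}
```

Zero padding leaves L2 distances and inner products unchanged as long as the queries are padded the same way, so the search kernels could then assume a fixed veclen regardless of the original dim.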

cjnolet commented 2 years ago

This code in the RBC algorithm uses existing RAFT primitives to sample a small number of items from an input matrix without replacement, and it should be useful for implementing item 5 above (constructing a k-means training set by sampling rows from X without replacement).
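
For illustration only, sampling rows without replacement can be sketched on the host with a partial Fisher-Yates shuffle. This is not the RBC code referred to above and it does not use RAFT primitives; sample_rows is a hypothetical name, and a real implementation would run on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: pick n_samples distinct row indices from [0, n_rows)
// using a partial Fisher-Yates shuffle (only the first n_samples slots are shuffled).
inline std::vector<std::size_t> sample_rows(std::size_t n_rows,
                                            std::size_t n_samples,
                                            unsigned seed)
{
  assert(n_samples <= n_rows);
  std::vector<std::size_t> idx(n_rows);
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::mt19937 gen(seed);
  for (std::size_t i = 0; i < n_samples; ++i) {
    // Swap slot i with a uniformly chosen slot from the not-yet-fixed tail.
    std::uniform_int_distribution<std::size_t> pick(i, n_rows - 1);
    std::swap(idx[i], idx[pick(gen)]);
  }
  idx.resize(n_samples);
  return idx;
}
```

The returned indices could then drive a gather of the selected rows into the k-means training set, in place of the strided cudaMemcpy2DAsync copy mentioned in item 5.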

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Nyrio commented 1 year ago

You can mark some items as done:

(3) #1011
(4) Various PRs: #909, #911, #912, #979, #1011

rapidsai / raft

[ENH] IVF-* ANN post-integration TODOs #711