[FEA] Improve Vamana/DiskANN build

The initial Vamana build has a limited feature set, and several items remain to improve its performance and usability. Once they are addressed, we can consider moving it out of the "experimental" namespace. The items are:

[ ] Reduce global memory footprint - this limits the dataset and graph sizes that can be built. Simple approaches such as batching reverse edge generation would help, but the build would still require the entire dataset and graph to be resident in device memory. We also need to investigate storing the graph in host memory and the resulting performance impact.
[ ] Add support for datasets of any dimension (currently there are alignment issues when the dimension is odd or < 16 for uint8 / int8 data).
[ ] Auto-select and optimize queue_size for different visited_size values. Also, add support for any visited_size value (currently only powers of 2 are supported).
[ ] Improve performance for high-degree graph builds. This seems to be limited by GreedySearch, which becomes increasingly costly as the degree grows. Things to investigate include reducing shared memory / register usage to improve occupancy, improving priority queue efficiency, or trying other data structures.
[ ] Add additional distance metrics - currently only L2 is supported. At least inner product and cosine are needed.
[ ] Add support in the C and Python APIs.
[ ] Create documentation and collect more extensive benchmark results.
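
As a possible interim workaround for the dimension and visited_size restrictions above, a caller could round values up on the host before building. The sketch below is only an illustration: `next_pow2`, `padded_dim`, and `pad_rows` are hypothetical helper names, not part of cuVS, and the alignment of 16 is an assumption taken from the "< 16" note above. Zero-padding each row keeps L2 distances unchanged, at the cost of extra memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a cuVS function): smallest power of two >= n.
// A 64-bit accumulator avoids overflow for large 32-bit inputs.
inline std::uint64_t next_pow2(std::uint32_t n) {
  std::uint64_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// Hypothetical helper: round dim up to a multiple of align.
// The default of 16 is an assumption, not a documented cuVS requirement.
inline std::size_t padded_dim(std::size_t dim, std::size_t align = 16) {
  assert(align > 0);
  return (dim + align - 1) / align * align;
}

// Copy a row-major n x dim dataset into an n x padded_dim buffer,
// zero-filling the tail of every row.
template <typename T>
std::vector<T> pad_rows(const std::vector<T>& data, std::size_t n,
                        std::size_t dim, std::size_t align = 16) {
  const std::size_t pd = padded_dim(dim, align);
  std::vector<T> out(n * pd, T{0});
  for (std::size_t i = 0; i < n; ++i) {
    std::copy(data.begin() + i * dim, data.begin() + (i + 1) * dim,
              out.begin() + i * pd);
  }
  return out;
}
```

For example, a requested visited_size of 100 would be rounded up to 128, and a 7-dimensional int8 dataset would be stored with 16 values per row. Native support for arbitrary sizes would still be preferable, since padding wastes memory that the first item is trying to save.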

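For the distance metric item, a host-side reference like the following could serve as a correctness check when new metrics are added to the device code. It is a minimal sketch under stated assumptions: the `Metric` enum and `reference_distance` function are hypothetical names, not the cuVS API, and the conventions used here (squared L2, negated inner product, cosine as one minus similarity) are one common choice rather than a confirmed cuVS definition.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical enum for illustration only; not the cuVS distance type.
enum class Metric { L2Squared, InnerProduct, Cosine };

// Reference distance between two dim-length vectors, accumulated in float.
// Smaller return values mean "closer" for every metric.
template <typename T>
float reference_distance(const T* a, const T* b, std::size_t dim, Metric m) {
  float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f, l2 = 0.0f;
  for (std::size_t i = 0; i < dim; ++i) {
    const float x = static_cast<float>(a[i]);
    const float y = static_cast<float>(b[i]);
    dot += x * y;
    norm_a += x * x;
    norm_b += y * y;
    l2 += (x - y) * (x - y);
  }
  switch (m) {
    case Metric::L2Squared:
      return l2;
    case Metric::InnerProduct:
      // Negated so that a larger inner product gives a smaller distance
      // (assumed convention).
      return -dot;
    case Metric::Cosine: {
      const float denom = std::sqrt(norm_a) * std::sqrt(norm_b);
      // Treat a zero vector as maximally distant (assumed convention).
      return denom > 0.0f ? 1.0f - dot / denom : 1.0f;
    }
  }
  return l2;  // Unreachable for valid enum values.
}
```

Accumulating in float keeps the uint8 / int8 cases from overflowing their own type, which matters for the integer datasets mentioned above.
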
rapidsai/cuvs #393