cjnolet opened this issue 1 year ago
Note, https://github.com/rapidsai/raft/pull/1661 goes the other way around: it minimizes the use of the utilities in cpp/bench/ann/common to only one executable and decouples it from raft completely. As a result, only a handful of CUDA functions are used, isolated in `cuda_stub.hpp`. It links CUDA at runtime via `dlopen` and can proceed even if CUDA is not available.
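The runtime-loading pattern is roughly the following (a minimal sketch, not the actual `cuda_stub.hpp` code; the unversioned library name and the simplified error handling are assumptions for illustration):

```cpp
// Minimal sketch of loading the CUDA runtime via dlopen, in the spirit of
// cuda_stub.hpp (not the actual code). The unversioned "libcudart.so" name
// and the error handling are simplified for illustration.
#include <dlfcn.h>
#include <cstdio>

// cudaRuntimeGetVersion(int*) returns cudaError_t, represented as int here
// so this sketch does not need CUDA headers at compile time.
using cudaRuntimeGetVersion_t = int (*)(int*);

int main() {
  void* cudart = dlopen("libcudart.so", RTLD_NOW | RTLD_LOCAL);
  if (cudart == nullptr) {
    std::printf("CUDA runtime not found; continuing in CPU-only mode.\n");
    return 0;  // the benchmark can still run its CPU-only algorithms
  }
  auto get_version =
    reinterpret_cast<cudaRuntimeGetVersion_t>(dlsym(cudart, "cudaRuntimeGetVersion"));
  int version = 0;
  if (get_version != nullptr && get_version(&version) == 0 /* cudaSuccess */) {
    std::printf("CUDA runtime version: %d\n", version);
  }
  dlclose(cudart);
  return 0;
}
```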
@tfeher and @achirkin, I'm still not convinced that #1661 is the direction we want to take the benchmarks. First, we recently removed `dlopen` calls from raft comms because they were terribly hard for users to manage, and I don't know that we want to consolidate all of the executables into a single executable.
That being said, I'm not totally against the use of gbench in the benchmarks, but for now we are proceeding with the CPU-only work by isolating the builds of the existing executables.
Hi @cjnolet, I would also hesitate to use `dlopen` in a library component, but in the PR it is only ever used in the `ANN_BENCH` executable. Nothing should link against an executable, so it won't affect anybody.
Having a single executable for all benchmarks is a must-have for me as a devtech, because it tremendously simplifies benchmarking and profiling: for example, it's much easier to run a single nsys profile and get a single output file when comparing multiple implementations.
With the latest fixes to the python scripts, the PR is now ready. Of course, in the end the decision is up to you. Just let me know if you prefer to keep the current benchmarks as is; in that case I'll just pivot the gbench branch into a separate repo for internal use. But I cannot keep it up to date with the ongoing changes indefinitely.
> but in the PR it is only ever used in the `ANN_BENCH` executable. Nothing should link against an executable, so it won't affect anybody.
Unfortunately, we are now pointing end users toward these benchmarks, and they are quickly becoming a vital part of the RAFT toolset. I think we need to adjust the perspective that these are just for internal use and that we shouldn't be concerned with the user experience here. `dlopen` has bitten us too many times in the past, and I'd like to avoid the issues we faced trying to support it.
> Having a single executable for all benchmarks is a must-have for me as a devtech, because it tremendously simplifies benchmarking and profiling: for example, it's much easier to run a single nsys profile and get a single output file when comparing multiple implementations.
I think what you might be saying here is that having a single executable for the RAFT algorithm benchmarks is a must-have, right? The RAFT algorithm benchmarks don't need a CPU-only option, so I'm not sure why all of the benchmarks need to be consolidated into a single binary when only HNSW (and potentially some of the FAISS algorithms) need to be CPU-only. Separate binaries for RAFT, FAISS CPU, FAISS GPU, HNSW, and GGNN should be fine, no? Then we only need to link CUDA for the binaries that require a GPU, and we can more easily separate the other binaries into CPU-only packaging.
> I'll just pivot the gbench branch into a separate repo for internal use.
Even if this is the direction you decide to take these changes, we still need to make sure that any benchmarks we publish publicly are performed with the tools that are publicly available in RAFT, so that we aren't sharing results users cannot directly reproduce. My suggestion would be that we try to make this work, and if my suggestion above is enough for that, then I think we can consider it an alternative to pulling this out.
The way it works now in the PR is that we have a single executable, `ANN_BENCH`, and multiple shared libraries, one per algorithm. `ANN_BENCH` uses `dlopen` to load the algorithms and (optionally) CUDA, so it does not depend on CUDA. Hence you can build all targets and ship only the relevant ones (`libhnswlib.so` and `ANN_BENCH`, which would run on CPU). You can also ship all algorithms/libraries; in the absence of CUDA, some benchmarks will be disabled and others will run just fine. So the only difference is that all benchmarks can be called at the same time via a single executable, while retaining the modularity of the current main-branch approach.
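To make the mechanism concrete, the plugin loading looks roughly like this (a hypothetical sketch, not the PR's actual code; the `AnnAlgo` interface and the `create_algo` symbol name are illustrative assumptions):

```cpp
// Hypothetical sketch of the plugin pattern described above (not the PR's
// actual code): each algorithm is a shared library exporting a C factory
// function, and the single ANN_BENCH executable dlopen()s it on demand.
// The AnnAlgo interface and the "create_algo" symbol name are assumptions.
#include <dlfcn.h>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>

struct AnnAlgo {  // minimal benchmark-facing interface
  virtual ~AnnAlgo() = default;
  virtual void build(const float* data, std::size_t n_rows, std::size_t dim) = 0;
  virtual void search(const float* queries, std::size_t n_queries, int k) = 0;
};

using create_algo_t = AnnAlgo* (*)(const char* param_json);

std::unique_ptr<AnnAlgo> load_algo(const std::string& lib_path, const char* params) {
  // RTLD_LOCAL keeps each algorithm's symbols isolated from the others.
  void* handle = dlopen(lib_path.c_str(), RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) { throw std::runtime_error(dlerror()); }
  auto create = reinterpret_cast<create_algo_t>(dlsym(handle, "create_algo"));
  if (create == nullptr) { throw std::runtime_error("no create_algo in " + lib_path); }
  return std::unique_ptr<AnnAlgo>(create(params));  // handle deliberately kept open
}
```

The executable can then iterate over the configured libraries, loading each one and running the same benchmark loop against the common interface.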
I need to be able to run all benchmarks via a single executable and produce a single .csv/json report, because I often cannot rely on multiple layers of python/perl wrappers: they require setting up a conda environment and lack most of the needed functionality.
If having separate executables is the only issue you have with the PR, I can change the cmake config to produce both: shared libraries + `ANN_BENCH`, and the executables for every algorithm (which don't use `dlopen`).
> If having separate executables is the only issue you have with the PR, I can change the cmake config to produce both: shared libraries + `ANN_BENCH`, and the executables for every algorithm (which don't use `dlopen`).
`dlopen` is the main deal breaker there, so we need to fix that part, and not by overcomplicating the cmake to fork two different build paths. I'm still not sure I understand the motivation for having only a single executable: nsys and csv output will work just the same whether we provide a bash or python script that runs multiple executables, or a single executable.

I'm also fine producing two individual executables (one which is CPU-only and one with GPU+CPU). Neither of the options I've mentioned above requires `dlopen`, and both still achieve what you are looking for.
It takes time and effort to combine multiple csv files into one, because the columns are different; the preamble/context is also important, as it contains the information about the dataset and GPU.
@cjnolet please have a look at the latest update. I've set up the cmake to produce one executable per benchmark by default, same as before; they don't link against or use `dl` functionality, and `HNSWLIB_ANN_BENCH` does not link against CUDA either.

I hope we can still retain `ANN_BENCH` as an optional feature that can be enabled at configure time; it really speeds up productivity when I need to profile and compare some algorithms multiple times a day (e.g. I have the original prototype ivf-pq implementation in my internal branch as well, and it's very convenient to run both implementations at the same time).
Thanks @achirkin, but again, `dlopen` is going to be a deal breaker, whether optional or not. I'm still confused why it's needed to begin with: thus far we've built many CPU-only packages (take a look at cuml, as an example) and haven't required it. It seems like a simple pre-processor variable would avoid the need for it altogether, wouldn't it?
The main reason I have `dlopen` in `ANN_BENCH` is not the CPU-only/GPU issue (`HNSWLIB_ANN_BENCH` now does not have `dlopen` and still does not depend on CUDA). I need it to be able to plug-and-play multiple benchmark implementations and then, ideally, run them from a single executable. A typical scenario: I build `ANN_BENCH` and all shared libraries + extra internal implementations once and copy them to a development machine with the latest CUDA drivers, for which no conda package is available yet. Then I repeatedly make changes to one of the implementations and copy it over as well. Then I run pairs of (or all) implementations and compare times. This especially saves time when the dataset is 100+ GB and it takes a few minutes to load it from slow/network storage into RAM (loading it only once rather than once per model).
At this point, `dlopen` is only ever used in `ANN_BENCH`, which is disabled by default. Can you consider it a development option that is not going to be shipped in any of the packages? Much like the Debug build, which we don't actually test in CI and don't ship?
While working on #1304, I noticed there are still several places in cpp/bench/ann/common where CUDA runtime calls are being made directly, even though there are utilities in RAFT that could be used instead. There are also some calls that allocate and free memory directly rather than using RMM. I didn't want to change these initially for fear that it might introduce some unexpected bugs, but this should be revisited and the CUDA runtime calls consolidated to use RAFT APIs wherever possible.
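For example, a raw cudaMalloc/cudaMemcpyAsync/cudaStreamSynchronize sequence could be rewritten with existing RAFT/RMM primitives roughly like this (a sketch: `raft::device_resources`, `rmm::device_uvector`, and `raft::copy` are real RAFT/RMM APIs, but the surrounding function is illustrative):

```cpp
// Sketch of the suggested cleanup: RAII device memory via RMM and stream
// handling via raft::device_resources instead of raw cudaMalloc/cudaFree and
// cudaStreamSynchronize. The APIs used here exist in RAFT/RMM; the function
// itself is illustrative.
#include <raft/core/device_resources.hpp>
#include <raft/util/cudart_utils.hpp>  // raft::copy
#include <rmm/device_uvector.hpp>
#include <vector>

void upload_dataset(raft::device_resources const& res, std::vector<float> const& host_data) {
  // Stream-ordered, automatically freed device allocation (no cudaMalloc/cudaFree).
  rmm::device_uvector<float> d_data(host_data.size(), res.get_stream());
  // Host-to-device copy through the RAFT utility (no raw cudaMemcpyAsync).
  raft::copy(d_data.data(), host_data.data(), host_data.size(), res.get_stream());
  res.sync_stream();  // replaces an explicit cudaStreamSynchronize
}
```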