rapidsai / raft

RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
https://docs.rapids.ai/api/raft/stable/
Apache License 2.0

[QST] num_threads on ANN latency benchmark #2320

Open pazzap123 opened 4 months ago

pazzap123 commented 4 months ago

I understand the latency test to be 1 query at a time, as far as I can tell (and throughput is to send in many concurrently to the extent there are threads via OMP). I think this code confirms that:

In `benchmark.hpp`, function `run_main`:

```cpp
if (metric_objective == Objective::LATENCY) {
  if (threads[0] != 1 || threads[1] != 1) {
    log_warn("Latency mode enabled. Overriding threads arg, running with single thread.");
    threads = {1, 1};
  }
}
```

I was a little confused by what this was trying to do in hnswlib_wrapper.h where there are thread pools created for the latency test, and search calls are submitted to the pool:

```cpp
// Create a pool if multiple query threads have been set and the pool
// hasn't been created already
bool create_pool =
  (metric_objective_ == Objective::LATENCY && num_threads_ > 1 && !thread_pool_);
if (create_pool) { thread_pool_ = std::make_unique<FixedThreadPool>(num_threads_); }
```

```cpp
// ...
if (metric_objective_ == Objective::LATENCY && num_threads_ > 1) {
  thread_pool_->submit(f, batch_size);
}
```

Am I misunderstanding this? Thanks in advance!

cjnolet commented 4 months ago

@pazzap123 your understanding of latency and throughput modes is mostly correct, except that both latency and throughput mode can have a batch size > 1. As a result, we are in effect measuring the latency of each batch.

To make this comparison fair, we use a thread pool for the CPU-based algorithms, since the GPU-based algorithms will process a batch at a time by saturating the GPU.

pazzap123 commented 4 months ago

Thank you! I made the mistake of using the command-line `-threads` option to control the available threads for batching in the latency test, even though the help says it applies to the throughput test. (I set `-batch-size=10` with `-threads=1` and then `-threads=10`, and assumed the batch would be serialized in case 1 and all 10 queries would be sent together in case 2.) For the latency test I have to set `numThreads` in the conf files instead.

When I compare performance on a high-core-count system of:

- Throughput test with `batch-size=10`, `-threads=10`
- Latency test with `batch-size=10`, `numThreads=10`

The throughput test performs better, I assume because of two effects: 1) In the latency test, all queries in a batch have to complete before the next iteration starts, so the slowest query determines that batch's time. In the throughput test, as soon as any query finishes, another can be fed in (if query 1 of the 10 is done, a query from the next iteration can be sent in immediately)?

2) Maybe a second overhead from thread synchronization (pthreads) at each batch boundary, since the pool has to wait for all queries to complete and synchronize.

I tried to illustrate this with a simple example of how things would work with some long-running queries.

[attached image: hnsw-lat-bw]