rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[DOC] In-depth documentation for Handles and Streams #854

Open divyegala opened 5 years ago

divyegala commented 5 years ago

From what I observed, and from a conversation with @cjnolet, every cuML algorithm creates a handle that uses the default stream (NULL; we checked down to the cumlHandle_impl constructor). Greater concurrency can be achieved when running two things together simply by giving them different handles and streams, but without knowing this, users lose out on a powerful feature (I know it is covered in the Developer Guide). It would be nice to have the Handle class documented on docs.rapids.ai, along with an example notebook demonstrating how this concurrency is achieved.
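
For illustration, a minimal sketch of the pattern being described, assuming a cuml.Handle class and estimators that accept a handle= keyword argument; the exact import path of Handle and the way a non-default stream is attached to it have changed across releases, so the stream= argument below is an assumption rather than a documented signature:

```python
import cupy as cp
import cuml

# Two small toy datasets on the GPU.
X1 = cp.random.rand(100_000, 20, dtype=cp.float32)
y1 = cp.random.rand(100_000, dtype=cp.float32)
X2 = cp.random.rand(100_000, 20, dtype=cp.float32)
y2 = cp.random.rand(100_000, dtype=cp.float32)

# Assumption: each handle can be constructed around its own non-default
# CUDA stream, so work submitted through different handles may overlap.
stream1 = cp.cuda.Stream(non_blocking=True)
stream2 = cp.cuda.Stream(non_blocking=True)
handle1 = cuml.Handle(stream=stream1)  # stream= is assumed, not a documented signature
handle2 = cuml.Handle(stream=stream2)

model1 = cuml.LinearRegression(handle=handle1)
model2 = cuml.LinearRegression(handle=handle2)

model1.fit(X1, y1)
model2.fit(X2, y2)

# Synchronize both streams before reading results on the host.
handle1.sync()
handle2.sync()
```

Work submitted through the two handles is independent, so the GPU is free to overlap the kernels when resources allow.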

cjnolet commented 5 years ago

1001

cjnolet commented 3 years ago

@divyegala, given the progression of the handle API, its movement to raft, and our documentation in general, do you feel this issue is still relevant, or can it be closed now?

beckernick commented 2 years ago

This discussion came up again today, as the DBSCAN Python docs indicate it's possible to use Handles/streams for concurrency. We noted several points:

cc @divyegala

divyegala commented 2 years ago

@cjnolet @dantegd revisiting this topic as it came up today. After my own experiments trying to achieve concurrency when running multiple cuML models on a single host thread or across multiple host threads, I am capturing here what we need to do to eventually achieve full concurrency.

For a single host thread, multiple models (this basically requires the end-to-end Python API call to be asynchronous):

  1. Moving any data from device to host (pageable memory) is synchronous with respect to the host, so it will not work
  2. No synchronous host functions
  3. Every device function must be asynchronous with respect to its stream. I foresee this paradigm working well for the predict class of functions of simpler algorithms like linear models (see the sketch after this list).
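
To make the requirement concrete, a hedged sketch of what the single-thread paradigm could look like, assuming predict only enqueues work on its handle's stream and that keeping the output type set to CuPy avoids the pageable device-to-host copy from point 1 (Handle usage is as in the earlier sketch and may differ across releases):

```python
import cupy as cp
import cuml

cuml.set_global_output_type("cupy")  # keep predictions on the device, avoiding pageable D2H copies

X = cp.random.rand(100_000, 20, dtype=cp.float32)
y = cp.random.rand(100_000, dtype=cp.float32)

# One handle per model (ideally each wrapping its own non-default stream).
handle_a = cuml.Handle()
handle_b = cuml.Handle()
model_a = cuml.LinearRegression(handle=handle_a).fit(X, y)
model_b = cuml.Ridge(handle=handle_b).fit(X, y)

# If predict is fully asynchronous on each handle's stream, these two calls
# can overlap on the GPU even though they are issued from one host thread.
pred_a = model_a.predict(X)
pred_b = model_b.predict(X)

# Only synchronize once the results are actually needed on the host.
handle_a.sync()
handle_b.sync()
```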

For multiple host threads, multiple models:

  1. Device-to-host transfers (pageable memory) are okay
  2. Synchronous host functions are okay
  3. Synchronous device functions are okay with respect to their stream. This paradigm may work well for both the fit and predict classes of functions (see the sketch after this list).
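
A hedged sketch of the multi-threaded paradigm, using one handle per host thread and per model; for the fits to actually overlap, cuML would additionally need the per-thread default stream build (or explicit non-default streams per handle) discussed further down:

```python
from concurrent.futures import ThreadPoolExecutor

import cupy as cp
import cuml

def fit_one(seed):
    # Each host thread gets its own handle; with per-thread default streams
    # (or an explicit non-default stream per handle), the fits may overlap.
    handle = cuml.Handle()
    rng = cp.random.default_rng(seed)
    X = rng.random((100_000, 20), dtype=cp.float32)
    y = rng.random(100_000, dtype=cp.float32)
    model = cuml.LinearRegression(handle=handle)
    model.fit(X, y)
    handle.sync()  # finish this thread's GPU work before handing the model back
    return model

with ThreadPoolExecutor(max_workers=4) as pool:
    models = list(pool.map(fit_one, range(4)))
```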

Common to both paradigms:

  1. Device allocations and de-allocations are blocking calls, so they should use the asynchronous allocation API (cudaMallocAsync); a sketch of the corresponding RMM configuration follows below.
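
On the Python side, the closest existing knob is RMM's cudaMallocAsync-backed memory resource. A hedged sketch, assuming RMM serves as the allocator for both cuML and CuPy (note that the CuPy allocator helper has moved to rmm.allocators.cupy.rmm_cupy_allocator in newer RMM releases):

```python
import cupy as cp
import rmm

# Use the stream-ordered cudaMallocAsync/cudaFreeAsync pool instead of the
# blocking cudaMalloc/cudaFree path (requires CUDA 11.2+).
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# Route CuPy allocations through RMM too, so intermediate arrays created by
# cuML's Python layer come from the same stream-ordered allocator.
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)
```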

Challenges/tasks needed to be performed to achieve both the above paradigms:

  1. Compile with per-thread default stream behavior (nvcc --default-stream per-thread)
  2. Make memory-management dependencies like CuPy/RMM use asynchronous APIs and make them compliant with per-thread default stream behavior
  3. Explore device-to-host transfers into pinned host memory, which are asynchronous (see the sketch below)
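
On point 3, a hedged sketch of a device-to-host copy into pinned host memory using CuPy; cupyx.empty_pinned allocates page-locked host memory, and ndarray.get with out= and a non-default stream issues the copy on that stream, though how much actually overlaps depends on the CuPy version:

```python
import cupy as cp
import cupyx

stream = cp.cuda.Stream(non_blocking=True)

# Device array produced by some model, e.g. predictions kept as CuPy output.
d_pred = cp.random.rand(1_000_000, dtype=cp.float32)

# Page-locked (pinned) host buffer: copies into pinned memory can be issued
# asynchronously on a stream, unlike copies into ordinary pageable numpy arrays.
h_pred = cupyx.empty_pinned(d_pred.shape, dtype=d_pred.dtype)

# Enqueue the device-to-host copy on the non-default stream.
d_pred.get(stream=stream, out=h_pred)

# ... other work can be submitted here while the copy is in flight ...

# Block the host only when the data is actually needed.
stream.synchronize()
print(h_pred[:5])
```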

Reference for cudaMemcpyAsync behavior: https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async

leonardottl commented 1 year ago

@divyegala @cjnolet @dantegd I am wondering if there are any updates regarding executing fit of multiple cuML models concurrently using different threads and streams. Does anyone know of a working example of this?

mnlcarv commented 4 days ago

@leonardottl, did you find any working example of executing fit of multiple cuML models concurrently using different threads and streams? Or maybe even a simpler example of a distributed fit on a single cuML model (e.g., K-means) so that multiple concurrent fit tasks can be executed on the same GPU?