Open cjnolet opened 3 years ago
Hi everyone, I'm a new user of RAPIDS, and I'm interested in training multiple random forest models on the same GPU. Has there been any progress on this feature, and are there alternative approaches that achieve the same result?
Most of our estimators use a CUDA stream, set on the base class, to make calls to the underlying C++ API, blocking the main thread to synchronize the stream. While this design works well for larger datasets that can utilize most or all of the GPU at once, it provides no benefit for smaller datasets, where many train/predict calls could run concurrently on the same GPU.
A good example of this is hyperparameter optimization (HPO), where the dataset may be small enough that each parameter combination could be trained in parallel on a single GPU.
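To make the HPO scenario concrete, here is a minimal sketch of the intended usage pattern: one worker thread per parameter combination, each issuing its own fit. Everything here is hypothetical stand-in code (`fit_one`, the toy grid, and the placeholder result) — a real run would construct e.g. one cuML estimator per thread, each bound to its own CUDA stream so the GPU work can overlap.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical stand-in for an estimator's fit(); in a real HPO run each
# call would train a separate model instance on its own CUDA stream.
def fit_one(params, X, y):
    # ... launch training asynchronously on this estimator's stream ...
    return {"params": params, "score": 0.0}  # placeholder result

param_grid = {"n_estimators": [10, 50], "max_depth": [4, 8]}
combos = [dict(zip(param_grid, vals)) for vals in product(*param_grid.values())]

X, y = [[0.0], [1.0]], [0, 1]  # toy data

# One worker per parameter combination; for small datasets the GPU work
# from each fit could execute concurrently on separate streams.
with ThreadPoolExecutor(max_workers=len(combos)) as pool:
    results = list(pool.map(lambda p: fit_one(p, X, y), combos))
```

The thread pool only expresses the concurrency at the Python level; whether the fits actually overlap on the device depends on each estimator using a distinct, non-blocking stream.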
Unfortunately, many of the underlying primitives and algorithms still perform explicit stream syncs, but we might still gain some additional parallelism on a single GPU by making the stream sync after calls to fit() and predict() optional. This would allow power users to run these functions as asynchronously as possible right from the Python layer, giving them control to synchronize the underlying cuML stream wrappers manually.
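The opt-out sync behavior described above could be sketched as follows. Note this is an illustrative mock, not cuML's actual API: `Stream`, `Estimator`, and the `sync=` keyword are all hypothetical stand-ins for the proposed design.

```python
# Hedged sketch of the proposal: fit() skips the blocking stream sync
# when sync=False, and the caller synchronizes manually later.
class Stream:
    """Hypothetical stand-in for cuML's internal CUDA stream wrapper."""
    def __init__(self):
        self.pending = 0
    def launch(self, work):
        self.pending += 1      # enqueue work asynchronously
    def synchronize(self):
        self.pending = 0       # block until all enqueued work finishes

class Estimator:
    def __init__(self):
        self.stream = Stream()
    def fit(self, X, y, sync=True):
        self.stream.launch(("train", X, y))
        if sync:                       # current behavior: block the caller
            self.stream.synchronize()
        return self                    # sync=False returns immediately

est = Estimator()
est.fit([[0.0]], [0], sync=False)  # returns without blocking
est.stream.synchronize()           # power user syncs explicitly
```

The key design point is that the default (`sync=True`) preserves today's blocking semantics, so only users who opt in take on the responsibility of synchronizing before reading results.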