rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Automate distributed ML pytests #1910

Open cjnolet opened 4 years ago

cjnolet commented 4 years ago

Currently, cuML has no CI resources for multi-GPU testing. We have been told it is possible, but resource constraints may initially limit it to only 2 GPUs.

Recently, a cuML bug was found that appears to have been introduced several months ago and that required more than 2 GPUs to reproduce. cuML relies on several libraries under active development, which creates the need for more frequent verification in multi-GPU (and eventually multi-node) environments. We should test against the bleeding-edge versions of these libraries so that we can find breaking updates early.
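Since some bugs only show up above a certain GPU count, it may help for each test to state how many GPUs it needs, so that a 2-GPU CI box and a larger DGX can run the same suite. The following is only a minimal sketch: `count_visible_gpus` and `skip_reason` are hypothetical helper names, not part of cuML, and the parsing of `CUDA_VISIBLE_DEVICES` is deliberately simplified (it just counts the comma-separated entries).

```python
import os
import subprocess


def count_visible_gpus(env=None):
    """Best-effort count of the GPUs this process can see (hypothetical helper)."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is not None:
        # Simplification: count the comma-separated entries; an empty
        # string hides every device.
        return len([d for d in visible.split(",") if d.strip()])
    try:
        # `nvidia-smi -L` prints one line per device, each starting with "GPU ".
        out = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        # No driver or no nvidia-smi on this machine: treat it as zero GPUs.
        return 0
    return sum(1 for line in out.splitlines() if line.startswith("GPU "))


def skip_reason(required, available):
    """Return a skip message if too few GPUs are available, else None."""
    if available >= required:
        return None
    return f"needs {required} GPUs, found {available}"
```

A test that needs 4 GPUs could then be wrapped in `pytest.mark.skipif(skip_reason(4, count_visible_gpus()) is not None, reason="needs 4 GPUs")`, so it is skipped on smaller machines and runs on the DGX.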

Ideally, we would run the multi-GPU and multi-node pytests automatically at least once daily, if not more often. These tests should also run against the bleeding-edge versions of Dask and UCX-py.

We should schedule these as cron jobs on a DGX until multi-GPU CI can support more than 2 GPUs.
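As a rough illustration of the cron job idea, a small Python driver could build the pytest command, write a dated JUnit XML report, and be invoked from a crontab entry. This is a sketch under stated assumptions: the function names, the report file name, and all paths are hypothetical, and setting up an environment with the latest Dask and UCX-py builds is left out entirely.

```python
import subprocess
import sys
from datetime import date
from pathlib import Path


def build_pytest_command(test_dir, report_dir, today=None):
    """Build the pytest invocation for one nightly run (hypothetical layout)."""
    today = today or date.today()
    # One JUnit XML report per day, so failures can be traced to a date.
    report = Path(report_dir) / f"mg-pytest-{today.isoformat()}.xml"
    return [sys.executable, "-m", "pytest", str(test_dir), "-v",
            f"--junitxml={report}"]


def cron_line(script_path, hour=2, minute=0):
    """Format a crontab entry that runs the script once a day."""
    return f"{minute} {hour} * * * {sys.executable} {script_path}"


def run_nightly(test_dir, report_dir):
    """Run the tests and return pytest's exit code (0 means all passed)."""
    Path(report_dir).mkdir(parents=True, exist_ok=True)
    return subprocess.run(build_pytest_command(test_dir, report_dir)).returncode
```

A wrapper script on the DGX would call `run_nightly(...)` with the directory holding the Dask-based tests, and `cron_line(...)` gives the line to add with `crontab -e`. Running more than once a day only requires a different schedule expression.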

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.