Open cjnolet opened 4 years ago
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Currently, cuML has no CI resources for multi-GPU testing. We have been told that adding them is possible, but resource constraints may initially limit CI to only 2 GPUs.
Recently, a cuML bug was found that appears to have been introduced several months ago and that required more than 2 GPUs to reproduce. cuML relies on several libraries under active development, which creates the need for more frequent verification in multi-GPU (and eventually multi-node) environments. We should test against the bleeding-edge versions of these libraries so that we find breaking updates early.
Ideally, we would execute the multi-GPU and multi-node pytests automatically, at least once daily if not more often. These tests should also run against the bleeding-edge versions of Dask and UCX-py.
We should schedule these as cron jobs on a DGX until multi-GPU CI can support more than 2 GPUs.
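As a rough starting point, the nightly job could be a small Python wrapper that cron invokes. This is only a sketch under several assumptions: the script name, the TEST_DIR path, and the MIN_GPUS threshold are hypothetical placeholders, and it assumes the active environment already has the nightly builds of Dask and UCX-py installed. It uses `nvidia-smi -L` to count visible GPUs and refuses to run on a machine with too few of them.

```python
import shutil
import subprocess
import sys

# Hypothetical values for illustration; adjust to the real repo layout.
MIN_GPUS = 3  # the bug described above needed more than 2 GPUs
TEST_DIR = "path/to/multi_gpu_tests"  # placeholder, not a real cuML path


def count_gpus(listing: str) -> int:
    """Count devices in the text printed by `nvidia-smi -L`.

    Each physical GPU is reported on a line starting with "GPU ".
    Indented sub-device lines (for example MIG instances) are ignored.
    """
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def build_pytest_command(test_dir: str) -> list:
    """Build the pytest invocation using the current interpreter."""
    return [sys.executable, "-m", "pytest", "-v", test_dir]


def main() -> int:
    """Run the multi-GPU tests if enough GPUs are visible; return an exit code."""
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; skipping multi-GPU tests")
        return 1
    listing = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    ).stdout
    gpu_count = count_gpus(listing)
    if gpu_count < MIN_GPUS:
        print(f"found {gpu_count} GPU(s), need at least {MIN_GPUS}; skipping")
        return 1
    # Propagate pytest's exit code so cron mail or log scraping can see failures.
    return subprocess.run(build_pytest_command(TEST_DIR)).returncode
```

With the usual `if __name__ == "__main__": sys.exit(main())` guard added and the file saved as, say, nightly_mg_tests.py, a crontab entry along the lines of `0 2 * * * /path/to/python /path/to/nightly_mg_tests.py >> /path/to/nightly_mg_tests.log 2>&1` would run it daily at 02:00. All paths in that entry are placeholders.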