As datasets and models grow, the demand for more powerful and efficient GPUs grows with them. Often a single GPU is not adequate for an ML use case. Rather than upgrading the GPU hardware, the workload can be distributed either across several GPUs on one node, or across multiple nodes each containing one or more GPUs. The latter approach is especially attractive because a single machine can fit only so many GPUs.
In this talk, we will explore how to distribute a machine learning workflow across several GPU-equipped nodes in a cloud environment. We will use PyTorch to carry out the training, and Kubeflow together with the Node Feature Discovery and GPU operators to distribute the workload.
Attendees will learn how to overcome the GPU hardware limits of single-node training by taking advantage of GPUs on other machines, thereby maximizing GPU utilization in an open cloud environment.
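To make the idea concrete, here is a minimal sketch of the kind of training script such a setup runs: PyTorch's DistributedDataParallel (DDP) synchronizes gradients across worker processes, and an orchestrator such as a Kubeflow PyTorchJob injects the RANK, WORLD_SIZE, and MASTER_ADDR/MASTER_PORT environment variables into each pod. The defaults below are assumptions so the script also runs standalone as a single process; the model and data are placeholders, not the talk's actual workload.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # In a multi-node job the launcher sets these; default them for a
    # standalone single-process run (assumed values, for illustration).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    # "gloo" works on CPU; switch to "nccl" when each worker has a GPU.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Wrapping the model in DDP makes backward() all-reduce gradients
    # across every worker, so all replicas stay in sync.
    model = DDP(torch.nn.Linear(10, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    loss = None
    for _ in range(3):  # toy training loop on random data
        x, y = torch.randn(8, 10), torch.randn(8, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()  # gradient synchronization happens here
        opt.step()

    dist.destroy_process_group()
    return float(loss)

if __name__ == "__main__":
    main()
```

The same script scales from one laptop process to many pods purely through the environment variables, which is what makes it a good fit for a cluster scheduler.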
Presentation Day
April 19, 2022
Presenter(s)
@heyselbi
Actionable Items
(The following actions would be carried out by the organizers of the Operate First Data Science Community Meetup)
Title
Distributed ML workloads on OpenShift