Open StefanoFioravanzo opened 3 weeks ago
Thank you for creating this @StefanoFioravanzo! /good-first-issue
@andreyvelich: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.
Hi, I'm new to Kubeflow and would like to work on this. A couple of questions:
Thank you for your time. @andreyvelich @StefanoFioravanzo
I want to take this up, and would love any advice @StefanoFioravanzo @andreyvelich
@LogicalGuy77 @aryan-py Thanks for stepping up!
Docs will be contributed to the Kubeflow website, more specifically under Katib's Reference section here: https://www.kubeflow.org/docs/components/katib/reference/
I would recommend starting with a Google Doc, especially since you are not familiar with these concepts. This will allow the project owners to review faster. You can then move the content to a PR once it's in a good state. I suggest sharing the Google Doc with the whole kubeflow-discuss Google group, with commenter privileges (if you do that, remember to untick the option to notify recipients; otherwise everyone in the Kubeflow Google group will be spammed).
Thanks for your interest @LogicalGuy77 and @aryan-py!
Yes, @StefanoFioravanzo is right, we are planning to contribute these docs to the Kubeflow website: https://github.com/kubeflow/website
Just a small correction: we should use the Training Operator user guides section to explain how various APIs work with Training Operator to achieve fault tolerance: https://www.kubeflow.org/docs/components/training/user-guides/
I would suggest starting with the RestartPolicy API to handle ML training Pod restarts, and the Elastic Policy API for fault-tolerant PyTorch on Kubernetes.
cc @kubeflow/wg-training-leads
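For anyone new to the two APIs mentioned above, here is a minimal sketch of where they would sit in a PyTorchJob manifest, built as a plain Python dict. The field names (`restartPolicy`, `elasticPolicy`, `minReplicas`, `maxReplicas`, `maxRestarts`, `rdzvBackend`) and the list of restart policy values reflect my understanding of the `kubeflow.org/v1` API and should be checked against the Training Operator API reference. The helper `build_elastic_pytorchjob`, the job name, and the image are hypothetical and only for illustration.

```python
import json

# RestartPolicy values as I understand the v1 API (an assumption to verify).
RESTART_POLICIES = {"Always", "OnFailure", "Never", "ExitCode"}


def build_elastic_pytorchjob(name, image, min_replicas, max_replicas,
                             restart_policy="OnFailure", max_restarts=3):
    """Hypothetical helper: return a PyTorchJob manifest as a dict."""
    if restart_policy not in RESTART_POLICIES:
        raise ValueError(f"unknown restart policy: {restart_policy!r}")
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            # Elastic Policy: bounds within which the worker count may vary.
            "elasticPolicy": {
                "rdzvBackend": "c10d",
                "minReplicas": min_replicas,
                "maxReplicas": max_replicas,
                "maxRestarts": max_restarts,
            },
            "pytorchReplicaSpecs": {
                "Worker": {
                    "replicas": min_replicas,
                    # RestartPolicy: what happens when a training Pod exits.
                    "restartPolicy": restart_policy,
                    "template": {
                        "spec": {
                            "containers": [{"name": "pytorch", "image": image}]
                        }
                    },
                }
            },
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; substitute your own training image.
    manifest = build_elastic_pytorchjob(
        "elastic-example", "example.com/train:latest", 1, 3
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, assuming the Training Operator is installed in the cluster. The docs discussed in this issue would be the place to explain how these two settings interact.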
@andreyvelich oops, sorry, indeed we are talking about the Training Operator. But shouldn't this go under Reference? I think we are talking about how fault tolerance is designed in the operator.
What kind of user guides are you thinking about?
I guess we can add two things:
I've been going through lots of code and documentation and have prepared an initial draft for Restart Policy: Google Doc. I've given commenter access to the kubeflow-discuss Google group. I would love to have your guidance to improve it further. I was thinking of dividing the task into three parts:
Could you elaborate on what kind of diagrams you are looking for?
Thank you for your time. @andreyvelich @StefanoFioravanzo
Hi @andreyvelich @StefanoFioravanzo,
Just wanted to follow up on this issue. I've added docs for Elastic Policy as well. Would you be able to take a look and provide feedback when you have a chance?
Thanks!
What you would like to be added?
Since @andreyvelich commented:
We should write some reference architecture docs to expose these features to our users.
Why is this needed?
Users do not have a reference to understand and appreciate the fault tolerance capabilities offered by the Training Operator.
Love this feature?
Give it a 👍 We prioritize the features with most 👍