ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0
982 stars 330 forks

Add RayJob training example using pytorch resnet image classifier #2107

Closed andrewsykim closed 1 month ago

andrewsykim commented 2 months ago

Why are these changes needed?

Add an example RayJob based on the "Finetuning a Pytorch Image Classifier with Ray Train" example.

This example will be referenced for user guides that demonstrate distributed checkpointing with GCSFuse.

Also updates the existing pytorch text classifier with an example using GCSFuse.
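
For readers unfamiliar with the pattern, here is a minimal sketch of what checkpointing to a shared filesystem looks like from a worker's point of view. It is not the code added in this PR and it does not use the Ray Train API. It assumes only that a GCSFuse mount appears to each worker as an ordinary directory path. The function names (`save_shard`, `latest_complete_epoch`, `load_shard`) and the `epoch_NNNN/rank_R.json` layout are hypothetical, chosen for illustration.

```python
import json
import os
import tempfile


def save_shard(shared_dir, epoch, rank, state):
    """Write one worker's checkpoint shard under <shared_dir>/epoch_<NNNN>/."""
    epoch_dir = os.path.join(shared_dir, f"epoch_{epoch:04d}")
    os.makedirs(epoch_dir, exist_ok=True)
    path = os.path.join(epoch_dir, f"rank_{rank}.json")
    with open(path, "w") as f:
        json.dump(state, f)
    return path


def latest_complete_epoch(shared_dir, world_size):
    """Return the highest epoch for which every rank wrote a shard, else None."""
    complete = []
    for name in os.listdir(shared_dir):
        if not name.startswith("epoch_"):
            continue
        epoch_dir = os.path.join(shared_dir, name)
        shards = [n for n in os.listdir(epoch_dir) if n.startswith("rank_")]
        if len(shards) == world_size:
            complete.append(int(name.split("_")[1]))
    return max(complete) if complete else None


def load_shard(shared_dir, epoch, rank):
    """Read back the shard a given rank wrote for a given epoch."""
    path = os.path.join(shared_dir, f"epoch_{epoch:04d}", f"rank_{rank}.json")
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    # A temporary directory stands in for the shared mount path (assumption:
    # in a real cluster this would be the directory where the bucket is mounted).
    with tempfile.TemporaryDirectory() as shared_dir:
        world_size = 2
        for epoch in range(3):
            for rank in range(world_size):
                save_shard(shared_dir, epoch, rank, {"epoch": epoch, "rank": rank})
        resume_from = latest_complete_epoch(shared_dir, world_size)
        print("resume from epoch:", resume_from)
        print(load_shard(shared_dir, resume_from, 0))
```

Because every worker sees the same directory, a restarted job can locate the most recent epoch that all ranks finished writing and resume from it. In practice a training library handles this bookkeeping; the sketch only shows why a shared mount makes it possible.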

andrewsykim commented 1 month ago

> I prefer not to update files under pytorch-text-classifier/. If we update the YAML/Python files, users will fail to follow the original documentation, which does not focus on checkpointing.

I removed the changes to the pytorch-text-classifier sample. I wanted a variant of it that uses a shared filesystem, but I guess we can keep the two examples separate.

kevin85421 commented 1 month ago

The CI failure is not related to this PR. Merge.