kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

PyTorchJobClient not found #2126

Closed thatsdone closed 1 month ago

thatsdone commented 1 month ago

Hello, I noticed that PyTorchJobClient is now missing.

I installed training-operator (1.7.0) from PyPI on my Ubuntu 22.04 box. TFJobClient also appears to be missing.

Here are the steps to reproduce the issue.

ubuntu@ubuntu:~$ lsb_release -a
LSB Version:    core-11.1.0ubuntu4-noarch:printing-11.1.0ubuntu4-noarch:security-11.1.0ubuntu4-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

ubuntu@ubuntu:~$ pip3 list | grep kubeflow-training
kubeflow-training      1.7.0

ubuntu@ubuntu:~$ python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from kubeflow.training import PyTorchJobClient
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'PyTorchJobClient' from 'kubeflow.training' (/home/ubuntu/.local/lib/python3.10/site-packages/kubeflow/training/__init__.py)
>>>
ubuntu@ubuntu:~$ cd NFS/src/ML/training-operator/sdk/python/
ubuntu@ubuntu:~/src/ML/training-operator/sdk/python$ git branch
* master
ubuntu@ubuntu:~/src/training-operator/sdk/python$ git log -n 1 --oneline
be5df91e (HEAD -> master, origin/master, origin/HEAD) Updated Github Action Workflows as per issue #2117 (#2123)
ubuntu@ubuntu:~/src/ML/training-operator/sdk/python$ find . -type f -name '*.py'| xargs grep PyTorchJobClient

andreyvelich commented 1 month ago

Thank you for creating this, @thatsdone! Yes, we removed PyTorchJobClient in favour of the unified TrainingClient: https://github.com/kubeflow/training-operator/pull/1719. Please use the latest version of the SDK: kubeflow-training==1.8.0rc0
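
For scripts that have to run against more than one SDK version, the import can be guarded rather than hard-coded. The following is only a minimal sketch: the helper name `resolve_client_class` is hypothetical and not part of the SDK, and the only SDK path it relies on is `kubeflow.training.api.training_client.TrainingClient`, which appears later in this thread.

```python
import importlib


def resolve_client_class(module_name, candidates):
    """Return (name, cls) for the first candidate found in module_name, else None.

    Hypothetical helper for illustration; it is not part of kubeflow-training.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package (or this submodule) is not importable in this environment.
        return None
    for name in candidates:
        cls = getattr(module, name, None)
        if cls is not None:
            return name, cls
    return None


if __name__ == "__main__":
    # Module path taken from the import shown in this thread.
    found = resolve_client_class(
        "kubeflow.training.api.training_client", ("TrainingClient",)
    )
    if found is None:
        print("TrainingClient is not importable; check the installed kubeflow-training version")
    else:
        print(f"Using {found[0]}")
```

The helper returns `None` instead of raising, so the caller can print a clear message rather than hit a bare `ImportError` like the one in the original report.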

thatsdone commented 1 month ago

@andreyvelich OK, understood. In that case, I think it would be good to start updating the documentation. :)

Here is how I checked.

$ pip3 list | grep kubeflow-training
kubeflow-training      1.8.0rc0
$ python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from kubeflow.training.api.training_client import TrainingClient
>>>
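
The same check can be scripted with the standard library instead of `pip3 list`. This is a rough sketch under stated assumptions: the helper names are hypothetical, pre-release suffixes such as `rc0` are ignored when comparing, and treating 1.8.0 as the minimum is an assumption based only on the version recommended above.

```python
import re
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version):
    """Leading numeric release components as a tuple; suffixes like 'rc0' are dropped."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


if __name__ == "__main__":
    version = installed_version("kubeflow-training")
    if version is None:
        print("kubeflow-training is not installed")
    elif release_tuple(version) < (1, 8, 0):  # assumed minimum, see the note above
        print(f"kubeflow-training {version} found; consider upgrading")
    else:
        print(f"kubeflow-training {version} found")
```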

Thanks a lot! I'm closing this issue.

andreyvelich commented 1 month ago

Yes, unfortunately this doc is out of date: https://github.com/kubeflow/training-operator/tree/master/sdk/python#documentation-for-api-endpoints. We will work on it as part of this: https://github.com/kubeflow/katib/issues/2081