kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0
1.45k stars, 425 forks

[SDK] Fix empty list for env variables and numpy version #2360

Closed: andreyvelich closed this 1 week ago

andreyvelich commented 1 week ago

After this PR: https://github.com/kubeflow/katib/pull/2304, the tune API doesn't work correctly.

@helenxie-bit and @quloos identified a bug when using the tune API. If the user doesn't set the env_per_trial parameter, the Experiment creation fails with this error:

"message":"admission webhook \"validator.experiment.katib.kubeflow.org\" denied the request:
invalid spec.trialTemplate: unable to convert: /spec/template/spec/containers/0/env - [] to Job
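
The error above is consistent with the Trial template carrying an empty `env` list (`[]`) for the container. One way to avoid that class of failure is to leave the `env` field out whenever no variables are set. The following is a minimal sketch under that assumption, not the actual change made in this PR: `build_container_spec` is a hypothetical helper, and it works on plain dicts rather than the Kubernetes client models that the SDK may use.

```python
from typing import Any, Dict, List, Optional


def build_container_spec(
    name: str,
    image: str,
    env_per_trial: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a container spec as a plain dict.

    The `env` key is added only when there is at least one variable,
    so an unset or empty `env_per_trial` never serializes as `env: []`.
    """
    container: Dict[str, Any] = {"name": name, "image": image}

    # Convert the mapping into the name/value pairs a container expects.
    env: List[Dict[str, str]] = [
        {"name": key, "value": str(value)}
        for key, value in (env_per_trial or {}).items()
    ]

    # Omit the field entirely instead of emitting an empty list.
    if env:
        container["env"] = env
    return container
```

With this approach, calling the helper without `env_per_trial` yields a spec that has no `env` key at all, while passing a non-empty mapping yields the usual list of name/value entries.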

We should prioritise the unit test PR for the Katib SDK to help us detect SDK breakage like this: https://github.com/kubeflow/katib/pull/2325 cc @tariq-hasan

/assign @johnugeorge @tenzen-y @helenxie-bit @quloos

google-oss-prow[bot] commented 1 week ago

@andreyvelich: GitHub didn't allow me to assign the following users: helenxie-bit, quloos.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information, please see the contributor guide.

andreyvelich commented 1 week ago

@kubeflow/wg-training-leads It looks like numpy 2.0 was released yesterday: https://github.com/numpy/numpy/issues/24300.

Since torchvision installs the latest numpy version by default, I am pinning numpy==1.26.0 in our Trial images.

Otherwise, I see the following error from PyTorch:

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
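
Besides pinning the version in the image, a Trial script could fail fast with a clearer message when an incompatible NumPy is installed. The following is a minimal sketch, assuming the training code only supports NumPy 1.x; `numpy_major` and `check_numpy_compat` are hypothetical helper names, not part of Katib or NumPy.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def numpy_major(installed: Optional[str] = None) -> Optional[int]:
    """Return the major version of NumPy, or None if it is not installed.

    `installed` lets the caller pass a version string directly;
    otherwise the installed package metadata is queried.
    """
    if installed is None:
        try:
            installed = version("numpy")
        except PackageNotFoundError:
            return None
    return int(installed.split(".")[0])


def check_numpy_compat(installed: Optional[str] = None) -> None:
    """Raise early if NumPy 2.x or newer is present (assumed unsupported here)."""
    major = numpy_major(installed)
    if major is not None and major >= 2:
        raise RuntimeError(
            "NumPy >= 2.0 detected; this image expects a 1.x release "
            "such as numpy==1.26.0."
        )
```

Calling `check_numpy_compat()` at the top of the training entrypoint would surface the mismatch before any compiled extension is imported.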

google-oss-prow[bot] commented 1 week ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/kubeflow/katib/blob/master/OWNERS)~~ [tenzen-y]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

helenxie-bit commented 1 week ago

It works now! Thank you so much! @andreyvelich