kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.56k stars · 675 forks

[SDK] Add more unit tests for TrainingClient APIs #2161

Open andreyvelich opened 1 month ago

andreyvelich commented 1 month ago

We need to add more unit tests for `TrainingClient` APIs by updating [`training_client_test.py`](https://github.com/kubeflow/training-operator/blob/8909f2bfd4680ef7a90672b690b0f4df1d6fdfb9/sdk/python/kubeflow/training/api/training_client_test.py). That should help us detect bugs like this one: https://github.com/kubeflow/training-operator/pull/2160. Let's use this issue to track unit tests for the various public APIs.

Feel free to submit a PR adding unit tests for one of the following APIs; just comment on this issue with the API you are working on:

- [x] `create_job`
- [ ] `wait_for_job_conditions`
- [ ] `get_job`
- [ ] `list_jobs`
- [ ] `get_job_conditions`
- [ ] `is_job_created`
- [ ] `is_job_running`
- [ ] `is_job_restarting`
- [ ] `is_job_succeeded`
- [ ] `is_job_failed`
- [ ] `get_job_pods`
- [ ] `get_job_pod_names`
- [ ] `get_job_logs`
- [ ] `update_job`
- [ ] `delete_job`

cc @kubeflow/wg-training-leads @deepanker13 @Electronic-Waste @tariq-hasan

/good-first-issue
/area sdk
/area testing
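A unit test for one of these APIs could follow roughly this shape. This is a hedged sketch: `TrainingClientStub` stands in for the real `kubeflow.training.TrainingClient` so the example is self-contained, and the CRD parameters shown (`group`, `plural`, etc.) are illustrative assumptions.

```python
from unittest.mock import MagicMock


class TrainingClientStub:
    """Minimal stand-in for kubeflow.training.TrainingClient (illustrative only)."""

    def __init__(self, custom_api):
        self.custom_api = custom_api

    def get_job(self, name, namespace="default"):
        # The real client delegates to the Kubernetes CustomObjectsApi;
        # here that API is injected so tests can pass a MagicMock.
        return self.custom_api.get_namespaced_custom_object(
            group="kubeflow.org",
            version="v1",
            namespace=namespace,
            plural="pytorchjobs",
            name=name,
        )


def test_get_job_returns_job():
    # Mock the Kubernetes API layer instead of talking to a real cluster.
    api = MagicMock()
    api.get_namespaced_custom_object.return_value = {"metadata": {"name": "test-job"}}
    client = TrainingClientStub(api)

    job = client.get_job("test-job", namespace="test")

    assert job["metadata"]["name"] == "test-job"
    api.get_namespaced_custom_object.assert_called_once_with(
        group="kubeflow.org",
        version="v1",
        namespace="test",
        plural="pytorchjobs",
        name="test-job",
    )
```

The same pattern (mock the API object, call the client method, assert on both the return value and the mock's recorded call) applies to most of the APIs in the list above.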

google-oss-prow[bot] commented 1 month ago

@andreyvelich: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to [this](https://github.com/kubeflow/training-operator/issues/2161):

> We need to add more unit tests for `TrainingClient` APIs by [updating `training_client_test.py`](https://github.com/kubeflow/training-operator/blob/8909f2bfd4680ef7a90672b690b0f4df1d6fdfb9/sdk/python/kubeflow/training/api/training_client_test.py).
> That should help us to detect bugs like this one: https://github.com/kubeflow/training-operator/pull/2160.
> Let's use this issue to track unit tests for various public APIs.
>
> Feel free to submit PR to add unit tests for one of the following APIs, just comment on this issue on which unit tests you are working on:
>
> - [x] `create_job`
> - [ ] `wait_for_job_conditions`
> - [ ] `get_job`
> - [ ] `list_jobs`
> - [ ] `get_job_conditions`
> - [ ] `is_job_created`
> - [ ] `is_job_running`
> - [ ] `is_job_restarting`
> - [ ] `is_job_succeeded`
> - [ ] `is_job_failed`
> - [ ] `get_job_pods`
> - [ ] `get_job_pod_names`
> - [ ] `get_job_logs`
> - [ ] `update_job`
> - [ ] `delete_job`
>
> cc @kubeflow/wg-training-leads @deepanker13 @Electronic-Waste @tariq-hasan
>
> /good-first-issue
> /area sdk
> /area testing

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
ayushrakesh commented 1 month ago

Hello @andreyvelich, I am working on writing unit tests for `get_job`. Should I keep separate test data for `create_job` and `get_job`, or combine them into a single test-data list by adding an identifier to each sample saying which function (`create_job` or `get_job`) the test is for?

Example test data:

```python
(
    "create_job_valid_flow_to_create_job_using_image",
    "create_job",
    {
        "name": "test-job",
        "namespace": "test",
        "base_image": "docker.io/test-training",
        "num_workers": 2,
    },
    "success",
),
(
    "get_job_valid",
    "get_job",
    {"namespace": "test", "job_name": "valid_job"},
    {"mock": "job_response"},
),
```

andreyvelich commented 1 month ago

Thank you for your contribution @ayushrakesh!

I think it is easier to separate the test data for the `create_job` and `get_job` APIs, so it will be easier for us to maintain those lists.

WDYT @droctothorpe @deepanker13 @kubeflow/wg-training-leads ?

tenzen-y commented 1 month ago

> Thank you for your contribution @ayushrakesh!
>
> I think it is easier to separate tests data for create_job and get_job API, so it will be easier for us to maintain those lists.
>
> WDYT @droctothorpe @deepanker13 @kubeflow/wg-training-leads ?

As I understand it, these are unit tests, not integration tests. So dedicated test data for each function would be better.
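Concretely, separated per-function lists might look like this (a sketch only; the names, payloads, and expected outcomes are illustrative, not the SDK's actual fixtures):

```python
# Per-API test-data lists; each tuple is (test_name, kwargs, expected_outcome).
test_data_create_job = [
    (
        "valid_flow_to_create_job_using_image",
        {
            "name": "test-job",
            "namespace": "test",
            "base_image": "docker.io/test-training",
            "num_workers": 2,
        },
        "success",
    ),
    (
        "invalid_flow_missing_job_name",
        {"namespace": "test"},
        "raises",  # e.g. the client is expected to raise on missing name
    ),
]

test_data_get_job = [
    (
        "valid_job",
        {"namespace": "test", "job_name": "valid_job"},
        "success",
    ),
]
```

Each list can then feed its own `pytest.mark.parametrize` on the matching test function, so adding a case for one API never touches the other API's data.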

droctothorpe commented 1 month ago

In general, if you can recycle fixtures to reduce redundancy, that's usually advisable. Whether that's possible depends on the expected IO of the two functions: `create_job` doesn't return anything (though maybe it should), so I'm not sure it makes sense to consolidate the fixtures in this context.
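Recycling a fixture along those lines could be sketched as follows. The factory name and mocked API surface here are assumptions for illustration, not the SDK's actual test code:

```python
from unittest.mock import MagicMock


def make_custom_api(response=None, error=None):
    """Shared fixture factory: one mocked Kubernetes CustomObjectsApi
    that both create_job and get_job tests can reuse."""
    api = MagicMock()
    if error is not None:
        api.get_namespaced_custom_object.side_effect = error
    else:
        api.get_namespaced_custom_object.return_value = response
    return api


def test_get_job_with_shared_fixture():
    api = make_custom_api(response={"metadata": {"name": "job-a"}})
    assert api.get_namespaced_custom_object()["metadata"]["name"] == "job-a"


def test_get_job_propagates_error():
    api = make_custom_api(error=RuntimeError("boom"))
    try:
        api.get_namespaced_custom_object()
        raise AssertionError("expected RuntimeError")
    except RuntimeError:
        pass
```

The factory is shared; the per-API test data stays separate, which matches the point about `create_job` having different IO than `get_job`.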

Electronic-Waste commented 1 month ago

@andreyvelich Thank you for guiding me in contributing to Training-Operator and Katib.

I'll write some test cases for training_client_test.py as soon as I understand the overall logic and source code of the Training Operator.

andreyvelich commented 1 month ago

Thank you @Electronic-Waste. So, maybe @ayushrakesh can create the unit tests for the `get_job` API and @Electronic-Waste can work on the `wait_for_job_conditions` API.

YosiElias commented 1 month ago

I'm interested in working on the unit test for get_job_pods, any objection?

andreyvelich commented 1 month ago

Sure, thank you for your time @YosiElias!

/assign @ayushrakesh @Electronic-Waste @YosiElias

deepanker13 commented 1 month ago

Having a different list will make the code more readable. Also, should we create a folder with a test file for each function to make it more organised? @andreyvelich

andreyvelich commented 1 month ago

> Having a different list will make code more readable, Also should we create a folder and have a test file for each function to make it more organised? @andreyvelich

@deepanker13 Usually, the test file has the same name as the actual file with a `_test` suffix. E.g., you can check the KFP client example: https://github.com/kubeflow/pipelines/tree/master/sdk/python/kfp/client.

tariq-hasan commented 1 month ago

Will the number of test cases be too many to keep in a single training_client_test module?

I was thinking of moving the test data, fixtures, and utility functions out of the `katib_client_test` module, for example, into newly created `test_data`, `conftest`, and `utils` modules. That way the `katib_client_test` module may be easier to maintain.

`katib_client_test` and its associated `test_data`, `conftest`, and `utils` modules would then move to a newly created `sdk/python/v1beta1/kubeflow/katib/tests` folder.

Please let me know your thoughts.

Reference: https://github.com/kubeflow/katib/compare/master...tariq-hasan:katib:add-unit-test-for-katib-client-tune
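The proposed split might look roughly like this single-file sketch (module and helper names are hypothetical, not Katib's actual test code; in practice each section would live in its own file):

```python
from unittest.mock import MagicMock


# conftest.py -- shared fixtures, auto-discovered by pytest
def mock_katib_api():
    api = MagicMock()
    api.list_namespaced_custom_object.return_value = {"items": []}
    return api


# utils.py -- shared assertion helpers
def assert_listed_once(api):
    api.list_namespaced_custom_object.assert_called_once()


# katib_client_test.py -- imports the helpers instead of redefining them
def test_list_experiments_empty():
    api = mock_katib_api()
    assert api.list_namespaced_custom_object()["items"] == []
    assert_listed_once(api)
```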

andreyvelich commented 1 month ago

@tariq-hasan I think we discussed before with @droctothorpe that we want to keep the `_test` file alongside the actual file: https://github.com/kubeflow/katib/issues/2184#issuecomment-1660475384. I understand that this causes the file to grow, but we are already doing the same for Go unit tests. For example: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/experiment/validator/validator_test.go.

I feel that the test data should be associated with function that we test: get_job, create_job, train, etc. KFP follows the same way: https://github.com/kubeflow/pipelines/tree/master/sdk/python/kfp/client.

Any concerns you see with that @tariq-hasan ?

tariq-hasan commented 1 month ago

That sounds good. Thanks for the clarification.

YosiElias commented 1 month ago

I'm interested in working on the unit test for get_job_pod_names and update_job, any objection?

andreyvelich commented 1 month ago

Absolutely, thank you @YosiElias!

Electronic-Waste commented 3 weeks ago

@andreyvelich Unit tests for `wait_for_job_conditions` have been added now! PTAL 👀 when you are available.

Also, can I work on get_job_conditions and related functions whose names begin with is_job_?

Bobbins228 commented 3 weeks ago

Hey @andreyvelich,

I can take a look at get_job and delete_job