microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

Running NNI with Kubeflow Plugin Without "default" namespace #3596

Open rohin-dasari opened 3 years ago

rohin-dasari commented 3 years ago

Environment:

Log message:

What issue did you meet, and what did you expect?: NNI can't start a Kubeflow training service in our Kubernetes cluster unless it runs in the "default" namespace. We would like to avoid granting NNI permissions on "default" because other tasks and jobs run in our cluster, and we want to keep the NNI optimization tasks as independent of them as possible. Is it possible to specify the namespace NNI uses when it runs with the Kubeflow plugin?

SparkSnail commented 3 years ago

NNI does not yet support setting the namespace in the configuration, but you could change the NNI source code to meet your requirement and build NNI manually. Change the code here: https://github.com/microsoft/nni/blob/02eab99b9bb280e385a7af71b4c6c4a73bae1f04/ts/nni_manager/training_service/kubernetes/kubeflow/kubeflowTrainingService.ts#L373 Then build and install NNI locally: https://github.com/microsoft/nni/blob/master/docs/en_US/Tutorial/InstallationLinux.rst#build-wheel-package-from-nni-source-code
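For readers attempting this kind of patch, here is a minimal sketch of how a hard-coded namespace string could be replaced with a configurable one. It is an illustration under stated assumptions, not NNI's actual code: the helper name `resolveNamespace` and the environment variable `NNI_KUBEFLOW_NAMESPACE` are hypothetical and do not exist in NNI. The validation follows the Kubernetes rule that namespace names are DNS labels (lowercase alphanumerics and "-", at most 63 characters).

```typescript
// Hypothetical helper, NOT part of NNI: picks the namespace to use for
// Kubeflow jobs instead of a hard-coded "default".
// NNI_KUBEFLOW_NAMESPACE is an assumed variable name chosen for this sketch.

// Kubernetes namespace names must be valid DNS labels.
const DNS_LABEL = /^[a-z0-9]([-a-z0-9]*[a-z0-9])?$/;

function resolveNamespace(env: Record<string, string | undefined>): string {
    const candidate = (env["NNI_KUBEFLOW_NAMESPACE"] ?? "").trim();
    if (candidate === "") {
        // Nothing configured: keep the previous behavior.
        return "default";
    }
    if (candidate.length > 63 || !DNS_LABEL.test(candidate)) {
        throw new Error(`Invalid Kubernetes namespace: ${candidate}`);
    }
    return candidate;
}
```

In a patched build, each place that currently passes the literal "default" would call something like `resolveNamespace(process.env)` instead, so that the training service and the API client agree on the same namespace.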

rohin-dasari commented 3 years ago

Thanks for the response.

I made the change and rebuilt from source, but the error still persists. Is there another specific place where a code change is required? I looked around the nni/ts/nni_manager/training_service/kubernetes/kubeflow/ directory and found some other references to the default namespace in kubeflowApiClient.ts. I swapped those values out with our namespace, but that raised a new issue:

[2021-04-30 14:06:13] ERROR [ 'Error: 404 page not found\n\n at _request (/home/local/TECHNICALABS/rdasari/miniconda3/lib/python3.8/site-packages/nni_node/node_modules/kubernetes-client/lib/backends/request.js:189:25)\n at Request.request [as _callback] (/home/local/TECHNICALABS/rdasari/miniconda3/lib/python3.8/site-packages/nni_node/node_modules/kubernetes-client/lib/backends/request.js:148:14)\n at Request.self.callback (/home/local/TECHNICALABS/rdasari/miniconda3/lib/python3.8/site-packages/nni_node/node_modules/request/request.js:185:22)\n at Request.emit (events.js:198:13)\n at Request.<anonymous> (/home/local/TECHNICALABS/rdasari/miniconda3/lib/python3.8/site-packages/nni_node/node_modules/request/request.js:1154:10)\n at Request.emit (events.js:198:13)\n at IncomingMessage.<anonymous> (/home/local/TECHNICALABS/rdasari/miniconda3/lib/python3.8/site-packages/nni_node/node_modules/request/request.js:1076:12)\n at Object.onceWrapper (events.js:286:20)\n at IncomingMessage.emit (events.js:203:15)\n at endReadableNT (_stream_readable.js:1145:12)' ]

SparkSnail commented 3 years ago

@rohin-dasari Did you fix the issue? The "Error: 404 page not found" error seems to be caused by the Kubernetes environment. Could you double-check whether you can run a Kubeflow job successfully without NNI?
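One way to narrow down a 404 like this is to look at the REST path that a client would request. Namespaced custom resources in Kubernetes are served under `/apis/<group>/<version>/namespaces/<namespace>/<plural>`. The sketch below only builds that path; the group `kubeflow.org`, version `v1`, plural `tfjobs` and namespace `nni-trials` are assumptions for illustration and depend on which Kubeflow operator version is installed in the cluster. The helper `customResourcePath` is hypothetical and not part of NNI or kubernetes-client.

```typescript
// Hypothetical helper: builds the API path for a namespaced custom resource.
interface CustomResourceRef {
    group: string;     // API group, e.g. "kubeflow.org" (assumed)
    version: string;   // API version, e.g. "v1" (assumed; varies by operator release)
    namespace: string; // target namespace
    plural: string;    // plural resource name, e.g. "tfjobs" (assumed)
}

function customResourcePath(ref: CustomResourceRef): string {
    const parts = ["apis", ref.group, ref.version, "namespaces", ref.namespace, ref.plural];
    // Encode each segment so unexpected characters cannot alter the path.
    return "/" + parts.map((p) => encodeURIComponent(p)).join("/");
}

const path = customResourcePath({
    group: "kubeflow.org",
    version: "v1",
    namespace: "nni-trials",
    plural: "tfjobs",
});
console.log(path); // /apis/kubeflow.org/v1/namespaces/nni-trials/tfjobs
```

The resulting path can be checked directly against the cluster with `kubectl get --raw <path>`. A plain-text "404 page not found" often suggests that the API server does not serve that path at all, for example because the custom resource definition is not installed or its version differs from the one the client requests.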

rohin-dasari commented 3 years ago

I was able to confirm that the namespace I am pointing NNI to does in fact exist. I will work on getting a Kubeflow job to run successfully on the namespace and get back to you.

kvartet commented 3 years ago

@rohin-dasari Any updates? Thank you!