MHGanainy closed this issue 1 year ago
I would like to add to this. This is my config_kubeflow.yml file:
authorName: default
experimentName: example_dist
trialConcurrency: 1
trialCommand: python3 mnist.py
maxExecDuration: 1h
maxTrialNum: 1
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  codeDir: .
  worker:
    replicas: 1
    command: "python3 mnist.py"
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    image: msranni/nni:latest
kubeflowConfig:
  operator: tf-operator
  apiVersion: v1
  storage: nfs
  nfs:
    server: 10.0.0.4
    path: /mnt/nfs_share/
When I opened "trialrunner_stderr" after running the experiment, the following had been logged to it:
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.12) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/nni/tools/trial_tool/trial_runner.py", line 170, in <module>
    args.trial_command = settings["command"]
KeyError: 'command'
The error indicates that my settings.json does not have a "command" field.
This is my settings.json:
{"experimentId":"x0rpufs7","platform":"kubeflow","nniManagerIP":"10.0.0.4","nniManagerPort":8081,"nniManagerVersion":"2.10.0","logCollection":"none","enableGpuCollector":false,"commandChannel":"web"}
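The mismatch between the traceback and this file can be checked directly. The following is a minimal sketch, not part of NNI: `missing_keys` is a hypothetical helper, and the only required key it assumes is `command`, because that is the one key the traceback shows `trial_runner.py` reading. Other keys may be required as well.

```python
import json

# settings.json content copied verbatim from this thread.
SETTINGS_JSON = (
    '{"experimentId":"x0rpufs7","platform":"kubeflow","nniManagerIP":"10.0.0.4",'
    '"nniManagerPort":8081,"nniManagerVersion":"2.10.0","logCollection":"none",'
    '"enableGpuCollector":false,"commandChannel":"web"}'
)


def missing_keys(settings_text, required=("command",)):
    """Return the required keys that are absent from a settings.json payload.

    `required` is an assumption: only "command" is known from the traceback.
    """
    settings = json.loads(settings_text)
    return [key for key in required if key not in settings]


if __name__ == "__main__":
    # Lists the keys whose lookup would raise KeyError in the trial runner.
    print(missing_keys(SETTINGS_JSON))
```

Running this against the file above reports `command` as missing, which is consistent with the `KeyError: 'command'` in the log.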
@liuzhe-lz Hey, could you have a look at this issue?
@liuzhe-lz I think I found the issue. The example at "nni/examples/trials/mnist-tfv1/config_kubeflow.yml" is outdated. I worked backwards through the code base and found that the following YAML works for me:
searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialGpuNumber: 0
trialConcurrency: 1
maxTrialNumber: 1
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: kubeflow
  reuseMode: true
  worker:
    command: python3 mnist.py
    code_directory: .
    dockerImage: msranni/nni
    cpuNumber: 1
    gpuNumber: 0
    memorySize: 4096
    replicas: 1
  operator: tf-operator
  storage:
    storageType: nfs
    server: 10.10.10.10
    path: /
  apiVersion: v1
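Comparing the two configs in this thread suggests a simple way to spot the outdated format before launching an experiment. The sketch below is an illustration only: `LEGACY_TO_CURRENT`, `top_level_keys`, and `find_legacy_keys` are hypothetical names, the mapping is inferred solely from the two YAML files posted here (it is not an official NNI migration table), and the parser only looks at unindented `key:` lines rather than parsing YAML fully.

```python
# Old top-level key -> where the same setting appears in the working config above.
# Inferred from this thread only; not an exhaustive or official mapping.
LEGACY_TO_CURRENT = {
    "searchSpacePath": "searchSpaceFile",
    "maxTrialNum": "maxTrialNumber",
    "trainingServicePlatform": "trainingService.platform",
    "trial": "trainingService.worker",
    "kubeflowConfig": "trainingService (operator, storage, apiVersion)",
}


def top_level_keys(yaml_text):
    """Collect keys from unindented 'key: value' lines, skipping comments."""
    keys = []
    for line in yaml_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank line or comment
        if line[0].isspace():
            continue  # nested key, not top level
        key, sep, _ = line.partition(":")
        if sep:
            keys.append(key.strip())
    return keys


def find_legacy_keys(yaml_text):
    """Map each legacy top-level key found to its counterpart in the newer format."""
    return {
        key: LEGACY_TO_CURRENT[key]
        for key in top_level_keys(yaml_text)
        if key in LEGACY_TO_CURRENT
    }


if __name__ == "__main__":
    sample = "searchSpacePath: search_space.json\ntrial:\n  codeDir: .\n"
    for old, new in find_legacy_keys(sample).items():
        print(f"{old} -> {new}")
```

A non-empty result means the file still uses the older layout, which in this thread led to a settings.json without a `command` field.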
@MHGanainy Looks like you have resolved this issue yourself. Could you help close it as completed?
Describe the issue: After executing the following command
nnictl create --config nni/examples/trials/mnist-tfv1/config_kubeflow.yml
the experiment starts successfully. NNI sends the TFJob to Kubeflow, and Kubeflow starts a worker pod that pulls the image msranni/nni:latest, runs for a few milliseconds, and then fails. After executing the command kubectl describe pod nniexp
this is the output:
Environment:
Configuration:
Log message:
How to reproduce it?: