MNIST Kubeflow Example Starts the Worker Pod then Set Status to Error

MHGanainy commented 1 year ago

Describe the issue: After I run nnictl create --config nni/examples/trials/mnist-tfv1/config_kubeflow.yml, the experiment starts successfully and NNI submits the TFJob to Kubeflow. Kubeflow starts a worker pod that pulls the image msranni/nni:latest, runs for only a few seconds, and then fails. This is the output of kubectl describe pod nniexp:

Name:             nniexpcxi61rqmenvlinyj-worker-0
Namespace:        default
Priority:         0
Service Account:  default
Node:             kind-worker/172.18.0.3
Start Time:       Thu, 08 Dec 2022 14:03:29 +0100
Labels:           group-name=kubeflow.org
                  job-name=nniexpcxi61rqmenvlinyj
                  replica-index=0
                  replica-type=worker
                  training.kubeflow.org/job-name=nniexpcxi61rqmenvlinyj
                  training.kubeflow.org/job-role=master
                  training.kubeflow.org/operator-name=tfjob-controller
                  training.kubeflow.org/replica-index=0
                  training.kubeflow.org/replica-type=worker
Annotations:      <none>
Status:           Failed
IP:               10.244.2.68
IPs:
  IP:           10.244.2.68
Controlled By:  TFJob/nniexpcxi61rqmenvlinyj
Containers:
  tensorflow:
    Container ID:  containerd://698b72e5a0279c3e0bd1ae3291e9bb61ae1b436e0b402689d140d6e164f1ed4c
    Image:         msranni/nni:latest
    Image ID:      docker.io/msranni/nni@sha256:7047d1245d307bc7bb1b76e66889bff6fdcea1bb2728200e06dd845ef64fe2a9
    Port:          2222/TCP
    Host Port:     0/TCP
    Args:
      sh
      /tmp/mount/nni/cxi61rqm/LinYj_run.sh
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 08 Dec 2022 14:04:02 +0100
      Finished:     Thu, 08 Dec 2022 14:04:05 +0100
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  8Gi
    Requests:
      cpu:        1
      memory:     8Gi
    Environment:  <none>
    Mounts:
      /tmp/mount from nni-vol (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-d2dgc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  nni-vol:
    Type:      NFS (an NFS mount that lasts the lifetime of a pod)
    Server:    10.0.2.15
    Path:      /nfs/share
    ReadOnly:  false
  kube-api-access-d2dgc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  55s   default-scheduler  Successfully assigned default/nniexpcxi61rqmenvlinyj-worker-0 to kind-worker
  Normal  Pulling    54s   kubelet            Pulling image "msranni/nni:latest"
  Normal  Pulled     23s   kubelet            Successfully pulled image "msranni/nni:latest" in 31.242920489s
  Normal  Created    22s   kubelet            Created container tensorflow
  Normal  Started    22s   kubelet            Started container tensorflow

Environment:

NNI version: v2.10
Training service (local|remote|pai|aml|etc): kubeflow
Client OS: Ubuntu
Server OS (for remote mode only):
Python version: 3.10.8
PyTorch/TensorFlow version:
Is conda/virtualenv/venv used?: Yes
Is running in Docker?: Yes but by specifying the image in the config_kubeflow only

Configuration:

Experiment config (remember to remove secrets!):
Search space:

Log message:

nnimanager.log:
dispatcher.log:
nnictl stdout and stderr:

How to reproduce it?:

MHGanainy commented 1 year ago

I would like to add to this. This is my config_kubeflow.yml file:

authorName: default
experimentName: example_dist
trialConcurrency: 1
trialCommand: python3 mnist.py
maxExecDuration: 1h
maxTrialNum: 1
#choice: local, remote, pai, kubeflow
trainingServicePlatform: kubeflow
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  codeDir: .
  worker:
    replicas: 1
    command: "python3 mnist.py"
    gpuNum: 0
    cpuNum: 1
    memoryMB: 8192
    image: msranni/nni:latest
kubeflowConfig:
  operator: tf-operator
  apiVersion: v1
  storage: nfs
  nfs:
    server: 10.0.0.4
    path: /mnt/nfs_share/

When I opened "trialrunner_stderr" after running the experiment, it contained the following:

/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.12) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/nni/tools/trial_tool/trial_runner.py", line 170, in <module>
    args.trial_command = settings["command"]
KeyError: 'command'

The error indicates that my settings.json doesn't have a command field.

This is my settings.json:

{"experimentId":"x0rpufs7","platform":"kubeflow","nniManagerIP":"10.0.0.4","nniManagerPort":8081,"nniManagerVersion":"2.10.0","logCollection":"none","enableGpuCollector":false,"commandChannel":"web"}
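To make the failure easier to see, here is a minimal sketch that loads the settings.json posted above and checks it for the key the traceback complains about. The helper name missing_keys is hypothetical and not part of NNI; the only thing taken from the traceback is that trial_runner.py reads settings["command"].

```python
import json

# settings.json exactly as posted above.
SETTINGS_JSON = (
    '{"experimentId":"x0rpufs7","platform":"kubeflow","nniManagerIP":"10.0.0.4",'
    '"nniManagerPort":8081,"nniManagerVersion":"2.10.0","logCollection":"none",'
    '"enableGpuCollector":false,"commandChannel":"web"}'
)


def missing_keys(settings, required=("command",)):
    """Return the required keys that are absent from the settings dict.

    Hypothetical helper for illustration; "command" is the only key the
    traceback shows trial_runner.py reading.
    """
    return [key for key in required if key not in settings]


settings = json.loads(SETTINGS_JSON)
print("missing keys:", missing_keys(settings))

# Same lookup as the failing line in the traceback.
try:
    settings["command"]
except KeyError as err:
    print("KeyError:", err)
```

Since the file has no command entry, the lookup fails in the same way as in the trial runner, which suggests the problem is in how the settings were generated rather than in mnist.py itself.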

MHGanainy commented 1 year ago

@liuzhe-lz Hey, could you have a look at this issue?

MHGanainy commented 1 year ago

@liuzhe-lz I think I found the issue. The example at "nni/examples/trials/mnist-tfv1/config_kubeflow.yml" is outdated. I worked backwards through the code base and found that the following YAML works for me:

searchSpaceFile: search_space.json
trialCommand: python3 mnist.py
trialGpuNumber: 0
trialConcurrency: 1
maxTrialNumber: 1
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
    platform: kubeflow
    reuseMode: true
    worker:
      command: python3 mnist.py
      code_directory: .
      dockerImage: msranni/nni
      cpuNumber: 1
      gpuNumber: 0
      memorySize: 4096
      replicas: 1
    operator: tf-operator
    storage:
      storageType: nfs
      server: 10.10.10.10
      path: /
    apiVersion: v1
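For anyone hitting the same problem, a rough way to spot the old format is to look for top-level keys that appear only in the older config. The sketch below works on an already parsed config (a plain dict) so it needs no YAML library. The mapping is inferred only from comparing the two files in this thread, so treat it as an assumption and not as NNI's official migration table; find_legacy_keys is a hypothetical helper.

```python
# Top-level keys from the older config in this thread, mapped to the closest
# counterpart in the config that worked. Inferred from this thread only.
LEGACY_TO_NEW = {
    "searchSpacePath": "searchSpaceFile",
    "maxTrialNum": "maxTrialNumber",
    "trainingServicePlatform": "trainingService.platform",
    "kubeflowConfig": "trainingService",
    "trial": "trainingService.worker",
}


def find_legacy_keys(config):
    """Return {legacy_key: suggested_new_key} for legacy keys found in config."""
    return {key: LEGACY_TO_NEW[key] for key in config if key in LEGACY_TO_NEW}


# Abbreviated versions of the two configs posted in this thread.
old_config = {
    "trialConcurrency": 1,
    "maxTrialNum": 1,
    "trainingServicePlatform": "kubeflow",
    "searchSpacePath": "search_space.json",
    "trial": {"codeDir": "."},
    "kubeflowConfig": {"operator": "tf-operator"},
}
new_config = {
    "searchSpaceFile": "search_space.json",
    "trialConcurrency": 1,
    "maxTrialNumber": 1,
    "trainingService": {"platform": "kubeflow"},
}

for legacy, new in find_legacy_keys(old_config).items():
    print(f"{legacy} -> {new}")
print("legacy keys in working config:", find_legacy_keys(new_config))
```

The check only looks at top-level keys; nested differences such as tuner.builtinTunerName versus tuner.name would need a separate comparison.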

Lijiaoa commented 1 year ago

@MHGanainy It looks like you have resolved this issue yourself. Could you please close it as completed?

microsoft / nni

MNIST Kubeflow Example Starts the Worker Pod then Set Status to Error #5274