microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License
2.64k stars 548 forks

extend prerequisite field in job protocol #5145

Open hzy46 opened 3 years ago

hzy46 commented 3 years ago

Motivation

The OpenPAI protocol lets users specify prerequisites (e.g. dockerimage, data, and script) and then reference them in a taskrole. The current version has some limitations.

```yaml
prerequisites:
  - name: covid_data
    type: data
    uri:
      - https://x.x.x/yyy.zip # data uri
  - name: default_image
    type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
taskRoles:
  taskrole:
    dockerImage: default_image
    data: covid_data
    commands:
      - mkdir -p /data/covid19/data/
      - cd /data/covid19/data/
      - 'wget <% $data.uri[0] %>'
      - export DATA_DIR=/data/covid19/data/
```

Goal

Proposal

  1. Support callbacks in prerequisites.
  2. Allow a taskrole to reference a list of prerequisites.
  3. Runtime plugin implementation.

Examples

```yaml
taskRoles:
  taskrole:
    dockerImage: default_image
    prerequisites:
```

Full Spec:

```yaml
prerequisites:
  - name: string # required, a unique name to find the prerequisite (from local or marketplace)
    type: "dockerimage | script | data | output" # for survey purposes; except for dockerimage, not used by the backend
    plugin: string # optional, the executor that handles this prerequisite; default is com.microsoft.pai.runtimeplugin.cmd (or docker for dockerimage)
    require: [] # optional, other prerequisites on which the current one depends
    callbacks: # optional, commands to run on events
      - event: "containerStart | containerExit"
        commands: # commands translated by the plugin
          - string # shell commands for com.microsoft.pai.runtimeplugin.cmd
          - string # TODO: other commands (e.g. python) for other plugins
    failurePolicy: "ignore | fail" # optional, same default as runtime plugins
    # plugin-specific properties
    uri: string | array # optional, kept for backward compatibility (it was required before)
    key1: value1 # referenced by <% this.parameters.key1 %>
    key2: value2 # TODO: inheritable from required prerequisites

taskRoles:
  taskrole:
    prerequisites: # optional, requirements will be automatically parsed and inserted
      - prerequisite-1 # on containerStart, executed in order
      - prerequisite-2 # on containerExit, executed in reverse order
```

Each prerequisite will be handled roughly like this:

```python
for prerequisite in prerequisites:
    plugin(**prerequisite)
```
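
Fleshed out slightly, the dispatch could look like the following minimal sketch (illustrative only; the plugin registry and handler signature are assumptions, not part of the spec):

```python
# A minimal sketch of the dispatch loop above, not the actual OpenPAI
# implementation; the registry layout and handler signature are assumptions.
DEFAULT_PLUGIN = "com.microsoft.pai.runtimeplugin.cmd"

def run_callbacks(prerequisites, event, plugins):
    """Run every callback registered for `event` on each prerequisite.

    containerStart callbacks run in declaration order; containerExit
    callbacks run in reverse order, as proposed above.
    """
    ordered = prerequisites if event == "containerStart" else list(reversed(prerequisites))
    for pre in ordered:
        handler = plugins[pre.get("plugin", DEFAULT_PLUGIN)]
        for cb in pre.get("callbacks", []):
            if cb["event"] == event:
                handler(cb["commands"], pre)  # the plugin translates the commands
```
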
hzy46 commented 3 years ago

Update on this issue:

  1. Will sync with @mydmdm to determine the detailed schema. This will be a P1 item for the v1.5.0 release.
  2. In the OpenPAI runtime, handle prerequisites as follows: (1) use the existing mechanism to inject commands into preCommands and postCommands (see the sketch after this list); (2) don't expose an explicit plugin definition in the user's job protocol; (3) make sure parameters and secrets work in prerequisites.
  3. We can add a retry policy and a failure policy; this can be left to future work.
  4. Sync with @Binyang2014 about support for cluster data.
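
For point 2, a rough sketch of the injection, assuming start-event callbacks map to preCommands and exit/success-event callbacks to postCommands (function and variable names are illustrative):

```python
# Illustrative only: flattening prerequisite callbacks into the existing
# preCommands/postCommands injection mechanism mentioned in point 2.
def to_pre_post_commands(prerequisites):
    """Collect start callbacks in declaration order and exit callbacks in
    reverse order, matching the ordering in the expected runtime.log below."""
    pre_cmds, post_cmds = [], []
    for pre in prerequisites:
        for cb in pre.get("callbacks", []):
            if cb["event"] in ("containerStart", "taskStarts"):
                pre_cmds.extend(cb["commands"])
    for pre in reversed(prerequisites):
        for cb in pre.get("callbacks", []):
            if cb["event"] in ("containerExit", "taskSucceeds"):
                post_cmds.extend(cb["commands"])
    return pre_cmds, post_cmds
```
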
Binyang2014 commented 3 years ago

Questions about this:

  1. Can all runtime plugins be merged into prerequisites? If so, we could deprecate the runtime extras field and make prerequisites the official way.
  2. Maybe we can treat a prerequisite as a high-level widget. It can use any implementation it wants, such as a runtime plugin, command injection, or communication with other k8s/PAI services. There is also another proposal for service plugins: https://github.com/microsoft/pai/issues/4254. Is it related?
hzy46 commented 3 years ago

> Questions about this:
>
> 1. Can all runtime plugins be merged into prerequisites? If so, we could deprecate the runtime extras field and make prerequisites the official way.
> 2. Maybe we can treat a prerequisite as a high-level widget. It can use any implementation it wants, such as a runtime plugin, command injection, or communication with other k8s/PAI services. There is also another proposal for service plugins: #4254. Is it related?

I think they are used for different scenarios. A prerequisite is a requirement of a job: without it, the job usually fails; it should also be sharable among users. A runtime plugin extends the job protocol's functionality: it can be nice-to-have (not necessary) and can be a personal config (not sharable). There is some overlap, though; maybe we can move some officially supported runtime plugins into prerequisites.

mydmdm commented 3 years ago

I have updated the full spec in the main body. Here are some examples:

```yaml
prerequisites:
  - name: install-pai-copy
    type: script # indicates the purpose; not used by the backend, only for statistical analysis (except dockerimage)
    plugin: com.microsoft.pai.runtimeplugin.cmd # default plugin if not specified
    callbacks:
      - event: containerStart
        commands:
          - xxx # placeholder: commands to set up nodejs
          - npm install -g @swordfaith/pai_copy
    failurePolicy: ignore # or: fail
  - name: covid_data
    type: data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: containerStart
        commands:
          - mkdir -p /data/covid19/data/
          - cd /data/covid19/data/
          - 'wget https://x.x.x/yyy.zip'
          - export DATA_DIR=/data/covid19/data/
  - name: nfs-storage-1
    type: storage # indicates the purpose; not used by the backend, only for statistical analysis
    plugin: com.microsoft.pai.rest.storage # handled by the REST server
    config: nfsconfig # special arguments for the storage plugin only
    mountPoint: /mnt/nfs-storage-1
  - name: mnist-data
    type: data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    require:
      - nfs-storage-1 # also inherits parameters like mountPoint
    callbacks:
      - event: containerStart
        commands:
          - export MNIST_DIR=<% this.mountPoint %>/mnist
  - name: output-dir
    type: output
    plugin: com.microsoft.pai.runtimeplugin.cmd
    require:
      - nfs-storage-1
    callbacks:
      - event: containerStart
        commands:
          - export OUTPUT_DIR=/tmp/output
      - event: containerExit
        commands:
          - 'if [ -z "${OUTPUT_DIR+x}" ]; then'
          - echo "OUTPUT_DIR is not set"
          - else
          - pai_copy upload paiuploadtest //
          - fi
  - name: enable-ssh
    type: script
    plugin: com.microsoft.pai.runtimeplugin.ssh
    jobssh: true
    publicKeys: # optional; if not specified, only the public keys in user.extensions.sshKeys will be added
      - ... # public keys

taskRoles:
  taskrole:
    dockerImage: default_image
    prerequisites:
      - mnist-data # required prerequisites will be automatically parsed and added by the backend
      - output-dir
```
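
The mnist-data and output-dir entries rely on require to pull in nfs-storage-1 and inherit its parameters (e.g. mountPoint). A minimal sketch of how that expansion could work, assuming a simple name-to-definition catalog (all names here are hypothetical):

```python
# Hypothetical sketch of `require` expansion: dependencies are scheduled
# before their dependents, and plugin-specific properties (e.g. mountPoint)
# are inherited, as in the mnist-data example above.
RESERVED = {"name", "type", "plugin", "require", "callbacks", "failurePolicy"}

def expand_requires(name, catalog, seen=None):
    """Return prerequisites in execution order, dependencies first.

    `catalog` maps prerequisite names to their definitions.
    """
    seen = set() if seen is None else seen
    if name in seen:
        return []  # already scheduled; also guards against cycles
    seen.add(name)
    pre = catalog[name]
    ordered = []
    for dep in pre.get("require", []):
        ordered += expand_requires(dep, catalog, seen)
        # inherit plugin-specific properties from the dependency
        for key, value in catalog[dep].items():
            if key not in RESERVED:
                pre.setdefault(key, value)
    return ordered + [pre]
```
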
hzy46 commented 3 years ago

(TBD) Test Cases for v1.5.0 release

  1. Test cmd prerequisites:

```yaml
protocolVersion: 2
name: pre1
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
          - echo 222
      - event: taskSucceeds
        commands:
          - echo 333
          - echo 444
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```

Expected runtime.log:

```
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] [INFO] Starting to exec precommands
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] 111
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] 222
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] [package_cache] Skip installation of group ssh.
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] [INFO] start ssh service
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] * Restarting OpenBSD Secure Shell server sshd
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] ...done.
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] [INFO] Precommands finished
[Mon Feb  1 02:54:48 UTC 2021] [openpai-runtime] [INFO] USER COMMAND START
[Mon Feb  1 02:54:58 UTC 2021] [openpai-runtime] [INFO] USER COMMAND END
[Mon Feb  1 02:54:58 UTC 2021] [openpai-runtime] 333
[Mon Feb  1 02:54:58 UTC 2021] [openpai-runtime] 444
```

  2. Test multiple prerequisites:
```yaml
protocolVersion: 2
name: pre2
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho_first
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
      - event: taskSucceeds
        commands:
          - echo 222
  - type: script
    name: justecho_later
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo aaa
      - event: taskSucceeds
        commands:
          - echo bbb
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho_first
      - justecho_later
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true

```

Expected runtime.log:

```
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] [INFO] Starting to exec precommands
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] 111
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] aaa
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] [package_cache] Skip installation of group ssh.
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] [INFO] start ssh service
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] * Restarting OpenBSD Secure Shell server sshd
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] ...done.
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] [INFO] Precommands finished
[Mon Feb  1 02:55:16 UTC 2021] [openpai-runtime] [INFO] USER COMMAND START
[Mon Feb  1 02:55:27 UTC 2021] [openpai-runtime] [INFO] USER COMMAND END
[Mon Feb  1 02:55:27 UTC 2021] [openpai-runtime] bbb
[Mon Feb  1 02:55:27 UTC 2021] [openpai-runtime] 222
```

  3. Test wrong config (an error is expected):

```yaml
protocolVersion: 2
name: pre3
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
          - echo 222
      - event: taskSucceeds
        commands:
          - echo 333
          - echo 444
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho_wrong
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 0
      cpu: 1
      memoryMB: 9672
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
  4. Test backward compatibility.

This job should work:

```yaml
protocolVersion: 2
name: covid-chestxray-dataset_88170423
description: >
  COVID-19 chest X-ray image data collection

  It is to build a public open dataset of chest X-ray and CT images of patients
  which are positive or suspected of COVID-19 or other viral and bacterial
  pneumonias
  ([MERS](https://en.wikipedia.org/wiki/Middle_East_respiratory_syndrome),
  [SARS](https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome), and
  [ARDS](https://en.wikipedia.org/wiki/Acute_respiratory_distress_syndrome)).
contributor: OpenPAI
type: job
jobRetryCount: 0
prerequisites:
  - name: covid-chestxray-dataset
    type: data
    uri:
      - 'https://github.com/ieee8023/covid-chestxray-dataset.git'
  - name: default_image
    type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.4.0-gpu'
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: default_image
    data: covid-chestxray-dataset
    resourcePerInstance:
      cpu: 3
      memoryMB: 29065
      gpu: 1
    commands:
      - 'git clone <% $data.uri[0] %>'
defaults:
  virtualCluster: default
```

  5. Test data prerequisite:

```yaml
protocolVersion: 2
name: pre1_f7a15a5c
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: install-git
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - apt update
          - apt install -y git
  - type: data
    name: covid-19-data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - mkdir -p /dataset/covid-19
          - >-
            git clone https://github.com/ieee8023/covid-chestxray-dataset.git
            /dataset/covid-19
  - type: dockerimage
    uri: 'ubuntu:18.04'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - install-git
      - covid-19-data
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - ls -la /dataset/covid-19
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true

```

Expected: the data is successfully listed.

  6. Test parameters and secrets:

```yaml
protocolVersion: 2
name: pre_secret_parameters
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo <% $parameters.x %>
          - echo <% $secrets.y %>
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
parameters:
  x: '111'
taskRoles:
  taskrole:
    prerequisites: [justecho]
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
secrets:
  'y': '222'
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true

```

Expected: 111 and 222 in runtime.log.

hzy46 commented 3 years ago

After discussion, the interaction between prerequisites and the marketplace could be:

In a taskrole, prerequisites referenced from the marketplace are used directly; there is no need to include them in the job-level prerequisites.

Use marketplace://data/xxx and marketplace://script/xxx to indicate data and scripts:

```yaml
taskRoles:
  taskrole:
    prerequisites: ["marketplace://data/mnist"]
```

In the job protocol, prerequisites can use require to indicate required items. The required items can come from the job protocol or from the marketplace.

```yaml
prerequisites:
  - type: script
    name: copy_data
    require: ["marketplace://script/pai_copy"]
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: containerStarts
        commands:
          - pai_copy data
```

The REST server reads all entries in a taskrole's prerequisites and converts them to their real definitions by calling the marketplace's API. The resolved definitions can be treated as job add-ons and saved in the database.
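
A minimal sketch of that resolution step, assuming a hypothetical marketplace client (the real REST server and marketplace APIs may differ):

```python
# Hypothetical sketch of the resolution step described above; the
# marketplace client API shown here is an assumption, not a real interface.
MARKETPLACE_PREFIX = "marketplace://"

def resolve_prerequisite(ref, job_prerequisites, marketplace_client):
    """Resolve a taskrole prerequisite reference to its full definition."""
    if ref.startswith(MARKETPLACE_PREFIX):
        # e.g. "marketplace://data/mnist" -> type "data", name "mnist"
        item_type, item_name = ref[len(MARKETPLACE_PREFIX):].split("/", 1)
        return marketplace_client.get_item(item_type, item_name)
    # otherwise the name refers to a job-level prerequisite
    return next(p for p in job_prerequisites if p["name"] == ref)
```
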