Open hzy46 opened 3 years ago
Update of this issue:
In the v1.5.0 release, `prerequisites` will: (1) use the existing mechanism to inject commands into preCommands and postCommands; (2) not show an explicit plugin definition in the user's job protocol; (3) make sure parameters and secrets work in `prerequisites`.
Question about this:
- Can all runtime plugins merge into `prerequisites`? If so, we could deprecate the runtime `extras` field and make `prerequisites` the official way.
- Maybe I can treat `prerequisites` as a high-level widget. It can use any implementation to achieve its purpose, such as a runtime plugin, command injection, or communication with other k8s/PAI services. There is also another proposal for service plugins: #4254. Is it related?
I think they are used for different scenarios. A prerequisite is a requirement for a job: without it, the job usually fails, and a prerequisite should be sharable among users. A runtime plugin is used to extend the job protocol's functions: it can be nice-to-have (not necessary) and can be a personal config (not sharable). There is some overlap, though; maybe we can move some officially-supported runtime plugins into `prerequisites`.
I have updated the full spec in the main body, and here are some examples:
```yaml
prerequisites:
  - name: install-pai-copy
    type: script # indicates the purpose; used only for statistical analysis, not by the backend (except dockerimage)
    plugin: com.microsoft.pai.runtimeplugin.cmd # default plugin if not specified
    callbacks:
      - event: containerStart
        commands:
          - xxx # commands to set up nodejs
          - npm install -g @swordfaith/pai_copy
    failurePolicy: ignore/fail
  - name: covid_data
    type: data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: containerStart
        commands:
          - mkdir -p /data/covid19/data/
          - cd /data/covid19/data/
          - 'wget https://x.x.x/yyy.zip'
          - export DATA_DIR=/data/covid19/data/
  - name: nfs-storage-1
    type: storage # indicates the purpose; used only for statistical analysis, not by the backend
    plugin: com.microsoft.pai.rest.storage # handled by the REST server
    config: nfsconfig # special arguments for the storage plugin only
    mountPoint: /mnt/nfs-storage-1
  - name: mnist-data
    type: data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    require:
      - nfs-storage-1 # also inherits parameters like mountPoint
    callbacks:
      - event: containerStart
        commands:
          - export MNIST_DIR=<% this.mountPoint %>/mnist
  - name: output-dir
    type: output
    plugin: com.microsoft.pai.runtimeplugin.cmd
    require:
      - nfs-storage-1
    callbacks:
      - event: containerStart
        commands:
          - export OUTPUT_DIR=/tmp/output
      - event: containerExit
        commands:
          - 'if [ -z "${OUTPUT_DIR+x}" ]; then'
          - echo "OUTPUT_DIR environment variable not found"
          - else
          - pai_copy upload paiuploadtest //
          - fi
  - name: enable-ssh
    type: script
    plugin: com.microsoft.pai.runtimeplugin.ssh
    jobssh: true
    publicKeys: # optional; if not specified, only public keys in user.extensions.sshKeys will be added
      - ... # public keys
taskRoles:
  taskrole:
    dockerImage: default_image
    prerequisites:
      - mnist-data # required items will be automatically parsed and added by the backend
      - output-dir
```
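The `require` field on `mnist-data` inherits parameters such as `mountPoint` from `nfs-storage-1`, which the `<% this.mountPoint %>` template then references. A minimal Python sketch of how a runtime could perform this resolution, assuming simple dict merging and regex substitution (not the actual OpenPAI implementation):

```python
import re

def resolve_prerequisite(prereq, all_prereqs):
    """Merge fields inherited via `require`, then render <% this.xxx %> templates."""
    by_name = {p["name"]: p for p in all_prereqs}
    merged = {}
    for req in prereq.get("require", []):
        # inherit parameters such as mountPoint from the required item
        merged.update({k: v for k, v in by_name[req].items()
                       if k not in ("name", "type", "plugin", "callbacks")})
    merged.update(prereq)  # the prerequisite's own fields win
    def render(cmd):
        return re.sub(r"<%\s*this\.(\w+)\s*%>",
                      lambda m: str(merged[m.group(1)]), cmd)
    # flatten all callback commands with templates rendered
    return [render(c)
            for cb in merged.get("callbacks", [])
            for c in cb["commands"]]

prereqs = [
    {"name": "nfs-storage-1", "type": "storage", "mountPoint": "/mnt/nfs-storage-1"},
    {"name": "mnist-data", "type": "data", "require": ["nfs-storage-1"],
     "callbacks": [{"event": "containerStart",
                    "commands": ["export MNIST_DIR=<% this.mountPoint %>/mnist"]}]},
]
print(resolve_prerequisite(prereqs[1], prereqs))
# → ['export MNIST_DIR=/mnt/nfs-storage-1/mnist']
```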
(TBD) Test Cases for v1.5.0 release
1. test a single prerequisite
```yaml
protocolVersion: 2
name: pre1
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
          - echo 222
      - event: taskSucceeds
        commands:
          - echo 333
          - echo 444
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
Expected `runtime.log`:
```
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] [INFO] Starting to exec precommands
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] 111
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] 222
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] [package_cache] Skip installation of group ssh.
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] [INFO] start ssh service
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] * Restarting OpenBSD Secure Shell server sshd
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] ...done.
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] [INFO] Precommands finished
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] [INFO] USER COMMAND START
[Mon Feb 1 02:54:58 UTC 2021] [openpai-runtime] [INFO] USER COMMAND END
[Mon Feb 1 02:54:58 UTC 2021] [openpai-runtime] 333
[Mon Feb 1 02:54:58 UTC 2021] [openpai-runtime] 444
```
2. test multiple prerequisites
```yaml
protocolVersion: 2
name: pre2
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho_first
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
      - event: taskSucceeds
        commands:
          - echo 222
  - type: script
    name: justecho_later
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo aaa
      - event: taskSucceeds
        commands:
          - echo bbb
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho_first
      - justecho_later
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
Expected `runtime.log`:
```
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] [INFO] Starting to exec precommands
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] 111
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] aaa
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] [package_cache] Skip installation of group ssh.
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] [INFO] start ssh service
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] * Restarting OpenBSD Secure Shell server sshd
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] ...done.
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] [INFO] Precommands finished
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] [INFO] USER COMMAND START
[Mon Feb 1 02:55:27 UTC 2021] [openpai-runtime] [INFO] USER COMMAND END
[Mon Feb 1 02:55:27 UTC 2021] [openpai-runtime] bbb
[Mon Feb 1 02:55:27 UTC 2021] [openpai-runtime] 222
```
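The two logs above suggest an ordering rule: taskStarts commands run in declaration order before the user command, while taskSucceeds commands run afterwards with the prerequisites processed in reverse order (111, aaa ... bbb, 222), though commands within one prerequisite keep their order (333 before 444 in test 1). A Python sketch of that assumed assembly, inferred from the expected logs rather than from the OpenPAI runtime source:

```python
def build_command_sequence(prereqs, user_commands):
    """Assemble pre/user/post commands from prerequisite callbacks."""
    pre, post_groups = [], []
    for p in prereqs:
        for cb in p.get("callbacks", []):
            if cb["event"] == "taskStarts":
                pre.extend(cb["commands"])
            elif cb["event"] == "taskSucceeds":
                post_groups.append(cb["commands"])
    # later prerequisites are "torn down" first, but each group keeps its order
    post = [c for group in reversed(post_groups) for c in group]
    return pre + user_commands + post

prereqs = [
    {"name": "justecho_first",
     "callbacks": [{"event": "taskStarts", "commands": ["echo 111"]},
                   {"event": "taskSucceeds", "commands": ["echo 222"]}]},
    {"name": "justecho_later",
     "callbacks": [{"event": "taskStarts", "commands": ["echo aaa"]},
                   {"event": "taskSucceeds", "commands": ["echo bbb"]}]},
]
print(build_command_sequence(prereqs, ["sleep 0s"]))
# → ['echo 111', 'echo aaa', 'sleep 0s', 'echo bbb', 'echo 222']
```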
3. test wrong config
An error is expected:
```yaml
protocolVersion: 2
name: pre3
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
          - echo 222
      - event: taskSucceeds
        commands:
          - echo 333
          - echo 444
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho_wrong
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 0
      cpu: 1
      memoryMB: 9672
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
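The taskrole here references `justecho_wrong`, which is not defined in the job-level prerequisites. A sketch of the name-reference validation this test presumably exercises; `validate_prerequisite_refs` is a hypothetical helper based on the expected error, not actual rest-server code:

```python
def validate_prerequisite_refs(job):
    """Check that every taskrole prerequisite name is defined at the job level."""
    defined = {p["name"] for p in job.get("prerequisites", [])}
    errors = []
    for role, spec in job.get("taskRoles", {}).items():
        for name in spec.get("prerequisites", []):
            if name not in defined:
                errors.append(f"taskRole '{role}': undefined prerequisite '{name}'")
    return errors

job = {
    "prerequisites": [{"name": "justecho"}, {"name": "docker_image_0"}],
    "taskRoles": {"taskrole": {"prerequisites": ["justecho_wrong"]}},
}
print(validate_prerequisite_refs(job))
# → ["taskRole 'taskrole': undefined prerequisite 'justecho_wrong'"]
```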
4. test backward-compatibility
This job should work:
```yaml
protocolVersion: 2
name: covid-chestxray-dataset_88170423
description: >
  COVID-19 chest X-ray image data collection
  It is to build a public open dataset of chest X-ray and CT images of patients
  which are positive or suspected of COVID-19 or other viral and bacterial
  pneumonias
  ([MERS](https://en.wikipedia.org/wiki/Middle_East_respiratory_syndrome),
  [SARS](https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome), and
  [ARDS](https://en.wikipedia.org/wiki/Acute_respiratory_distress_syndrome)).
contributor: OpenPAI
type: job
jobRetryCount: 0
prerequisites:
  - name: covid-chestxray-dataset
    type: data
    uri:
      - 'https://github.com/ieee8023/covid-chestxray-dataset.git'
  - name: default_image
    type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.4.0-gpu'
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: default_image
    data: covid-chestxray-dataset
    resourcePerInstance:
      cpu: 3
      memoryMB: 29065
      gpu: 1
    commands:
      - 'git clone <% $data.uri[0] %>'
defaults:
  virtualCluster: default
```
This job, which prepares data via prerequisites, should also work:
```yaml
protocolVersion: 2
name: pre1_f7a15a5c
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: install-git
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - apt update
          - apt install -y git
  - type: data
    name: covid-19-data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - mkdir -p /dataset/covid-19
          - >-
            git clone https://github.com/ieee8023/covid-chestxray-dataset.git
            /dataset/covid-19
  - type: dockerimage
    uri: 'ubuntu:18.04'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - install-git
      - covid-19-data
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - ls -la /dataset/covid-19
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
Expected: the data is successfully listed.
5. test parameters and secrets in prerequisites
```yaml
protocolVersion: 2
name: pre_secret_parameters
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo <% $parameters.x %>
          - echo <% $secrets.y %>
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
parameters:
  x: '111'
taskRoles:
  taskrole:
    prerequisites: [justecho]
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
secrets:
  'y': '222'
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
Expected: 111 and 222 appear in `runtime.log`.
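The `<% $parameters.x %>` and `<% $secrets.y %>` references are substituted before the prerequisite commands run. A minimal sketch of such template substitution; this only mimics the `<% ... %>` syntax and is not the protocol's actual renderer:

```python
import re

def render(cmd, parameters, secrets):
    """Replace <% $parameters.k %> / <% $secrets.k %> references in one command."""
    ctx = {"parameters": parameters, "secrets": secrets}
    return re.sub(r"<%\s*\$(\w+)\.(\w+)\s*%>",
                  lambda m: str(ctx[m.group(1)][m.group(2)]), cmd)

print(render("echo <% $parameters.x %>", {"x": "111"}, {"y": "222"}))  # echo 111
print(render("echo <% $secrets.y %>", {"x": "111"}, {"y": "222"}))     # echo 222
```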
After discussion, the interaction between prerequisites and the marketplace could be:
In a taskrole, the prerequisites referenced from the marketplace are defined directly; there is no need to include them in the job-level prerequisites. Use `marketplace://data/xxx` and `marketplace://script/xxx` to indicate data and script:
```yaml
taskRoles:
  taskrole:
    prerequisites: ["marketplace://data/mnist"]
```
In the job protocol, prerequisites can use `require` to indicate required items. The required items can come from the job protocol or from the marketplace.
```yaml
prerequisites:
  - type: script
    name: copy_data
    require: ["marketplace://script/pai_copy"]
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: containerStarts
        commands:
          - pai_copy data
```
The REST server reads all items in each taskrole's `prerequisites` and converts them to their real definitions by calling the marketplace's API. These can be treated as job add-ons and saved in the database.
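A sketch of that resolution step; `fetch_from_marketplace` is a hypothetical stand-in for the real marketplace API call:

```python
def resolve_prerequisites(refs, fetch_from_marketplace):
    """Turn marketplace:// references into full definitions; pass others through."""
    resolved = []
    for ref in refs:
        if ref.startswith("marketplace://"):
            kind, name = ref[len("marketplace://"):].split("/", 1)
            resolved.append(fetch_from_marketplace(kind, name))
        else:
            resolved.append(ref)  # a name defined in the job protocol itself
    return resolved

def fake_fetch(kind, name):
    # stands in for the marketplace API; URL is illustrative only
    return {"type": kind, "name": name,
            "uri": f"https://marketplace.example/{kind}/{name}"}

print(resolve_prerequisites(["marketplace://data/mnist", "justecho"], fake_fetch))
```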
Motivation
The [OpenPAI protocol]() supports users to specify `prerequisites` (e.g. `dockerimage`, `data`, and `script`) and then reference them in a `taskrole`. There are some limitations in the current version: a prerequisite only contains a static (e.g. `uri`) definition. This is enough for the most frequently used `dockerimage`, because Docker plays the role of the corresponding runtime executor. However, it is too limited for other types. For example, commands have to be injected into every taskrole to make the data ready in the job config below: the `wget` is an action on the data, but it cannot be placed together with the data definition, so the `wget` commands must be injected everywhere.
Goal
Make `prerequisites` well organized and object-oriented. Besides defining parameters, a prerequisite also supports real functions (callbacks on specific events).
Proposal
Examples
```yaml
taskRoles:
  taskrole:
    dockerImage: default_image
    prerequisites:
```
Full Spec:
Each of the `prerequisites` will be handled in a way like