
[User Story] Dataset: integrate data prerequisite into marketplace and job submission page #5345

Open · hzy46 opened this issue 3 years ago

hzy46 commented 3 years ago

Motivation

#5145 has extended the prerequisite field, but users can currently only use and share prerequisites in the job YAML. We can add UI support for prerequisites, especially for data prerequisites. This issue explains how users create and use a data prerequisite in the cluster. With this feature, cluster users can easily share datasets with each other, and it may also benefit future features such as dataset caching and optimization.

Explanation

How do users create a dataset in the cluster?

Dataset item that doesn't need PVC storage

The user should create a dataset item in the marketplace. A dataset item consists of a prerequisite spec plus other miscellaneous info (e.g. title, usage) in the marketplace.

If the dataset is just downloaded from the Internet, it should have the following spec:

name: mnist
type: data
plugin: com.microsoft.pai.runtimeplugin.cmd
callbacks:
  - event: taskStarts
    commands:
      - wget "<.....>" -O /dataset/mnist/<...>

Dataset item that needs PVC storage

If the dataset is already saved in a PVC, it should have the following spec:

name: imagenet
type: data
plugin: com.microsoft.pai.runtimeplugin.cmd
requireStorages:
- name: confignfs
  mountPath: /mnt/confignfs
callbacks:
  - event: taskStarts
    commands:
      - mkdir -p /dataset
      - ln -s "/mnt/confignfs/users/mine/presaved-imagenet" "/dataset/imagenet"

Here we define a new field: requireStorages. It shares the same spec as the current storage implementation. If this prerequisite is included in a job, we should merge the storages required here with the other PVC storages in the job.
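
As a rough illustration of that merge, here is a minimal Python sketch (the function and field names are hypothetical, not the actual rest-server code): it collects requireStorages from every selected prerequisite, deduplicates by storage name, and rejects conflicting mount paths.

def merge_storages(job_storages, prerequisites):
    """Merge requireStorages from prerequisites into the job's storage list.

    Each storage is a dict like {"name": "confignfs", "mountPath": "/mnt/confignfs"}.
    Storages are deduplicated by name; a conflicting mountPath is treated as an error.
    """
    merged = {s["name"]: s for s in job_storages}
    for prereq in prerequisites:
        for storage in prereq.get("requireStorages", []):
            existing = merged.get(storage["name"])
            if existing is None:
                merged[storage["name"]] = storage
            elif existing.get("mountPath") != storage.get("mountPath"):
                raise ValueError(
                    "conflicting mountPath for storage '%s': %s vs %s"
                    % (storage["name"], existing.get("mountPath"), storage.get("mountPath")))
    return list(merged.values())

# Example: the imagenet prerequisite above requires confignfs at /mnt/confignfs;
# if the job already mounts confignfs at the same path, the merge is a no-op.
print(merge_storages(
    [{"name": "confignfs", "mountPath": "/mnt/confignfs"}],
    [{"name": "imagenet",
      "requireStorages": [{"name": "confignfs", "mountPath": "/mnt/confignfs"}]}]))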

How do users use a dataset in the cluster?

On marketplace pages

On the marketplace pages, users can click "Use" to create an empty job with the corresponding dataset.


On the job submission page

On the job submission page, users can select their datasets via a field under the task role section.


How to represent a marketplace prerequisite in job YAML?

The dataset prerequisite from the marketplace will be expressed as marketplace://prerequisites/itemId/<item-id>.

One example is as follows:

taskRoles:
  taskrole:
    prerequisites: ["marketplace://prerequisites/itemId/1"]
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - echo 1

The webportal page should provide a link to the marketplace for the user.

After submission, the rest-server will parse these marketplace items and pass them to the DB controller and the runtime. The rest-server should also take care of requireStorages and merge it carefully with the other storage specs in the job.
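
For illustration, a minimal Python sketch of resolving such a reference (the real rest-server is Node.js; fetch_marketplace_item and its return shape are assumptions made for this sketch):

import re

MARKETPLACE_URI = re.compile(r"^marketplace://prerequisites/itemId/(?P<item_id>.+)$")

def fetch_marketplace_item(item_id):
    # Hypothetical helper: in practice this would query the marketplace service
    # for the item and return its prerequisite spec.
    raise NotImplementedError(item_id)

def resolve_marketplace_prerequisite(uri):
    """Turn marketplace://prerequisites/itemId/<item-id> into a prerequisite spec."""
    match = MARKETPLACE_URI.match(uri)
    if match is None:
        raise ValueError("unsupported prerequisite reference: " + uri)
    return fetch_marketplace_item(match.group("item_id"))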

Any errors that happen in the rest-server while resolving these items should be reported to the user.

Other features

We can enable URLs like http(s):// in addition to marketplace://. This will bring a lot of convenience and is easy to implement; see the sketch after the example below.

taskRoles:
  taskrole:
    prerequisites: ["https://raw.githubusercontent.com/microsoft/pai/master/contrib/xxxx.yml"]
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - echo 1
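
A minimal sketch of the http(s) case, assuming a generic HTTP client and YAML parser (requests and PyYAML here; the actual rest-server is Node.js and would use its own equivalents):

import requests
import yaml

def resolve_http_prerequisite(uri):
    """Fetch a prerequisite spec published as a plain YAML file over http(s).
    marketplace:// references would still go through the marketplace resolution
    sketched earlier."""
    response = requests.get(uri, timeout=30)
    response.raise_for_status()
    return yaml.safe_load(response.text)

# e.g. resolve_http_prerequisite(
#     "https://raw.githubusercontent.com/microsoft/pai/master/contrib/xxxx.yml")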

Implementation

hzy46 commented 3 years ago

Main Design Ideas

Examples

Set up an MNIST dataset

# in marketplace
- name: install_wget
  plugin: cmd
  plugin_params:
    callbacks:
      - event: taskStarts
        commands:
          - "apt update"
          - "apt install -y wget"

# in marketplace
- name: mnist
  require:
    - name: marketplace://name/install_wget
  plugin: cmd
  plugin_params:
    callbacks:
      - event: taskStarts
        commands:
          - mkdir -p {{ dataPath }}
          - wget http://1.2.3.4/mnist.zip -O {{ dataPath }}/mnist.zip
          - cd {{ dataPath }}
          - unzip mnist.zip
  template_variables:
    - name: dataPath

# in job
prerequisites:
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - mnist
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  reference_prerequisites:
    - name: mnist
      require:
        - name: marketplace://name/mnist
          template_variables:
            dataPath: /dataset/mnist
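
For reference, a minimal sketch of the {{ ... }} substitution this design implies, applied to the mnist commands above (the render helper is hypothetical; the real rendering would live in the rest-server or runtime):

import re

def render(text, variables):
    """Replace {{ var }} placeholders with the values supplied for a prerequisite."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError("template variable '%s' is not fulfilled" % name)
        return str(variables[name])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, text)

commands = [
    "mkdir -p {{ dataPath }}",
    "wget http://1.2.3.4/mnist.zip -O {{ dataPath }}/mnist.zip",
    "cd {{ dataPath }}",
    "unzip mnist.zip",
]
# With dataPath=/dataset/mnist from extras.reference_prerequisites:
print([render(c, {"dataPath": "/dataset/mnist"}) for c in commands])
# -> ['mkdir -p /dataset/mnist',
#     'wget http://1.2.3.4/mnist.zip -O /dataset/mnist/mnist.zip',
#     'cd /dataset/mnist', 'unzip mnist.zip']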

Set up an ImageNet dataset

# set up an imagenet dataset

# in marketplace
- name: confignfs_pvc
  plugin: pvc_storage
  plugin_params:
    name: confignfs
    mountPath: "{{ mountPath }}"
  template_variables:
    - name: mountPath

# in marketplace
- name: imagenet
  require: # if the required prerequisite has template_variables, all the template_variables MUST be fulfilled.
    - name: marketplace://name/confignfs_pvc
      template_variables:
        mountPath: /mnt/confignfs_pvc
  plugin: cmd
  plugin_params:
    callbacks:
    - event: taskStarts
      commands:
      - mkdir -p {{ dataPath }}
      - cp -r /mnt/confignfs_pvc/imagenet/* {{ dataPath }}
  template_variables:
    - name: dataPath

# in marketplace
- name: imagenet_only_validation
  require: # if the required prerequisite has template_variables, all the template_variables MUST be fulfilled.
    - name: marketplace://name/confignfs_pvc
      template_variables:
        mountPath: /mnt/confignfs_pvc
  plugin: cmd
  plugin_params:
    callbacks:
    - event: taskStarts
      commands:
      - mkdir -p {{ dataPath }}
      - cp -r /mnt/confignfs_pvc/imagenet/validation/* {{ dataPath }}
  template_variables:
    - name: dataPath

# in job
prerequisites:
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - imagenet
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  reference_prerequisites:
  - name: imagenet
    require:
    - name: marketplace://name/imagenet
      template_variables:
        dataPath: /dataset/imagenet
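
The "all the template_variables MUST be fulfilled" rule in the comments above could be validated at submission time; a minimal sketch (hypothetical helper, not actual rest-server code):

def check_require_fulfilled(required_item, require_entry):
    """Check that a require entry fulfills every template variable declared
    by the required marketplace item."""
    declared = {v["name"] for v in required_item.get("template_variables", [])}
    provided = set(require_entry.get("template_variables", {}))
    missing = declared - provided
    if missing:
        raise ValueError(
            "require of '%s' is missing template variables: %s"
            % (required_item["name"], sorted(missing)))

# The imagenet items above require confignfs_pvc and fulfill its mountPath:
confignfs_pvc = {"name": "confignfs_pvc",
                 "template_variables": [{"name": "mountPath"}]}
check_require_fulfilled(
    confignfs_pvc,
    {"name": "marketplace://name/confignfs_pvc",
     "template_variables": {"mountPath": "/mnt/confignfs_pvc"}})
# passes; omitting mountPath would raise ValueError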

Set up a debug hook

# set up a debug hook

#  in marketplace
- name: debug_hook
  plugin: cmd
  plugin_params:
    callbacks:
      - event: taskFails
        commands:
          - echo "will sleep for {{ min }} minutes for debugging..."
          - sleep {{ min }}m
  template_variables:
    - name: min

# in job
prerequisites:
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - debug_hook
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  reference_prerequisites:
  - name: debug_hook
    require:
    - name: marketplace://name/debug_hook
      template_variables:
        min: 30