harvard-edge / dataperf-speech-example

Example workflow for our data-centric speech benchmark

Add MLCube integration #1

Closed davidjurado closed 2 years ago

davidjurado commented 2 years ago

DataPerf Speech Example - MLCube integration

Project setup

# Create Python environment and install MLCube Docker runner 
virtualenv -p python3 ./env && source ./env/bin/activate && pip install mlcube-docker

# Fetch the implementation from GitHub
git clone https://github.com/harvard-edge/dataperf-speech-example && cd ./dataperf-speech-example
git fetch origin pull/1/head:feature/MLCube-integration && git checkout feature/MLCube-integration

Project structure

Diagram

Task execution

# Run download task
mlcube run --task=download -Pdocker.build_strategy=always

# Run select task
mlcube run --task=select -Pdocker.build_strategy=always

# Run evaluate task
mlcube run --task=evaluate -Pdocker.build_strategy=always

Execute complete pipeline

# Run all steps
mlcube run --task=download,select,evaluate -Pdocker.build_strategy=always
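
When debugging, it can help to run the same tasks one at a time and stop at the first failure. The following is a minimal sketch, not part of this PR: `run_pipeline` and the `MLCUBE` override variable are hypothetical names, and it assumes `mlcube run` exits with a non-zero status when a task fails.

```shell
# Run the given tasks in order, stopping at the first failure.
# MLCUBE is a hypothetical override hook; it defaults to the real CLI.
# Assumption: `mlcube run` returns a non-zero exit status on failure.
run_pipeline() {
  for task in "$@"; do
    echo "Running task: $task"
    "${MLCUBE:-mlcube}" run --task="$task" -Pdocker.build_strategy=always || {
      echo "Task '$task' failed; stopping." >&2
      return 1
    }
  done
}

# Usage:
#   run_pipeline download select evaluate
```

Setting `MLCUBE=echo` before the call prints the arguments instead of invoking MLCube, which is a cheap way to check the task order.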

colbybanbury commented 2 years ago

@davidjurado How would someone specify the workspace/ directory to MLCube?

Also, is there a way to point to a file outside of workspace/, for example in config_files/?

davidjurado commented 2 years ago

Hello @colbybanbury,

Sorry for the late reply; I didn't get a notification of your comment.

To specify a different workspace folder, use --workspace and provide the path, for example:

mlcube run --task=select --workspace=path/to/new_folder 
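
If several tasks should share the new workspace, a small wrapper keeps the path consistent. This is a sketch under assumptions: `run_in_workspace` and the `MLCUBE` override variable are hypothetical names, and it assumes the relative default paths in mlcube.yaml (such as data/...) are resolved against the workspace, so a fresh workspace needs the download task before select.

```shell
# Run one task against a workspace directory, creating it if needed.
# MLCUBE is a hypothetical override hook; it defaults to the real CLI.
run_in_workspace() {
  workspace=$1
  task=$2
  mkdir -p "$workspace" || return 1
  "${MLCUBE:-mlcube}" run --task="$task" --workspace="$workspace"
}

# Usage (path/to/new_folder is a placeholder):
#   run_in_workspace path/to/new_folder download
#   run_in_workspace path/to/new_folder select
```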

To point to a file outside the workspace folder, the task you want to run needs a parameter for that file. Parameters are defined in the mlcube.yaml file. For example, the select task has the following parameters:

select:
    # Run selection algorithm
    parameters:
      inputs:
        {
          allowed_training_set: { type: file, default: data/preliminary_evaluation_dataset/allowed_training_set.yaml },
          train_embeddings_dir: data/preliminary_evaluation_dataset/train_embeddings/,
        }
      outputs: { outdir: select_output/ }

Say we want to use a different allowed_training_set. We specify the name of the parameter to override and provide the absolute path of the new file:

mlcube run --task=select allowed_training_set=/Users/me/allowed_training_set.yaml
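
Because the override takes an absolute path, a relative path can be resolved first. The sketch below uses only POSIX shell features; `abs_path`, `select_with`, and the `MLCUBE` override variable are hypothetical helpers, not part of MLCube or this repository.

```shell
# Resolve a possibly relative file path to an absolute one (POSIX shell).
# The file's parent directory must exist.
abs_path() {
  dir=$(cd "$(dirname "$1")" && pwd) || return 1
  printf '%s/%s\n' "$dir" "$(basename "$1")"
}

# Run the select task with a custom allowed_training_set file.
# MLCUBE is a hypothetical override hook; it defaults to the real CLI.
select_with() {
  file=$(abs_path "$1") || return 1
  "${MLCUBE:-mlcube}" run --task=select allowed_training_set="$file"
}

# Usage (the file name is a placeholder):
#   select_with ./my_allowed_training_set.yaml
```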