hplt-project / OpusPocus

Marian machine translation training pipeline for thousands of models

OpusPocus on LUMI

Modular NLP pipeline manager.

OpusPocus aims to simplify the description and execution of popular and custom NLP pipelines, including dataset preprocessing, model training, and evaluation. The pipeline manager supports execution through a simple CLI (Bash) or common HPC schedulers (Slurm, HyperQueue).

It uses OpusCleaner for data preparation and OpusTrainer for training scheduling (development in progress).

Structure

Installation

  1. Install MarianNMT.

  2. Prepare the OpusCleaner and OpusTrainer Python virtual environments.

  3. Install the OpusPocus requirements.

    pip install -r requirements.txt

Usage (Simple Pipeline)

See the examples/ directory for example executions.

  1. Initialize the pipeline.

    $ ./go.py init \
    --pipeline-config path/to/pipeline/config/file \
    --pipeline-dir pipeline/destination/directory

  2. Execute the pipeline.

    $ ./go.py run \
    --pipeline-dir pipeline/destination/directory \
    --runner bash

  3. Check the pipeline status.

    $ ./go.py traceback --pipeline-dir pipeline/destination/directory

    OR

    $ ./go.py status --pipeline-dir pipeline/destination/directory
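The three steps above can be chained in a small wrapper script. The following is only a sketch, not part of OpusPocus: the names `GO_PY`, `PIPELINE_CONFIG`, `PIPELINE_DIR`, `DRY_RUN`, `run_step` and `pipeline_steps` are hypothetical, and the paths are the same placeholders used in the commands above. It uses only the subcommands and options shown in this README. By default it prints the commands without executing them.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper that chains the go.py subcommands from this README.
set -euo pipefail

# Assumed defaults (placeholders); override them through the environment.
GO_PY="${GO_PY:-./go.py}"
PIPELINE_CONFIG="${PIPELINE_CONFIG:-path/to/pipeline/config/file}"
PIPELINE_DIR="${PIPELINE_DIR:-pipeline/destination/directory}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print a command; execute it only when DRY_RUN=0.
run_step() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Initialize, run with the Bash runner, then report the status.
pipeline_steps() {
    run_step "$GO_PY" init --pipeline-config "$PIPELINE_CONFIG" --pipeline-dir "$PIPELINE_DIR"
    run_step "$GO_PY" run --pipeline-dir "$PIPELINE_DIR" --runner bash
    run_step "$GO_PY" status --pipeline-dir "$PIPELINE_DIR"
}

pipeline_steps
```

Because of `set -e`, a failing step stops the script before the next one starts when `DRY_RUN=0` is set. Whether `run` returns before the pipeline has finished is not stated here, so the final `status` call may report a pipeline that is still in progress.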