hplt-project / OpusPocus

Marian machine translation training pipeline for thousands of models

Config files contain a lot of repetition - can this be avoided? #27

Open bhaddow opened 4 months ago

bhaddow commented 4 months ago

Hi

Looking at a sample config file, there is a lot of repetition. For example:

    - step: raw
      step_label: gather.${global.src_lang}-${global.tgt_lang}
      src_lang: ${global.src_lang}
      tgt_lang: ${global.tgt_lang}
      raw_data_dir: ${global.raw_data_dir}
    - step: raw
      step_label: valid.${global.src_lang}-${global.tgt_lang}
      src_lang: ${global.src_lang}
      tgt_lang: ${global.tgt_lang}
      raw_data_dir: ${global.valid_data_dir}

These step stanzas are all nested within pipeline. Why not specify the source and target language at the pipeline level? Can this be done with OmegaConf? I think it can be done with a custom resolver, if it is not supported directly. It could make the config file much easier to read.
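To make the idea concrete, here is a rough sketch using plain Python dictionaries rather than OmegaConf itself. The helper name `apply_pipeline_defaults`, the literal labels, and the data paths are all hypothetical and are not part of OpusPocus; the point is only that values set once at the pipeline level can be copied into every step unless the step sets its own.

```python
from copy import deepcopy


# Hypothetical helper, not part of OpusPocus or OmegaConf.
def apply_pipeline_defaults(pipeline, shared_keys=("src_lang", "tgt_lang")):
    """Return the step list with pipeline-level values filled in as defaults."""
    defaults = {k: pipeline[k] for k in shared_keys if k in pipeline}
    steps = []
    for step in pipeline.get("steps", []):
        # Values given on the step itself take precedence over the defaults.
        steps.append({**defaults, **deepcopy(step)})
    return steps


# Illustrative config: languages are stated once, at the pipeline level.
pipeline = {
    "src_lang": "en",
    "tgt_lang": "de",
    "steps": [
        {"step": "raw", "step_label": "gather.en-de", "raw_data_dir": "data/raw"},
        # This step overrides the pipeline-level source language.
        {"step": "raw", "step_label": "valid.fr-de", "raw_data_dir": "data/valid",
         "src_lang": "fr"},
    ],
}

resolved = apply_pipeline_defaults(pipeline)
```

With something like this, a step stanza would only need to mention src_lang or tgt_lang when it deviates from the pipeline-wide setting.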

varisd commented 4 months ago

Related to issue #22

Basically, with dataclasses and their inheritance we can set up the pipeline config and the pipeline steps in the following way. Config example:

    pipeline:
      src_lang: en
      tgt_lang: de
      steps:
        - step: raw
          step_label: gather.${global.src_lang}-${global.tgt_lang}
          raw_data_dir: ${global.raw_data_dir}
        - step: raw
          step_label: valid.${global.src_lang}-${global.tgt_lang}
          raw_data_dir: ${global.valid_data_dir}

tl;dr: We can get a reasonable simplification with dataclasses, and later we can consider some "syntactic sugar" for the most common step configurations that the refactor does not simplify.

The dataclass implementation would then have a general "pipeline" dataclass (containing things like src_lang and tgt_lang), and the "raw" step dataclass (and the other step dataclasses) could, by default, inherit the "pipeline" values (src_lang, tgt_lang) unless they are overridden by the user. This would simplify config files when defining models/corpora in one direction. For the opposite direction, we would have to either add additional optional arguments to pipeline steps (e.g., "reverse") or add some "fake" steps, such as BackwardTrainSteps, which would in practice create a regular TrainSteps with "rewired" arguments.
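A minimal sketch of how this could look, using only the standard library. The class names `StepConfig` and `PipelineConfig`, the `resolved_steps` method, and the `reverse` flag are assumptions made for illustration, not the actual OpusPocus classes. Note that the sketch fills in defaults at resolution time rather than using class inheritance, which is one of several ways to get the described behaviour.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass
class StepConfig:
    # Hypothetical step dataclass; None means "inherit from the pipeline".
    step: str
    step_label: str
    src_lang: Optional[str] = None
    tgt_lang: Optional[str] = None
    reverse: bool = False  # hypothetical flag for the opposite direction


@dataclass
class PipelineConfig:
    # Hypothetical pipeline dataclass holding the shared values.
    src_lang: str
    tgt_lang: str
    steps: List[StepConfig] = field(default_factory=list)

    def resolved_steps(self) -> List[StepConfig]:
        """Fill in pipeline-level languages and apply the reverse flag."""
        resolved = []
        for s in self.steps:
            src = s.src_lang or self.src_lang
            tgt = s.tgt_lang or self.tgt_lang
            if s.reverse:
                # "Rewire" the arguments for the backward direction.
                src, tgt = tgt, src
            resolved.append(replace(s, src_lang=src, tgt_lang=tgt))
        return resolved


config = PipelineConfig(
    src_lang="en",
    tgt_lang="de",
    steps=[
        StepConfig(step="raw", step_label="gather"),
        StepConfig(step="train", step_label="backward", reverse=True),
    ],
)
steps = config.resolved_steps()
```

Here the first step ends up with en as source and de as target, while the step marked reverse gets them swapped, which is roughly what a BackwardTrainSteps-style "fake" step would do internally.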