Open bhaddow opened 4 months ago
Related to issue #22
Basically, with dataclasses and their inheritance, we can set up the pipeline config and the pipeline steps in the following way. Config example:
    pipeline:
      src_lang: en
      tgt_lang: de
      steps:
        - step: raw
          step_label: gather.${global.src_lang}-${global.tgt_lang}
          raw_data_dir: ${global.raw_data_dir}
        - step: raw
          step_label: valid.${global.src_lang}-${global.tgt_lang}
          raw_data_dir: ${global.valid_data_dir}
tl;dr: We can get a reasonable simplification with dataclasses, and later we can consider some "syntactic sugar" for the most common step configurations that the refactor does not simplify.
The dataclass implementation would then have a general "pipeline" dataclass (containing things like src_lang and tgt_lang), and the "raw" step dataclass (and the other step dataclasses) could, by default, inherit the "pipeline" values (src_lang, tgt_lang) unless the user overrides them. This would simplify config files that define models/corpora in one direction. For the opposite direction, we would have to either add optional arguments to the pipeline steps (e.g., "reverse") or add some "fake" steps, such as BackwardTrainSteps, which would in practice create a regular TrainSteps with "rewired" arguments.
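As a rough illustration of the inheritance idea described above, here is a minimal sketch using only the standard library. The class names (`PipelineDefaults`, `RawStep`, `Pipeline`) and the `reversed_step` helper are hypothetical and are not taken from the project's code; the sketch only assumes that a step left at `None` should pick up the pipeline-level value.

```python
from dataclasses import dataclass, field, fields, replace
from typing import List, Optional


@dataclass
class PipelineDefaults:
    # Values shared by the pipeline and, unless overridden, by every step.
    src_lang: Optional[str] = None
    tgt_lang: Optional[str] = None


@dataclass
class RawStep(PipelineDefaults):
    # Hypothetical step dataclass; field names follow the config example.
    step: str = "raw"
    step_label: str = ""
    raw_data_dir: str = ""


@dataclass
class Pipeline(PipelineDefaults):
    steps: List[RawStep] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Copy pipeline-level values into every step that left them unset.
        for step in self.steps:
            for f in fields(PipelineDefaults):
                if getattr(step, f.name) is None:
                    setattr(step, f.name, getattr(self, f.name))


def reversed_step(step: RawStep) -> RawStep:
    # One possible reading of the "reverse" option: same step, languages swapped.
    return replace(step, src_lang=step.tgt_lang, tgt_lang=step.src_lang)


pipeline = Pipeline(
    src_lang="en",
    tgt_lang="de",
    steps=[
        RawStep(step_label="gather"),                              # inherits en -> de
        RawStep(step_label="valid", src_lang="de", tgt_lang="en"),  # explicit override
    ],
)
backward = reversed_step(pipeline.steps[0])
```

With this layout a step only needs to mention the languages when it deviates from the pipeline, and the "fake" backward step reduces to a small helper that rewires the arguments of a regular step.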
Hi
Looking at a sample config file, there is a lot of repetition. For example:
These `step` stanzas are all nested within `pipeline`. Why not specify the source and target language at the `pipeline` level? Can this be done with OmegaConf? I think it can be done with a custom resolver, if it is not supported directly. It could make the config file much easier to read.