hplt-project / OpusTrainer

Curriculum training
https://pypi.org/project/opustrainer/
MIT License

Synthesized alignments should be validated before producing a new sentence pair #54

Open gregtatum opened 8 months ago

gregtatum commented 8 months ago

In Marian, invalid alignments lead to a crash, since the index bounds for tokens are not checked. This breaks training. In addition, if alignments are generated incorrectly on the OpusTrainer side, they will degrade the final performance when using guided alignment training. It should be cheap and easy to validate that the alignments are within bounds. If they are out of bounds, the sentence pair can be discarded with a warning.

This could be done with something like the following:

diff --git a/src/opustrainer/alignments.py b/src/opustrainer/alignments.py
index 6c8b316..05f8ca6 100644
--- a/src/opustrainer/alignments.py
+++ b/src/opustrainer/alignments.py
@@ -23,3 +23,19 @@ def format_alignments(pairs:List[Pair]) -> str:
     """Opposite of `parse_alignments`, turns a list of alignments back into the `a-b c-d ...` string
     format that most alignment tools expect."""
     return ' '.join(f'{pair.src}-{pair.trg}' for pair in pairs)
+
+
+def validate_alignments(source: List[str], target: List[str], alignments: List[Pair]):
+    for pair in alignments:
+        if (
+            pair.src < 0
+            or pair.src >= len(source)
+            or pair.trg < 0
+            or pair.trg >= len(target)
+        ):
+            raise Exception(
+                f"Alignments were not valid for:\n"
+                f"Source: {source}\n"
+                f"Target: {target}\n"
+                f"Alignments: {alignments}"
+            )
diff --git a/src/opustrainer/modifiers/placeholders.py b/src/opustrainer/modifiers/placeholders.py
index 4af71b0..9e23d63 100644
--- a/src/opustrainer/modifiers/placeholders.py
+++ b/src/opustrainer/modifiers/placeholders.py
@@ -3,7 +3,7 @@ from operator import attrgetter
 from pathlib import Path
 from typing import Set, List, Tuple, Optional, TypeVar, Iterable

-from opustrainer.alignments import Pair, parse_alignments, format_alignments
+from opustrainer.alignments import Pair, parse_alignments, format_alignments, validate_alignments
 from opustrainer.modifiers import Modifier
 from opustrainer.tokenizers import SpaceDetokenizer, SpaceTokenizer, MosesDetokenizer, SentencePieceTokenizer
 from opustrainer.modifiers.retokenize import Retokenizer, remap_alignment_pairs
@@ -394,6 +394,7 @@ class PlaceholderTagModifier(Modifier):

         if self.print_alignments:
             remapped_pairs = remap_alignment_pairs(source_mapping, target_mapping, alignments)
+            validate_alignments(source, target, remapped_pairs)
             return source_detok + "\t" + target_detok + "\t" + format_alignments(remapped_pairs)
         else:
             return source_detok + "\t" + target_detok
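
For illustration, here is a rough usage sketch of the proposed helper, assuming the diff above is applied and that `Pair` behaves like a simple `(src, trg)` named tuple, as the existing `format_alignments` code suggests:

from opustrainer.alignments import Pair, validate_alignments

source = ['Hello', 'world', '!']
target = ['Hallo', 'Welt', '!']

# In-bounds pairs pass silently.
validate_alignments(source, target, [Pair(0, 0), Pair(1, 1), Pair(2, 2)])

# A target index of 3 is past the end of the 3-token target, so this raises,
# and the caller can discard the sentence pair instead of handing it to Marian.
validate_alignments(source, target, [Pair(1, 3)])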

One drawback is that it still won't catch issues where the wrong tokenization strategy is used, as in #53.

gregtatum commented 8 months ago

It looks like `parse_alignments` has some validation:

    if src_tokens is not None and trg_tokens is not None:
        for pair in pairs:
            if pair.src < 0 or pair.src >= len(src_tokens) \
            or pair.trg < 0 or pair.trg >= len(trg_tokens):
                raise ValueError('Out-of-bound alignment pairs')

But I think the alignments should be validated again after they have been modified. I would also prefer that anything that fails validation be written to a logger; it's hard to debug when you are dealing with tens of millions of sentence pairs.
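
A minimal sketch of that variant, using Python's standard logging module as a stand-in for whatever logger OpusTrainer ends up using (the names here are illustrative, not an existing OpusTrainer API):

import logging
from typing import List

from opustrainer.alignments import Pair

logger = logging.getLogger('opustrainer')


def validate_alignments(source: List[str], target: List[str], alignments: List[Pair]) -> bool:
    """Returns True if every pair indexes into both token lists, otherwise logs
    a warning and returns False so the caller can drop the sentence pair."""
    for pair in alignments:
        if not (0 <= pair.src < len(source) and 0 <= pair.trg < len(target)):
            logger.warning(
                'Dropping sentence pair with out-of-bound alignment %d-%d '
                '(%d source tokens, %d target tokens)',
                pair.src, pair.trg, len(source), len(target))
            return False
    return True

PlaceholderTagModifier could then skip the offending pair when the check returns False, rather than letting an out-of-bound index reach Marian.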