hplt-project / OpusTrainer

Curriculum training
https://pypi.org/project/opustrainer/
MIT License

Synthesized alignments should be validated before producing a new sentence pair #54

Open gregtatum opened 8 months ago

gregtatum commented 8 months ago

In Marian, invalid alignments lead to a crash, since the index bounds for tokens are not checked. This breaks training. In addition, if alignments are generated incorrectly on the OpusTrainer side, they will degrade the final performance when using guided alignment training. It should be cheap and easy to validate that the alignments are within bounds. If they are out of bounds, the sentence pair can be discarded with a warning.

This could be done with something like the following:

diff --git a/src/opustrainer/alignments.py b/src/opustrainer/alignments.py
index 6c8b316..05f8ca6 100644
--- a/src/opustrainer/alignments.py
+++ b/src/opustrainer/alignments.py
@@ -23,3 +23,19 @@ def format_alignments(pairs:List[Pair]) -> str:
     """Opposite of `parse_alignments`, turns a list of alignments back into the `a-b c-d ...` string
     format that most alignment tools expect."""
     return ' '.join(f'{pair.src}-{pair.trg}' for pair in pairs)
+
+
+def validate_alignments(source: List[str], target: List[str], alignments: List[Pair]):
+    for pair in alignments:
+        if (
+            pair.src < 0
+            or pair.src >= len(source)
+            or pair.trg < 0
+            or pair.trg >= len(target)
+        ):
+            raise Exception(
+                f"Alignments were not valid for:\n"
+                f"Source: {source}\n"
+                f"Target: {target}\n"
+                f"Alignments: {alignments}"
+            )
diff --git a/src/opustrainer/modifiers/placeholders.py b/src/opustrainer/modifiers/placeholders.py
index 4af71b0..9e23d63 100644
--- a/src/opustrainer/modifiers/placeholders.py
+++ b/src/opustrainer/modifiers/placeholders.py
@@ -3,7 +3,7 @@ from operator import attrgetter
 from pathlib import Path
 from typing import Set, List, Tuple, Optional, TypeVar, Iterable

-from opustrainer.alignments import Pair, parse_alignments, format_alignments
+from opustrainer.alignments import Pair, parse_alignments, format_alignments, validate_alignments
 from opustrainer.modifiers import Modifier
 from opustrainer.tokenizers import SpaceDetokenizer, SpaceTokenizer, MosesDetokenizer, SentencePieceTokenizer
 from opustrainer.modifiers.retokenize import Retokenizer, remap_alignment_pairs
@@ -394,6 +394,7 @@ class PlaceholderTagModifier(Modifier):

         if self.print_alignments:
             remapped_pairs = remap_alignment_pairs(source_mapping, target_mapping, alignments)
+            validate_alignments(source, target, remapped_pairs)
             return source_detok + "\t" + target_detok + "\t" + format_alignments(remapped_pairs)
         else:
             return source_detok + "\t" + target_detok
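
For illustration, here is a rough usage sketch of the proposed helper, assuming the diff above is applied and that `Pair` behaves like a simple `(src, trg)` named tuple, as the existing `format_alignments` code suggests:

from opustrainer.alignments import Pair, validate_alignments

source = ['Hello', 'world', '!']
target = ['Hallo', 'Welt', '!']

# In-bounds pairs pass silently.
validate_alignments(source, target, [Pair(0, 0), Pair(1, 1), Pair(2, 2)])

# A target index of 3 is past the end of the 3-token target, so this raises,
# and the caller can discard the sentence pair instead of handing it to Marian.
validate_alignments(source, target, [Pair(1, 3)])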

One drawback is that it still won't catch issues where the wrong tokenization strategy is used, as in #53.

gregtatum commented 8 months ago

It looks like `parse_alignments` has some validation:

    if src_tokens is not None and trg_tokens is not None:
        for pair in pairs:
            if pair.src < 0 or pair.src >= len(src_tokens) \
            or pair.trg < 0 or pair.trg >= len(trg_tokens):
                raise ValueError('Out-of-bound alignment pairs')

But I think the alignments should be validated again after they have been modified. I would also prefer that anything that fails validation be written to a logger; it's hard to debug when you are dealing with tens of millions of sentence pairs.
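
A minimal sketch of that variant, using Python's standard logging module as a stand-in for whatever logger OpusTrainer ends up using (the names here are illustrative, not an existing OpusTrainer API):

import logging
from typing import List

from opustrainer.alignments import Pair

logger = logging.getLogger('opustrainer')


def validate_alignments(source: List[str], target: List[str], alignments: List[Pair]) -> bool:
    """Returns True if every pair indexes into both token lists, otherwise logs
    a warning and returns False so the caller can drop the sentence pair."""
    for pair in alignments:
        if not (0 <= pair.src < len(source) and 0 <= pair.trg < len(target)):
            logger.warning(
                'Dropping sentence pair with out-of-bound alignment %d-%d '
                '(%d source tokens, %d target tokens)',
                pair.src, pair.trg, len(source), len(target))
            return False
    return True

PlaceholderTagModifier could then skip the offending pair when the check returns False, rather than letting an out-of-bound index reach Marian.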