vistec-AI / mt-opus

English-Thai Machine Translation with OPUS data

Create a script to evaluate MT models on Wang's dataset #3

Open lalital opened 5 years ago

lalital commented 5 years ago

Todos:


BLEU score evaluation:

Given source and reference sentences that are not tokenized:

  1. Tokenize the source and reference sentences with newmm and store them. For example:

    • TH sentence: "ฉันไปโรงเรียน" → ["ฉัน", "ไป", "โรง", "เรียน"]

    • EN sentence: "I go to school." → ["I", "go", "to", "school", "."]

  2. Feed the untokenized source sentences to NMT models.

    • th → en

      • word → word

        1. Tokenize the source sentences with a word-level tokenizer (e.g. newmm).
        2. Feed tokenized text to NMT model.
        3. Concatenate the predicted translation with spaces.
          For example, ["I", "go", "to", "my", "school."] → "I go to my school.".
        4. Tokenize the concatenated text with the newmm tokenizer (same configuration as in (1)). For example, "I go to my school." → ["I", "go", "to", "my", "school", "."].
        5. Compute the BLEU score between the tokenized predicted translation and the reference sentences tokenized in (1).
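The steps above can be sketched as follows. Here `tokenize` stands in for the newmm tokenizer from step (1) (in practice pythainlp's `word_tokenize(text, engine="newmm")`); a plain whitespace split keeps the example self-contained, though unlike the real tokenizer it does not split trailing punctuation off a word.

```python
# Steps 3-4 of the word -> word pipeline: detokenize the model output,
# then re-tokenize it with the same tokenizer used for the references.
def tokenize(text):
    # Stand-in for newmm; a whitespace split keeps this self-contained.
    return text.split()

def postprocess(predicted_tokens):
    detokenized = " ".join(predicted_tokens)   # step 3
    return tokenize(detokenized)               # step 4

hypothesis = postprocess(["I", "go", "to", "my", "school", "."])
reference = ["I", "go", "to", "school", "."]
# Step 5: pass `hypothesis` and `reference` (lists of tokens) to a BLEU
# implementation such as sacrebleu or nltk.translate.bleu_score.
```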
      • subword → word

        1. Tokenize the source sentences with a subword-level tokenizer (e.g. SentencePiece).
        2. Feed tokenized text to NMT model.
        3. Concatenate the predicted translation with spaces.
          For example, ["I", "go", "to", "my", "school."] → "I go to my school.".
        4. Tokenize the concatenated text with the newmm tokenizer (same configuration as in (1)). For example, "I go to my school." → ["I", "go", "to", "my", "school", "."].
        5. Compute the BLEU score between the tokenized predicted translation and the reference sentences tokenized in (1).
      • word → subword

        1. Tokenize the source sentences with a word-level tokenizer (e.g. newmm).

        2. Feed tokenized text to NMT model.

        3. Apply the BPE-removal operation to the predicted translation. For example, ["_I", "_go", "_to", "_my", "_sch", "ool", "."] → "I go to my school.".

        4. Tokenize the text after BPE removal with the newmm tokenizer (same configuration as in (1)). For example, "I go to my school." → ["I", "go", "to", "my", "school", "."].

        5. Compute the BLEU score between the tokenized predicted translation and the reference sentences tokenized in (1).

      • subword → subword

        1. Tokenize the source sentences with a subword-level tokenizer (e.g. SentencePiece).
        2. Feed tokenized text to NMT model.
        3. Apply the BPE-removal operation to the predicted translation. For example, ["_I", "_go", "_to", "_my", "_sch", "ool", "."] → "I go to my school.".
        4. Tokenize the text after BPE removal with the newmm tokenizer (same configuration as in (1)). For example, "I go to my school." → ["I", "go", "to", "my", "school", "."].
        5. Compute the BLEU score between the tokenized predicted translation and the reference sentences tokenized in (1).
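The BPE-removal step can be sketched as follows. SentencePiece marks word boundaries with U+2581 ("▁"), written as "_" in the examples above; this sketch assumes that marker convention.

```python
# Undo SentencePiece-style subword segmentation: glue the pieces back
# together and turn each boundary marker into a space.
def remove_bpe(pieces, marker="\u2581"):
    return "".join(pieces).replace(marker, " ").strip()

text = remove_bpe(["\u2581I", "\u2581go", "\u2581to", "\u2581my", "\u2581sch", "ool", "."])
# text == "I go to my school."
```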
    • en → th

      • word → word

        1. Tokenize the source sentences with a word-level tokenizer (e.g. newmm).
        2. Feed tokenized text to NMT model.
        3. Concatenate the predicted translation with spaces.
          For example, ["ฉัน", "ไปที่", "โรงเรียน", "มัธยม"] → "ฉัน ไปที่ โรงเรียน มัธยม".
        4. Tokenize the concatenated text with the newmm tokenizer (same configuration as in (1)). For example, "ฉัน ไปที่ โรงเรียน มัธยม" → ["ฉัน", "ไปที่", "โรง", "เรียน", "มัธยม"].
        5. Compute the BLEU score between the tokenized predicted translation and the reference sentences tokenized in (1).
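Step 4 is what makes the scores comparable: newmm may split a word the model produced as a single token (e.g. "โรงเรียน" → "โรง", "เรียน"), so the hypothesis and the reference end up under one segmentation. A toy longest-matching tokenizer over a made-up vocabulary illustrates the effect; newmm itself is dictionary-based maximal matching, so use pythainlp in practice rather than this sketch.

```python
# Toy greedy longest-match tokenizer standing in for newmm.
# The vocabulary below is made up for illustration only.
VOCAB = {"ฉัน", "ไป", "ที่", "ไปที่", "โรง", "เรียน", "มัธยม"}

def toy_newmm(text):
    tokens = []
    for chunk in text.split():           # spaces already separate words
        i = 0
        while i < len(chunk):
            # Greedy: take the longest vocabulary entry starting at i.
            for j in range(len(chunk), i, -1):
                if chunk[i:j] in VOCAB:
                    tokens.append(chunk[i:j])
                    i = j
                    break
            else:
                tokens.append(chunk[i])  # unknown-character fallback
                i += 1
    return tokens

# "โรงเรียน" is not in the vocabulary, so it splits into "โรง" + "เรียน",
# matching the segmentation of the references from step (1).
print(toy_newmm("ฉัน ไปที่ โรงเรียน มัธยม"))
```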
      • subword → word

        1. Tokenize the source sentences with a subword-level tokenizer (e.g. SentencePiece).
        2. Feed tokenized text to NMT model.
        3. Concatenate the predicted translation with spaces.
          For example, ["ฉัน", "ไปที่", "โรงเรียน", "มัธยม"] → "ฉัน ไปที่ โรงเรียน มัธยม".
        4. Tokenize the concatenated text with the newmm tokenizer (same configuration as in (1)). For example, "ฉัน ไปที่ โรงเรียน มัธยม" → ["ฉัน", "ไปที่", "โรง", "เรียน", "มัธยม"].
        5. Compute the BLEU score between the tokenized predicted translation and the reference sentences tokenized in (1).
      • word → subword

        1. Tokenize the source sentences with a word-level tokenizer (e.g. newmm).
        2. Feed tokenized text to NMT model.
        3. Apply the BPE-removal operation to the predicted translation. For example, ["_ฉัน", "ไป", "_ที่", "_โรงเรียน", "มัธยม"] → "ฉันไป ที่ โรงเรียนมัธยม".
        4. Tokenize the text after BPE removal with the newmm tokenizer (same configuration as in (1)). For example, "ฉันไป ที่ โรงเรียนมัธยม" → ["ฉัน", "ไป", "ที่", "โรง", "เรียน", "มัธยม"].
        5. Compute the BLEU score between the tokenized predicted translation and the reference sentences tokenized in (1).
      • subword → subword

        1. Tokenize the source sentences with a subword-level tokenizer (e.g. SentencePiece).
        2. Feed tokenized text to NMT model.
        3. Apply the BPE-removal operation to the predicted translation. For example, ["_ฉัน", "ไป", "_ที่", "_โรงเรียน", "มัธยม"] → "ฉันไป ที่ โรงเรียนมัธยม".
        4. Tokenize the text after BPE removal with the newmm tokenizer (same configuration as in (1)). For example, "ฉันไป ที่ โรงเรียนมัธยม" → ["ฉัน", "ไป", "ที่", "โรง", "เรียน", "มัธยม"].
        5. Compute the BLEU score between the tokenized predicted translation and the reference sentences tokenized in (1).
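All eight pipelines end in the same step 5: a BLEU score over the re-tokenized hypotheses and the references from step (1). In practice a standard implementation (sacrebleu, or nltk.translate.bleu_score.corpus_bleu) should be used; the sketch below is a minimal single-reference corpus BLEU with uniform 4-gram weights and no smoothing, for illustration only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis.

    Both arguments are lists of token lists: the outputs of step (4)
    and step (1) respectively.
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        matched = total = 0
        for hyp, ref in zip(hypotheses, references):
            hyp_counts = Counter(ngrams(hyp, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clipped n-gram matches against the reference counts.
            matched += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total += max(len(hyp) - n + 1, 0)
        if matched == 0:
            return 0.0  # no smoothing in this minimal version
        log_precisions.append(math.log(matched / total))
    hyp_len = sum(len(h) for h in hypotheses)
    ref_len = sum(len(r) for r in references)
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(sum(log_precisions) / max_n)
```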

Example:

th → en

en → th