unfoldingWord / wordMAP

Multilingual Word Alignment Prediction
https://wordmap.netlify.com

alignment output idea. #5

Open da1nerd opened 6 years ago

da1nerd commented 6 years ago

We need to support alignments across verses. Here are three possible solutions.

{
  "confidence": 0.516905944153279,
  "sourceNgram": [0],
  "targetNgram": [10, { // add an object
    "position": 0,
    "verse": 2,
    "chapter": 3
  }],
  // or separate object
  "versification": {
    "target": {
      "nextVerseId": 1
    },
    "source": {}
  }
}

Or we can keep the numeric ids and format them like 11001, i.e. chapter 11 and verse 1.

da1nerd commented 6 years ago

This is better

{
  "confidence": 0.516905944153279,
  "sourceNgram": [0],
  "targetNgram": [10, 0, 1, 10001]
}

jag3773 commented 6 years ago

@neutrinog Can you explain what the values in [10, 0, 1, 10001] refer to?

da1nerd commented 6 years ago

NOTE: I think the 10 was in there by accident.

Structure

Given a context of verse 9, and that verse 9 contains two tokens, here is an alignment of three tokens from the target text to one token in the source text:

{
  "confidence": 0.516905944153279,
  "sourceNgram": [0],
  "targetNgram": [0, 1, 10001]
}

Within the targetNgram we see two tokens from verse 9, indicated by the positional values 0 and 1. Additionally, the alignment includes the second token (index 1, zero-indexed) from verse 10, indicated by 10001.

Rules

Referring to tokens outside of the current context proceeds as follows:

Prepend additional context as required.

NOTE: aligning tokens across different books is not supported, and we believe it is unnecessary.

As additional context is prepended, the previous context must be zero-filled to three digits.

Here the above example is shown in its expanded, simplified, and parsed forms:

chapter  verse  token  form
000      010    001    expanded
         10     001    simple
         10     1      parsed
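
The rules above can be sketched as a small encoder. This is only an illustration, not code from wordMAP: `encodePosition` is a hypothetical helper, and the 0 to 999 limit on each part is an assumption that follows from the three-digit zero fill.

```typescript
// Hypothetical helper (not part of wordMAP): builds a positional id from a
// zero-indexed token position plus optional verse and chapter context.
function encodePosition(token: number, verse?: number, chapter?: number): number {
  if (chapter !== undefined && verse === undefined) {
    throw new Error("chapter context requires verse context");
  }

  // Most significant context first: chapter, then verse, then token.
  const parts: number[] = [];
  if (chapter !== undefined) parts.push(chapter);
  if (verse !== undefined) parts.push(verse);
  parts.push(token);

  const text = parts
    .map((part, index) => {
      // Assumption: each part must fit in three digits.
      if (part < 0 || part > 999 || Math.floor(part) !== part) {
        throw new Error("each part must be an integer from 0 to 999");
      }
      // The leading part is written as-is; every part that has context
      // prepended in front of it is zero-filled to three digits.
      return index === 0 ? String(part) : ("000" + part).slice(-3);
    })
    .join("");

  return parseInt(text, 10);
}

console.log(encodePosition(1));     // token 1 in the current verse
console.log(encodePosition(1, 10)); // token 1 in verse 10
```

Tokens in the current context stay plain indexes, so existing alignments such as `[0, 1]` remain valid without any change.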

Parsing such a value is done by casting the value to a string and splitting it into chunks of three characters, working from the end (right side).
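
A minimal sketch of that parsing step follows. `parsePosition` and `ParsedPosition` are hypothetical names chosen for illustration, not wordMAP API, and rejecting a fourth chunk is an assumption based on the note that cross-book alignment is not supported.

```typescript
// Hypothetical result shape: verse and chapter are present only when the
// value carried that extra context.
interface ParsedPosition {
  token: number;
  verse?: number;
  chapter?: number;
}

// Hypothetical helper (not part of wordMAP): splits a positional id into
// three-character chunks, starting from the right side.
function parsePosition(value: number): ParsedPosition {
  if (value < 0 || Math.floor(value) !== value) {
    throw new Error("position must be a non-negative integer");
  }

  const text = String(value);
  const chunks: number[] = [];
  for (let end = text.length; end > 0; end -= 3) {
    const start = Math.max(0, end - 3);
    chunks.push(parseInt(text.slice(start, end), 10));
  }

  // Assumption: anything beyond chapter context would be a book, which
  // the proposal does not support.
  if (chunks.length > 3) {
    throw new Error("context beyond chapter is not supported");
  }

  const parsed: ParsedPosition = { token: chunks[0] };
  if (chunks.length > 1) parsed.verse = chunks[1];
  if (chunks.length > 2) parsed.chapter = chunks[2];
  return parsed;
}

console.log(parsePosition(10001)); // verse 10, token 1
```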

da1nerd commented 6 years ago

@klappy fyi, I included this description ^

da1nerd commented 5 years ago

The most recent approach being considered involves storing a context id inside of the tokens. This will allow wordMap to be agnostic to the concept of crossing verse and chapter boundaries.

Here is a contrived example where wordMap has received two tokens:

{
  "text": "Lord",
  "occurrence": 1,
  "occurrences": 1,
  "contextId": "BOOK001001"
}
{
  "text": "The",
  "occurrence": 1,
  "occurrences": 1,
  "contextId": "BOOK001002"
}

In this case the token at index 0 is from verse 1 and the token at index 1 is from verse 2. wordMap will be able to process these tokens as normal, and the alignment will contain these token objects for later reference. See https://github.com/translationCoreApps/wordMAP/commit/449906816fde6444033f67915dc32faf4daf9b53 as an example of passing the token object to the output.

With this method it should be noted that cross-verse alignment would not be supported (at least not in a deterministic way) with simple string input to wordMap. The input must be pre-tokenized, with the context id added as needed.
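
To illustrate how a consumer might use the context id after alignment, here is a rough sketch. The `Token` interface mirrors only the fields shown in the example above; `distinctContexts` and `spansMultipleContexts` are hypothetical helpers, not wordMAP API, and the context id is treated as an opaque string because its exact format is not specified here.

```typescript
// Mirrors the token fields from the example; contextId is optional so that
// tokens without context still type-check.
interface Token {
  text: string;
  occurrence: number;
  occurrences: number;
  contextId?: string;
}

// Hypothetical helper: the distinct context ids in first-seen order.
function distinctContexts(tokens: Token[]): string[] {
  const seen: string[] = [];
  tokens.forEach((token) => {
    if (token.contextId !== undefined && seen.indexOf(token.contextId) === -1) {
      seen.push(token.contextId);
    }
  });
  return seen;
}

// Hypothetical helper: true when the tokens come from more than one
// context, e.g. an n-gram that crosses a verse boundary.
function spansMultipleContexts(tokens: Token[]): boolean {
  return distinctContexts(tokens).length > 1;
}

const tokens: Token[] = [
  { text: "Lord", occurrence: 1, occurrences: 1, contextId: "BOOK001001" },
  { text: "The", occurrence: 1, occurrences: 1, contextId: "BOOK001002" },
];

console.log(distinctContexts(tokens));
console.log(spansMultipleContexts(tokens));
```

Because the check only compares ids for equality, the alignment code itself never needs to know what a verse or chapter is.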

da1nerd commented 5 years ago

@PhotoNomad0 :point_up: