sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

When translating on a USFM file with a table spanning multiple verses, the output USFM drops some verse markers and table formatting markers #329

Open mmartin9684-sil opened 5 months ago

mmartin9684-sil commented 5 months ago

When the translate script is run on a USFM file with a table that spans multiple verses, the output USFM file can omit mid-table verse markup as well as some of the table formatting markers.

As an example, the first attached file shows the USFM source for the NIV11R's version of NEH 7:7-39. A single table spans verses 7-38, and a second table begins in verse 39. The second attached file shows the translated USFM generated by the translate script, in which the mid-table verse markers for verses 8-38 ("\v 8" through "\v 38") are missing; in addition, the table format markers "\tc1" and "\tcr2" are missing from the output. The next verse marker in the output USFM is at verse 39, where the second table starts.

Attachments: NIV11R (NEH 7.7-39).txt, Mashi Draft (NEH 7.7-39).txt
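A quick way to spot this kind of loss is to compare the verse markers in the source and draft files. The following is only a minimal sketch, not part of silnlp or Machine: the function names and the sample strings are made up for illustration, and the regular expression is a simplification that ignores verse bridges.

```python
import re
from typing import List

# Matches "\v 8" and captures the verse number. Verse bridges such as
# "\v 7-8" are only credited with their first number (a simplification).
VERSE_RE = re.compile(r"\\v\s+(\d+)")


def verse_numbers(usfm: str) -> List[int]:
    """Return the verse numbers in the order their \\v markers appear."""
    return [int(n) for n in VERSE_RE.findall(usfm)]


def missing_verses(source_usfm: str, draft_usfm: str) -> List[int]:
    """Return verses marked in the source that have no \\v marker in the draft."""
    drafted = set(verse_numbers(draft_usfm))
    return [v for v in verse_numbers(source_usfm) if v not in drafted]


# Tiny made-up sample (not the attached files): verses 8 and 9 sit inside a table.
source = r"\v 7 List: \tr \tc1 a \tcr2 1 \v 8 \tr \tc1 b \tcr2 2 \v 9 \tr \tc1 c \tcr2 3 \v 10 end"
draft = r"\v 7 List: \tr a \tr b \tr c \v 10 end"
print(missing_verses(source, draft))
```

Running the same check on the two attached files should list whichever verses lost their markers inside the table.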

mmartin9684-sil commented 5 months ago

Another similar example. Here's the source text in the NIV11R input for NEH 7:45:

\v 45 The gatekeepers of the temple:
\tr \tc1 the descendants of
\tr \tc1 ~~ Shallum, Ater, Talmon,
\tr \tc1 ~~Akkub, Hatita and Shobai \tcr2 138
\b
\li4
\v 46 The temple servants:

And here's the translated output:

\v 45 Akwakunyunga thsero thzo Tembere:
\tr anahoka a
\tr Shalumi, Atere, Talumone,
\tr Dikanda dya Akubusi, Hatita na Shoba
\b
\li4
\v 46 Angamba o Tembere:

Note the loss of the "~~" text in the second and third rows of the table. The third row is also missing the number ("138") and the table formatting marker ("\tcr2"). In addition, the "\tc1" marker is missing from all three rows.
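The marker losses in this example can be tallied mechanically. Below is a rough sketch (the helper names are hypothetical, and the regular expression is a simplified stand-in for a real USFM parser) that counts backslash markers in the source and the draft quoted above and reports the ones that appear less often in the draft.

```python
import re
from collections import Counter

# Simplified marker pattern: a backslash, letters, and an optional number
# (e.g. "\tr", "\tc1", "\tcr2", "\li4"). It does not handle every USFM form.
MARKER_RE = re.compile(r"\\([a-z]+\d*)")


def marker_counts(usfm: str) -> Counter:
    """Count how often each marker occurs in a USFM string."""
    return Counter(MARKER_RE.findall(usfm))


def dropped_markers(source_usfm: str, draft_usfm: str) -> dict:
    """Return markers that occur fewer times in the draft, with the shortfall."""
    # Counter subtraction keeps only the positive differences.
    return dict(marker_counts(source_usfm) - marker_counts(draft_usfm))


# The NEH 7:45 source and draft text quoted in this comment.
SOURCE = (
    r"\v 45 The gatekeepers of the temple: \tr \tc1 the descendants of "
    r"\tr \tc1 ~~ Shallum, Ater, Talmon, \tr \tc1 ~~Akkub, Hatita and Shobai "
    r"\tcr2 138 \b \li4 \v 46 The temple servants:"
)
DRAFT = (
    r"\v 45 Akwakunyunga thsero thzo Tembere: \tr anahoka a "
    r"\tr Shalumi, Atere, Talumone, \tr Dikanda dya Akubusi, Hatita na Shoba "
    r"\b \li4 \v 46 Angamba o Tembere:"
)

print(dropped_markers(SOURCE, DRAFT))  # {'tc1': 3, 'tcr2': 1}
```

This only looks at markers; it would not catch lost text such as the "~~" or the "138".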

ddaspit commented 5 months ago

As a part of this issue, we should move to Machine for USFM parsing and generation. I believe that Machine supports table markers better.

mmartin9684-sil commented 5 months ago

Would there be any other types of USFM markers where it would be helpful for us to double-check the way the markers are being handled in the drafts? Maybe we could do a little survey of what we're seeing in the drafts for markers where you suspect there could be problems.

ddaspit commented 5 months ago

The least-tested USFM markers are probably those for tables, milestones, spacing and breaks, and study Bible content such as sidebars. Machine should have better support for most of these markers.
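For the survey suggested above, one possible starting point is to bucket the markers found in a draft by these categories. This is a hedged sketch only: the category patterns are my own reading of the USFM marker names (for example "\tr"/"\tc#"/"\tcr#" for tables, "-s"/"-e" suffixes for milestones, "\b"/"\pb" for breaks, "\esb"/"\esbe" for sidebars), they are not taken from silnlp or Machine, and spacing characters such as "~" and "//" are not covered because they are not backslash markers.

```python
import re
from collections import Counter

# Simplified marker pattern, with an optional "-s"/"-e" milestone suffix.
MARKER_RE = re.compile(r"\\([a-z]+\d*(?:-[se])?)")

# Assumed category patterns; adjust them to whatever the survey should cover.
CATEGORIES = [
    ("table", re.compile(r"t(r|[hc]r?\d*)")),      # tr, th#, thr#, tc#, tcr#
    ("milestone", re.compile(r"[a-z]+\d*-[se]")),  # e.g. qt-s, qt-e
    ("break", re.compile(r"b|pb")),                # blank line, page break
    ("sidebar", re.compile(r"esbe?")),             # esb, esbe
]


def categorize(marker: str) -> str:
    """Map a marker name (without the backslash) to a survey category."""
    for name, pattern in CATEGORIES:
        if pattern.fullmatch(marker):
            return name
    return "other"


def survey(usfm: str) -> Counter:
    """Count the markers in a USFM string per category."""
    return Counter(categorize(m) for m in MARKER_RE.findall(usfm))


# Made-up sample text, for illustration only.
sample = r"\esb \p text \qt-s\* words \qt-e\* \esbe \tr \tc1 a \b"
print(survey(sample))
```

Comparing `survey(source)` with `survey(draft)` for each book would show which categories lose markers most often.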