Closed by mmartin9684-sil 2 months ago
Another similar example. Here's the source text in the NIV11R input for NEH 7:45:
\v 45 The gatekeepers of the temple:
\tr \tc1 the descendants of
\tr \tc1 ~~ Shallum, Ater, Talmon,
\tr \tc1 ~~Akkub, Hatita and Shobai \tcr2 138
\b
\li4 \v 46 The temple servants:
And here's the translated output:
\v 45 Akwakunyunga thsero thzo Tembere:
\tr anahoka a
\tr Shalumi, Atere, Talumone,
\tr Dikanda dya Akubusi, Hatita na Shoba
\b
\li4 \v 46 Angamba o Tembere:
Note the loss of the "~~" text in the second and third rows of the table. The third row is also missing the number ("138") and the table formatting marker "\tcr2".
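One way to spot dropped markers like these is to compare marker counts between the source and the draft. The following is only a minimal sketch and is not part of the translate script: `missing_markers` is a hypothetical helper name, and the regex is a simplification that ignores milestone suffixes and attributes.

```python
import re
from collections import Counter

# Simplified marker pattern: a backslash, lowercase letters, optional level digits.
MARKER_RE = re.compile(r"\\[a-z]+\d*")


def missing_markers(source: str, draft: str) -> dict:
    """Return markers that occur more often in the source than in the draft."""
    src = Counter(MARKER_RE.findall(source))
    out = Counter(MARKER_RE.findall(draft))
    # Counter subtraction keeps only positive counts.
    return dict(src - out)


# The NEH 7:45 source and draft quoted above.
SOURCE = (
    r"\v 45 The gatekeepers of the temple: \tr \tc1 the descendants of "
    r"\tr \tc1 ~~ Shallum, Ater, Talmon, \tr \tc1 ~~Akkub, Hatita and Shobai "
    r"\tcr2 138 \b \li4 \v 46 The temple servants:"
)
DRAFT = (
    r"\v 45 Akwakunyunga thsero thzo Tembere: \tr anahoka a "
    r"\tr Shalumi, Atere, Talumone, \tr Dikanda dya Akubusi, Hatita na Shoba "
    r"\b \li4 \v 46 Angamba o Tembere:"
)

print(missing_markers(SOURCE, DRAFT))
```

For this example the check reports three `\tc1` markers and one `\tcr2` marker as missing. It only counts markers, so it would not catch the lost "~~" text or the dropped number.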
As a part of this issue, we should move to Machine for USFM parsing and generation. I believe that Machine supports table markers better.
Would there be any other types of USFM markers where it would be helpful for us to double-check the way the markers are being handled in the drafts? Maybe we could do a little survey of what we're seeing in the drafts for markers where you suspect there could be problems.
The least tested USFM markers would probably be tables, milestones, spacing and break markers, and study Bible markers, such as sidebars. Machine should have better support for most of these markers.
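For the survey, a rough tally of markers by category could show which of these groups actually appear in a given project. This is a sketch under assumptions: the category lists below are an illustrative, incomplete selection of USFM markers, `survey` and `categorize` are hypothetical names, and the sample string is made up for the demonstration.

```python
import re
from collections import Counter

# Marker name, optional level digits, optional milestone suffix (-s / -e).
MARKER_RE = re.compile(r"\\([a-z]+\d*(?:-[se])?)")

# Illustrative, incomplete category patterns (assumption, not a full USFM list).
CATEGORIES = [
    ("table", re.compile(r"tr|t[hc]r?\d*")),         # \tr, \th#, \thr#, \tc#, \tcr#
    ("milestone", re.compile(r"ts|[a-z]+\d*-[se]")),  # \ts, \qt-s, \qt-e, ...
    ("break", re.compile(r"b|pb")),                   # blank line, page break
    ("study", re.compile(r"esb|esbe|ef|ex|cat")),     # sidebars, extended notes
]


def categorize(marker: str) -> str:
    """Map a marker name (without the backslash) to a survey category."""
    for name, pattern in CATEGORIES:
        if pattern.fullmatch(marker):
            return name
    return "other"


def survey(usfm: str) -> Counter:
    """Count markers in a USFM string by category."""
    return Counter(categorize(m) for m in MARKER_RE.findall(usfm))


# Made-up sample text, for illustration only.
sample = r"\tr \tc1 a \tcr2 138 \b \qt-s\* x \qt-e\* \esb y \esbe \v 1 z"
print(survey(sample))
```

Running the same tally on a source file and on its draft, then comparing the two counters, would point at the categories worth a closer look.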
Using the machine.py parser, tables are handled much better, at least for these examples (see the attached file). All of the markers are preserved. However, as in other cases where a verse contains multiple markers, all of the translated text for the verse is inserted into the first text element, directly after the verse marker. The "~~" sequences are still missing, but I suspect that is related to the normalization done by the NLLB tokenizer, akin to the issues in #297.
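To make the placement behavior concrete, here is a toy illustration of what "everything goes into the first text element" means for verse 45. It does not use or reproduce machine.py's API: the `(marker, text)` segment list, `place_in_first_segment`, and `render` are hypothetical, and the placeholder stands in for the real translated text. How the actual output treats cell contents such as the number is not shown in this thread.

```python
def place_in_first_segment(segments, translation):
    """Keep every marker, but put the whole translation after the first one.

    segments: list of (marker, text) pairs for one verse, verse marker first.
    """
    placed = []
    for index, (marker, _text) in enumerate(segments):
        placed.append((marker, translation if index == 0 else ""))
    return placed


def render(segments):
    """Join the segments back into a single USFM-like string."""
    return " ".join(f"{marker} {text}".strip() for marker, text in segments)


# Verse 45 from the NIV11R example, split by hand into segments.
VERSE_45 = [
    (r"\v 45", "The gatekeepers of the temple:"),
    (r"\tr", ""),
    (r"\tc1", "the descendants of"),
    (r"\tr", ""),
    (r"\tc1", "~~ Shallum, Ater, Talmon,"),
    (r"\tr", ""),
    (r"\tc1", "~~Akkub, Hatita and Shobai"),
    (r"\tcr2", "138"),
]

draft = render(place_in_first_segment(VERSE_45, "<translated verse text>"))
print(draft)
```

In this toy version the markers all survive, but the table cells end up empty, which is the trade-off described above.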
When the translate script is run on a USFM file with a table that spans multiple verses, the output USFM file can omit mid-table verse markup as well as some of the table formatting markers.
As an example, the first attached file shows the USFM source for the NIV11R's version of NEH 7:7-39. A single table spans verses 7-38, and a second table begins in verse 39. The second attached file shows the translated USFM generated by the translate script. In it, the verse markers for verses 7-38 are missing ("\v 7" through "\v 38"), and the table format markers "\tc1" and "\tcr2" are also missing. The output USFM doesn't include any verse markers until verse 39, where the next table starts.
Attachments: NIV11R (NEH 7.7-39).txt, Mashi Draft (NEH 7.7-39).txt
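A small check along these lines could flag dropped verse markers automatically. It is a sketch only: `missing_verses` is a hypothetical helper, the two strings are a made-up miniature rather than the attached files, and the regex reads only the first number of a verse bridge such as "\v 7-8".

```python
import re

# Captures the verse number that follows a \v marker.
VERSE_RE = re.compile(r"\\v (\d+)")


def missing_verses(source: str, draft: str) -> list:
    """Return verse numbers present in the source but absent from the draft."""
    src = [int(n) for n in VERSE_RE.findall(source)]
    out = {int(n) for n in VERSE_RE.findall(draft)}
    return [n for n in src if n not in out]


# Made-up miniature of a table spanning verses, not the attached files.
source = r"\v 7 a \tr \tc1 b \tcr2 1 \v 8 \tr \tc1 c \tcr2 2 \v 9 d"
draft = r"\tr x \tr y \v 9 z"
print(missing_verses(source, draft))
```

Run against a full source and draft pair, the returned list gives the exact range of verses whose markers were lost.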