sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Translate script - parsing error on cross-references (\xt ... \xt*) in section reference lines (\sr) #367

Open mmartin9684-sil opened 3 months ago

mmartin9684-sil commented 3 months ago

Various USFM parsing errors occur when the translate script is run on a USFM file that uses cross-references (\xt <text>\xt*) as part of a section reference line (\sr <text>). For instance, in the NASV project, these section header and section reference lines in the book of RUT will cause a parsing error:

\s1 የዳዊት የትውልድ ሐረግ
\sr 4፥18 ተጓ ምብ – \xt 1ዜና 2፥5-15፤ ማቴ 1፥3-6፤ ሉቃ 3፥31-33\xt*

None of the stylesheet-field-update options (merge, ignore, replace) can be used to work around the parsing error.
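Until the parser handles this case, it may help to locate the affected lines in a book before running the translate script. The following is a minimal sketch using only the standard library. The function name `find_sr_lines_with_xt` and the regular expression are illustrative assumptions and are not part of silnlp.

```python
import re

# Assumed pattern: a line starting with \sr that contains a \xt ... \xt* span.
SR_WITH_XT = re.compile(r"^\\sr\b.*\\xt\s.*?\\xt\*")


def find_sr_lines_with_xt(usfm_text):
    """Return (line_number, line) pairs for section reference lines with cross-references."""
    return [
        (number, line)
        for number, line in enumerate(usfm_text.splitlines(), start=1)
        if SR_WITH_XT.search(line)
    ]


# The two lines quoted above from RUT in the NASV project.
sample = "\n".join(
    [
        r"\s1 የዳዊት የትውልድ ሐረግ",
        r"\sr 4፥18 ተጓ ምብ – \xt 1ዜና 2፥5-15፤ ማቴ 1፥3-6፤ ሉቃ 3፥31-33\xt*",
    ]
)

for number, line in find_sr_lines_with_xt(sample):
    print(number, line)
```

This only reports which lines match. It does not parse USFM and makes no attempt to repair the file.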

A sample stack trace when this error occurs:

2024-04-17 16:49:14,464 - silnlp.nmt.translate - ERROR - Was not able to translate RUT.
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/translate.py", line 116, in translate_books
    translator.translate_book(
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/common/translator.py", line 321, in translate_book
    self.translate_usfm(
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/common/translator.py", line 420, in translate_usfm
    update_segments(segments, translations)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/common/translator.py", line 141, in update_segments
    first_para = next(p for p in segment.paras if len(p.child_indices) > 0)
StopIteration

mshannon-sil commented 3 months ago

I tried different combinations of the stylesheet-field-update options and versions of Ruth with and without the \xt markers in \sr lines, and it seems that there are actually a couple of issues. The \xt markers in the \sr lines are the reason it fails with the ignore option, although that's just because that option ignores the project's custom stylesheet. With either of the other two options (merge or replace), the issue with the \xt marker no longer occurs, since the project's custom stylesheet is now being used. Instead, there's an issue with the \b marker, which is the same issue that motivated introducing the ignore option in the first place. The sample stack trace corresponds to this second issue.
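For reference, the `StopIteration` in the trace is what `next()` raises when the generator expression yields nothing, i.e. when no paragraph in the segment has any child indices. A minimal sketch of that failure mode follows. `Para` and `Segment` here are hypothetical stand-ins that only mirror the attribute names in the traceback, not the actual silnlp classes, and returning `None` is just one possible way to handle the empty case.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Para:
    # Hypothetical stand-in: indices of the paragraph's child nodes.
    child_indices: List[int] = field(default_factory=list)


@dataclass
class Segment:
    # Hypothetical stand-in: paragraphs belonging to one segment.
    paras: List[Para] = field(default_factory=list)


def first_para_with_children(segment: Segment) -> Optional[Para]:
    """Return the first paragraph that has children, or None if there is none."""
    # Passing a default to next() avoids StopIteration when nothing matches.
    return next((p for p in segment.paras if len(p.child_indices) > 0), None)


# A segment whose paragraphs are all empty, as an empty paragraph might produce.
segment = Segment(paras=[Para(), Para()])

try:
    next(p for p in segment.paras if len(p.child_indices) > 0)
except StopIteration:
    print("bare next() raised StopIteration")

print(first_para_with_children(segment))
```

Whether skipping such a segment is the right behavior depends on how the parser represents these paragraphs, so this only shows where the exception comes from.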

Seeing as how I'm going to be replacing the USFM parser in SILNLP with the one from machine.py, I'll take a look at addressing this issue as part of that process.