sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

Translate script doesn't support USFM file names using the "<nn>-<book>" naming format #456

Open mmartin9684-sil opened 1 month ago

mmartin9684-sil commented 1 month ago

Paratext-compatible projects downloaded from Door43 (e.g., https://git.door43.org/unfoldingWord/el-x-koine_ugnt) name their USFM files using the "<nn>-<book>" book name format: a two-digit book number, a hyphen, and the book ID.

When the translate script is run with one of these projects as the source project, it fails because it doesn't handle this book naming format:

2024-07-13 07:58:52,216 - silnlp.nmt.translate - ERROR - Was not able to translate MIC.
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/translate.py", line 122, in translate_books
    translator.translate_book(
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/common/translator.py", line 317, in translate_book
    raise RuntimeError(f"Can't find file {book_path} for book {book}")
RuntimeError: Can't find file /tmp/tmpu35zij1b/hbo_uhb_2024_07_10/33MIC.usfm for book MIC

mmartin9684-sil commented 1 month ago

Note that these projects do not contain a Settings.xml file when they are downloaded. If a minimal Settings.xml file is created for them, it would look like this:

<ScriptureText>
  <Versification>4</Versification>
  <FileNamePostPart>.usfm</FileNamePostPart>
  <FileNameBookNameForm>41-MAT</FileNameBookNameForm>
  <LanguageIsoCode>hbo:::</LanguageIsoCode>
  <BiblicalTermsListSetting>Major::BiblicalTerms.xml</BiblicalTermsListSetting>
  <Naming PrePart="" PostPart=".usfm" BookNameForm="41-MAT" />
</ScriptureText>

Although the BookNameForm is specified correctly and matches the USFM file names in the project folder, the translate script does not handle this file naming format, so the files can't be opened for translation.
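
To illustrate the mismatch, here is a minimal Python sketch of how a file name could be derived from a BookNameForm value. It is not the silnlp or Paratext implementation. The helper names (`usfm_file_name`, `find_book_file`), the list of supported forms, and the two-entry book number table are hypothetical; the numbers are taken only from the names that appear in this issue (33MIC, 41-MAT).

```python
from pathlib import Path
from typing import Optional

# Hypothetical subset of book numbers, taken from the names seen in this
# issue (33MIC, 41-MAT). A real table would cover every book.
BOOK_NUMBERS = {"MIC": 33, "MAT": 41}

# Assumed set of BookNameForm patterns to support in this sketch.
BOOK_NAME_FORMS = ("41MAT", "41-MAT", "MAT")


def usfm_file_name(book: str, book_name_form: str,
                   pre_part: str = "", post_part: str = ".usfm") -> str:
    """Build the file name for `book` following the given BookNameForm pattern."""
    number = f"{BOOK_NUMBERS[book]:02d}"
    if book_name_form == "41MAT":
        stem = f"{number}{book}"
    elif book_name_form == "41-MAT":
        # The hyphenated form used by the Door43 projects in this issue.
        stem = f"{number}-{book}"
    elif book_name_form == "MAT":
        stem = book
    else:
        raise ValueError(f"Unsupported BookNameForm: {book_name_form}")
    return f"{pre_part}{stem}{post_part}"


def find_book_file(project_dir: Path, book: str) -> Optional[Path]:
    """Return the first existing file for `book` under any known form, else None."""
    for form in BOOK_NAME_FORMS:
        candidate = project_dir / usfm_file_name(book, form)
        if candidate.is_file():
            return candidate
    return None
```

With this sketch, `usfm_file_name("MIC", "41-MAT")` yields `33-MIC.usfm`, whereas the path in the traceback above, `33MIC.usfm`, corresponds to the unhyphenated `41MAT` form.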

ddaspit commented 1 month ago

Door43 has its own metadata format for translations called resource containers. We should add support for extracting from Door43 resource containers.