sillsdev / silnlp

A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages.

ValueError: Some books failed to translate #444

Open jcuenod opened 2 weeks ago

jcuenod commented 2 weeks ago

I'm repeatedly hitting these errors with various source projects. It looks like a USFM parsing issue, judging by the SyntaxError in the middle of each traceback.

Here's one translating the book of Psalms using the NIV11 as a source:

2024-07-08 08:55:00,728 - silnlp.nmt.translate - INFO - Translating PSA ...
2024-07-08 08:55:00,777 - silnlp.common.translate - INFO - Found the file /home/klaatu/silnlp_data/Paratext/projects/NIV11UK/19PSAukNIV11.SFM for book PSA
2024-07-08 08:55:00,802 - silnlp.nmt.translate - ERROR - Was not able to translate PSA.
Traceback (most recent call last):
  File "/home/klaatu/silnlp/silnlp/nmt/translate.py", line 123, in translate_books
    translator.translate_book(
  File "/home/klaatu/silnlp/silnlp/common/translator.py", line 321, in translate_book
    self.translate_usfm(
  File "/home/klaatu/silnlp/silnlp/common/translator.py", line 346, in translate_usfm
    doc: List[sfm.Element] = list(usfm.parser(book_file, stylesheet=stylesheet, canonicalise_footnotes=False))
  File "/home/klaatu/silnlp/silnlp/sfm/__init__.py", line 703, in _default_
    self._error(
  File "/home/klaatu/silnlp/silnlp/sfm/__init__.py", line 599, in _error
    raise SyntaxError(msg)
SyntaxError: /home/klaatu/silnlp_data/Paratext/projects/NIV11UK/19PSAukNIV11.SFM: line 338,56: orphan end marker \fm*: no matching opening marker \fm
Traceback (most recent call last):
  File "/home/klaatu/miniconda3/envs/py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/klaatu/miniconda3/envs/py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/klaatu/silnlp/silnlp/nmt/translate.py", line 389, in <module>
    main()
  File "/home/klaatu/silnlp/silnlp/nmt/translate.py", line 358, in main
    translator.translate_books(
  File "/home/klaatu/silnlp/silnlp/nmt/translate.py", line 140, in translate_books
    raise ValueError(f"Some books failed to translate: {' '.join(translation_failed)}")
ValueError: Some books failed to translate: PSA
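
For anyone wanting to locate markers like the `\fm*` above before running a translation, here is a minimal sketch of a pre-check. It is not part of silnlp and does not use its parser; `find_orphan_end_markers` is a hypothetical helper, and it assumes that character and note spans open and close on the same line, which will not hold for every project.

```python
import re
from typing import List, Tuple

# Matches a USFM marker such as \v, \q1, \f, \+nd, and its end form such as \fm*.
# Group 1 is the marker name, group 2 is "*" for an end marker, else "".
MARKER_RE = re.compile(r"\\\+?([a-z]+\d*)(\*?)")


def find_orphan_end_markers(usfm_text: str) -> List[Tuple[int, int, str]]:
    """Return (line, column, marker) for end markers with no opener on the same line.

    Simplifying assumption (not taken from silnlp): spans never cross line breaks.
    """
    orphans = []
    for line_no, line in enumerate(usfm_text.splitlines(), start=1):
        open_markers: List[str] = []
        for match in MARKER_RE.finditer(line):
            name, star = match.group(1), match.group(2)
            if not star:
                # Opening marker: remember it until it is closed.
                open_markers.append(name)
            elif name in open_markers:
                # Closing marker: also drop anything opened inside it,
                # e.g. \fr and \ft are closed implicitly by \f*.
                while open_markers.pop() != name:
                    pass
            else:
                # 1-based column of the backslash that starts the end marker.
                orphans.append((line_no, match.start() + 1, name))
    return orphans


if __name__ == "__main__":
    sample = (
        "\\id PSA\n"
        "\\c 23\n"
        "\\q1 \\v 1 The Lord\\f + \\fr 23:1 \\ft note\\f* is my shepherd\\fm*\n"
    )
    for line_no, col, name in find_orphan_end_markers(sample):
        print(f"line {line_no},{col}: orphan end marker \\{name}*")
```

Running it on the text of a source file lists candidate positions to inspect by hand; it only flags suspects and does not replace the real parser's validation.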

Here's another, translating just PSA23 using BENCLBSI as a source:

2024-07-08 10:08:03,112 - silnlp.common.translate - INFO - Found the file /home/klaatu/silnlp_data/Paratext/projects/BENCLBSI/19PSABCL.SFM for book PSA
2024-07-08 10:08:03,656 - silnlp.common.translate - INFO - File /home/klaatu/silnlp_data/Paratext/projects/BENCLBSI/19PSABCL.SFM parsed correctly.
Loading checkpoint shards: 100%|████████████████████████████████████████| 2/2 [00:02<00:00,  1.46s/it]
100%|███████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  5.28ex/s]
2024-07-08 10:09:07,586 - silnlp.nmt.translate - ERROR - Was not able to translate PSA.
Traceback (most recent call last):
  File "/home/klaatu/silnlp/silnlp/nmt/translate.py", line 110, in translate_books
    translator.translate_book(
  File "/home/klaatu/silnlp/silnlp/common/translator.py", line 321, in translate_book
    self.translate_usfm(
  File "/home/klaatu/silnlp/silnlp/common/translator.py", line 396, in translate_usfm
    trg_doc = list(usfm.parser(tmp_file, canonicalise_footnotes=False))
  File "/home/klaatu/silnlp/silnlp/sfm/__init__.py", line 710, in _default_
    self._error(
  File "/home/klaatu/silnlp/silnlp/sfm/__init__.py", line 599, in _error
    raise SyntaxError(msg)
SyntaxError: /tmp/tmpjatt7irn/tmp.SFM: line 2,6: orphan marker \v: may only occur under \d, \lf, \lh, \li, \li1, \li2, \li3, \li4, \lim, \lim1, \lim2, \lim3, \lim4, \m, \mi, \nb, \p, \pc, \ph, \phi, \pi, \pi1, \pi2, \pi3, \pm, \pmc, \pmo, \pmr, \po, \pr, \q, \q1, \q2, \q3, \q4, \qc, \qd, \qm, \qm1, \qm2, \qm3, \qr, \s3, \sp, \tc1, \tc2, \tc3, \tc4, \tcr1, \tcr2, \tcr3, \tcr4, \tr
Traceback (most recent call last):
  File "/home/klaatu/miniconda3/envs/py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/klaatu/miniconda3/envs/py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/klaatu/silnlp/silnlp/nmt/translate.py", line 389, in <module>
    main()
  File "/home/klaatu/silnlp/silnlp/nmt/translate.py", line 358, in main
    translator.translate_books(
  File "/home/klaatu/silnlp/silnlp/nmt/translate.py", line 140, in translate_books
    raise ValueError(f"Some books failed to translate: {' '.join(translation_failed)}")
ValueError: Some books failed to translate: PSA

isaac091 commented 2 weeks ago

You may be able to resolve some of the errors by using a different value for the --stylesheet-field-update option, which changes how a project's custom USFM stylesheet is handled for the "OccursUnder" and "TextProperties" fields. The possible values are "merge" (default), "ignore", and "replace". I know we've had to use "ignore" for NIV84 in the past.
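
To make the three values concrete, here is a rough sketch of what they plausibly mean for a single field such as "OccursUnder". This is an illustration of the idea only, not silnlp's code: `update_field` is a hypothetical helper, and the exact semantics (union for "merge", custom values discarded for "ignore", custom values winning for "replace") are an assumption inferred from the option names.

```python
from typing import Set


def update_field(base: Set[str], custom: Set[str], mode: str = "merge") -> Set[str]:
    """Combine a default stylesheet field with a project's custom values.

    Hypothetical helper; the per-mode behaviour is assumed, not confirmed.
    """
    if mode == "merge":
        # Keep the default values and add the project's own.
        return base | custom
    if mode == "ignore":
        # Disregard the project's custom values for this field.
        return set(base)
    if mode == "replace":
        # Let the project's custom values override the defaults.
        return set(custom)
    raise ValueError(f"Unknown mode: {mode!r}")


if __name__ == "__main__":
    # Made-up example: markers that \v is allowed to occur under.
    default_occurs_under = {"p", "q1", "q2"}
    custom_occurs_under = {"d"}
    for mode in ("merge", "ignore", "replace"):
        print(mode, sorted(update_field(default_occurs_under, custom_occurs_under, mode)))
```

Under these assumptions, "ignore" would help when a project's custom stylesheet is the source of the restriction that triggers the orphan-marker error.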

ddaspit commented 2 weeks ago

We really need to switch the USFM parsing over to Machine. This is clearly becoming a higher and higher priority. @isaac091, how would you feel about taking on that task?

isaac091 commented 2 weeks ago

Sure, I would be happy to.

jcuenod commented 2 weeks ago

@isaac091, thanks, I'll give --stylesheet-field-update a try. Looking forward to seeing these issues resolved :)

jcuenod commented 17 hours ago

FYI, this was really useful to know. Thanks!