giellatekno / neahttadigisanit

Saami dictionary webapp

XML parsing error with empty t-nodes #41

Open trondtynnol opened 1 month ago

trondtynnol commented 1 month ago

In sanj, when searching for варежка, at least the first time after a restart, NDS is unable to parse the entry and gives an empty result:

Screenshot 2024-10-03 at 08-52-42 Neahttadigisánit - саамский словарь

When running this locally, the following error was produced:

127.0.0.1 - - [03/Oct/2024 08:39:11] "GET /autocomplete/rus/sjd/?lookup=варежка HTTP/1.1" 200 -
варежка+N+Fem+Inan+Sg+Nom
варежка+N+Fem+Inan+Sg+Nom
[2024-10-03 08:39:12,455] ERROR in formatters: Potential XML formatting problem somewhere in... 

b'<e>\n    <lg>\n      <l pos="N" pos_txt="\xd1\x81\xd1\x83\xd1\x89.">\xd0\xb2\xd0\xb0\xd1\x80\xd0\xb5\xd0\xb6\xd0\xba\xd0\xb0</l>\n    </lg>\n    <mg>\n      <tg xml:lang="sjd">\n        <t audio="\xd0\xb2\xd0\xbe\xd0\xb0\xcc\x84\xd1\x85\xd1\x85\xd1\x86.flac">\xd0\xb2\xd0\xbe\xd0\xb0\xcc\x84\xd1\x85\xd1\x85\xd1\x86</t>\n        <xg>\n          <x xml:lang="rus">\xd0\xb2\xd0\xb0\xd1\x80\xd0\xb5\xd0\xb6\xd0\xba\xd0\xb8</x>\n          <xt xml:lang="sjd" audio="\xd0\xb2\xd0\xbe\xd0\xb0\xcc\x84\xd1\x86.flac">\xd0\xb2\xd0\xbe\xd0\xb0\xcc\x84\xd1\x86</xt>\n        </xg>\n        <xg>\n          <x xml:lang="rus">\xd0\xb2\xd0\xb0\xd1\x80\xd0\xb5\xd0\xb6\xd0\xba\xd0\xb8 \xd0\xbe\xd0\xb4\xd0\xb5\xd0\xbd\xd1\x8c</x>\n          <xt xml:lang="sjd">\xd0\xb2\xd0\xbe\xd0\xb0\xcc\x84\xd1\x86\xd1\x8d\xd1\x82\xd2\x8d \xd1\x86\xd0\xb0\xcc\x84\xd0\xb3\xd1\x8c</xt>\n        </xg>\n        <xg>\n          <x xml:lang="rus">\xd1\x81\xd1\x83\xd0\xbd\xd1\x8c \xd1\x80\xd1\x83\xd0\xba\xd1\x83 \xd0\xb2 \xd0\xb2\xd0\xb0\xd1\x80\xd0\xb5\xd0\xb6\xd0\xba\xd1\x83</x>\n          <xt xml:lang="sjd">\xd0\xbd\xd0\xb0\xd0\xb3\xd0\xba\xd0\xb5\xd1\x82\xd2\x8d \xd0\xba\xd3\xa3\xd0\xb4 \xd0\xb2\xd0\xbe\xd0\xb0\xcc\x84\xd1\x85\xd1\x85\xd1\x8c\xd1\x86\xd1\x8d (\xd0\xb2\xd0\xbe\xd0\xb0\xcc\x84\xd1\x85\xd1\x85\xd1\x86\xd0\xb0)</xt>\n        </xg>\n        <xg>\n          <x xml:lang="rus">\xd0\xbe\xd0\xbd\xd0\xb0 \xd0\xbf\xd0\xbe\xd1\x82\xd0\xb5\xd1\x80\xd1\x8f\xd0\xbb\xd0\xb0 \xd0\xbe\xd0\xb4\xd0\xbd\xd1\x83 \xd0\xb2\xd0\xb0\xd1\x80\xd0\xb5\xd0\xb6\xd0\xba\xd1\x83, \xd1\x82\xd0\xb5\xd0\xbf\xd0\xb5\xd1\x80\xd1\x8c \xd1\x83 \xd0\xbd\xd0\xb5\xd1\x91 \xd1\x82\xd0\xbe\xd0\xbb\xd1\x8c\xd0\xba\xd0\xbe \xd0\xbe\xd0\xb4\xd0\xbd\xd0\xb0 \xd0\xb2\xd0\xb0\xd1\x80\xd0\xb5\xd0\xb6\xd0\xba\xd0\xb0</x>\n          <xt xml:lang="sjd">\xd1\x81\xd0\xbe\xcc\x84\xd0\xbd\xd0\xbd \xd0\xba\xd0\xb0\xcc\x84\xd0\xb4\xd1\x8d\xd1\x85\xd1\x8c\xd1\x82 \xd1\x8d\xd1\x84\xd1\x82 \xd0\xb2\xd0\xbe\xd0\xb0\xcc\x84\xd1\x86, 
\xd0\xb0\xd0\xb4\xd1\x82\xd2\x8d \xd1\x81\xd0\xbe\xd1\x81\xd1\x82 \xd0\xbb\xd1\x8b\xd1\x88\xd1\x88\xd1\x8d \xd0\xbb\xd3\xa3 \xd1\x8d\xcc\x84\xd1\x85\xd1\x85\xd1\x82 \xd0\xb2\xd0\xbe\xd0\xb0\xcc\x84\xd1\x85\xd1\x85\xd1\x86</xt>\n        </xg>\n        <xg>\n          <x xml:lang="rus">\xd1\x8f \xd0\xb2\xd1\x8f\xd0\xb6\xd1\x83 \xd0\xb2\xd0\xb0\xd1\x80\xd0\xb5\xd0\xb6\xd0\xba\xd0\xb8 \xd1\x81 \xd1\x81\xd0\xb0\xd0\xb0\xd0\xbc\xd1\x81\xd0\xba\xd0\xb8\xd0\xbc\xd0\xb8 \xd1\x83\xd0\xb7\xd0\xbe\xd1\x80\xd0\xb0\xd0\xbc\xd0\xb8</x>\n          <xt xml:lang="sjd">\xd0\xbc\xd1\x83\xd0\xbd\xd0\xbd \xd0\xba\xd0\xbe\xd0\xb0\xd0\xb4\xd0\xb0 \xd1\x81\xd0\xb0\xcc\x84\xd0\xbc\xd1\x8c \xd0\xba\xd1\x8b\xd1\x80\xd1\x80\xd1\x8c\xd0\xb9 \xd0\xb2\xd0\xbe\xd0\xb0\xcc\x84\xd1\x86\xd1\x8d\xd1\x82\xd2\x8d</xt>\n        </xg>\n        <xg>\n          <x xml:lang="rus">\xd0\xbc\xd0\xb0\xd0\xbc\xd0\xb0 \xd0\xbf\xd1\x80\xd1\x8f\xd0\xb6\xd1\x91\xd1\x82 \xd0\xbf\xd1\x80\xd1\x8f\xd0\xb6\xd1\x83 \xd0\xbd\xd0\xb0 \xd0\xb2\xd0\xb0\xd1\x80\xd0\xb5\xd0\xb6\xd0\xba\xd0\xb8 (\xd0\xbf\xd1\x80\xd0\xb5\xd0\xb2\xd1\x80\xd0\xb0\xd1\x89\xd0\xb0\xd0\xb5\xd1\x82 \xd0\xbf\xd1\x80\xd1\x8f\xd0\xb6\xd1\x83 \xd0\xb2 \xd0\xb2\xd0\xb0\xd1\x80\xd0\xb5\xd0\xb6\xd0\xba\xd0\xb8)</x>\n          <xt xml:lang="sjd">\xd1\x8f\xcc\x84\xd0\xbd\xd0\xbd\xd0\xb0 \xd0\xbf\xd0\xbe\xd0\xb0\xd0\xbd\xd0\xbd \xd1\x83\xd0\xbb\xd0\xbb\xd1\x8d\xd1\x82\xd2\x8d \xd0\xb2\xd0\xbe\xd0\xb0\xcc\x84\xd1\x85\xd1\x85\xd1\x86\xd1\x8d\xd0\xbd\xd2\x8d</xt>\n        </xg>\n      </tg>\n    </mg>\n    <mg>\n      <tg xml:lang="sjd">\n        <t>word_not _yet_translated</t>\n      </tg>\n    </mg>\n  </e>'

Traceback (most recent call last):
  File "/home/trond/gt/github/giellatekno/neahttadigisanit/neahtta/neahtta/nds_lexicon/formatters.py", line 176, in __iter__
    yield self.clean(node)
  File "/home/trond/gt/github/giellatekno/neahttadigisanit/neahtta/neahtta/nds_lexicon/formatters.py", line 383, in clean
    _right = list(map(lambda tg: self.clean_tg_node(e, tg), tgs))
  File "/home/trond/gt/github/giellatekno/neahttadigisanit/neahtta/neahtta/nds_lexicon/formatters.py", line 383, in <lambda>
    _right = list(map(lambda tg: self.clean_tg_node(e, tg), tgs))
  File "/home/trond/gt/github/giellatekno/neahttadigisanit/neahtta/neahtta/nds_lexicon/formatters.py", line 286, in clean_tg_node
    "Potential XML formatting problem while processing <tg /> nodes.\n\n"
TypeError: can only concatenate str (not "bytes") to str

()

{'target_lang': 'sjd', 'source_lang': 'rus', 'ui_lang': 'sjd', 'user_input': 'варежка'}
<template>:264: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
127.0.0.1 - - [03/Oct/2024 08:39:12] "GET /rus/sjd/?lookup=варежка&search= HTTP/1.1" 200 -

That is a bit hard to read with all the encoded Cyrillic, so here it is decoded:

127.0.0.1 - - [03/Oct/2024 08:39:11] "GET /autocomplete/rus/sjd/?lookup=варежка HTTP/1.1" 200 -
варежка+N+Fem+Inan+Sg+Nom
варежка+N+Fem+Inan+Sg+Nom
[2024-10-03 08:39:12,455] ERROR in formatters: Potential XML formatting problem somewhere in... 

b'<e>\n    <lg>\n      <l pos="N" pos_txt="сущ.">варежка</l>\n    </lg>\n    <mg>\n      <tg xml:lang="sjd">\n        <t audio="воа̄ххц.flac">воа̄ххц</t>\n        <xg>\n          <x xml:lang="rus">варежки</x>\n          <xt xml:lang="sjd" audio="воа̄ц.flac">воа̄ц</xt>\n        </xg>\n        <xg>\n          <x xml:lang="rus">варежки одень</x>\n          <xt xml:lang="sjd">воа̄цэтҍ ца̄гь</xt>\n        </xg>\n        <xg>\n          <x xml:lang="rus">сунь руку в варежку</x>\n          <xt xml:lang="sjd">нагкетҍ кӣд воа̄ххьцэ (воа̄ххца)</xt>\n        </xg>\n        <xg>\n          <x xml:lang="rus">она потеряла одну варежку, теперь у неё только одна варежка</x>\n          <xt xml:lang="sjd">со̄нн ка̄дэхьт эфт воа̄ц, адтҍ сост лышшэ лӣ э̄ххт воа̄ххц</xt>\n        </xg>\n        <xg>\n          <x xml:lang="rus">я вяжу варежки с саамскими узорами</x>\n          <xt xml:lang="sjd">мунн коада са̄мь кыррьй воа̄цэтҍ</xt>\n        </xg>\n        <xg>\n          <x xml:lang="rus">мама пряжёт пряжу на варежки (превращает пряжу в варежки)</x>\n          <xt xml:lang="sjd">я̄нна поанн уллэтҍ воа̄ххцэнҍ</xt>\n        </xg>\n      </tg>\n    </mg>\n    <mg>\n      <tg xml:lang="sjd">\n        <t>word_not _yet_translated</t>\n      </tg>\n    </mg>\n  </e>'

Traceback (most recent call last):
  File "/home/trond/gt/github/giellatekno/neahttadigisanit/neahtta/neahtta/nds_lexicon/formatters.py", line 176, in __iter__
    yield self.clean(node)
  File "/home/trond/gt/github/giellatekno/neahttadigisanit/neahtta/neahtta/nds_lexicon/formatters.py", line 383, in clean
    _right = list(map(lambda tg: self.clean_tg_node(e, tg), tgs))
  File "/home/trond/gt/github/giellatekno/neahttadigisanit/neahtta/neahtta/nds_lexicon/formatters.py", line 383, in <lambda>
    _right = list(map(lambda tg: self.clean_tg_node(e, tg), tgs))
  File "/home/trond/gt/github/giellatekno/neahttadigisanit/neahtta/neahtta/nds_lexicon/formatters.py", line 286, in clean_tg_node
    "Potential XML formatting problem while processing <tg /> nodes.\n\n"
TypeError: can only concatenate str (not "bytes") to str

()

{'target_lang': 'sjd', 'source_lang': 'rus', 'ui_lang': 'sjd', 'user_input': 'варежка'}
<template>:264: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
127.0.0.1 - - [03/Oct/2024 08:39:12] "GET /rus/sjd/?lookup=варежка&search= HTTP/1.1" 200 -

I have a hunch that the cause is the empty t-node in this entry (probably an error in the conversion script, but NDS should in principle handle empty t-nodes):

<e>
    <lg>
      <l pos="N" pos_txt="сущ.">варежка</l>
    </lg>
    <mg>
      <tg xml:lang="sjd">
        <t audio="воа̄ххц.flac">воа̄ххц</t>
        <xg>
          <x xml:lang="rus">варежки</x>
          <xt xml:lang="sjd" audio="воа̄ц.flac">воа̄ц</xt>
        </xg>
        <xg>
          <x xml:lang="rus">варежки одень</x>
          <xt xml:lang="sjd">воа̄цэтҍ ца̄гь</xt>
        </xg>
        <xg>
          <x xml:lang="rus">сунь руку в варежку</x>
          <xt xml:lang="sjd">нагкетҍ кӣд воа̄ххьцэ (воа̄ххца)</xt>
        </xg>
        <xg>
          <x xml:lang="rus">она потеряла одну варежку, теперь у неё только одна варежка</x>
          <xt xml:lang="sjd">со̄нн ка̄дэхьт эфт воа̄ц, адтҍ сост лышшэ лӣ э̄ххт воа̄ххц</xt>
        </xg>
        <xg>
          <x xml:lang="rus">я вяжу варежки с саамскими узорами</x>
          <xt xml:lang="sjd">мунн коада са̄мь кыррьй воа̄цэтҍ</xt>
        </xg>
        <xg>
          <x xml:lang="rus">мама пряжёт пряжу на варежки (превращает пряжу в варежки)</x>
          <xt xml:lang="sjd">я̄нна поанн уллэтҍ воа̄ххцэнҍ</xt>
        </xg>
      </tg>
    </mg>
    <mg>
      <tg xml:lang="sjd">
        <t></t>
      </tg>
    </mg>
  </e>

On subsequent searches, however, NDS manages to do this search and produces the expected result:

Screenshot 2024-10-03 at 09-01-30 Neahttadigisánit - саамский словарь

trondtynnol commented 1 month ago

I should note I have fixed the conversion error that caused this XML for rus-sjd, so it is no longer a problem there. I still do suspect it may be something that we should look into if we have the time, but it is not high priority.

Phaqui commented 1 month ago

sanj.gtdict-02.uit.no (the new server) looked like your screenshot with the empty 2. for me. sanj.oahpa.no looked good. Locally, it also looks good.

The giella-core/dicts/scripts/merge_giella_dicts.py script just merges the <e> elements, without doing any other checks. The nds compile project command just runs merge_giella_dicts.

I am able to reproduce the error if I insert an <e> that contains nothing into a dictionary. I get a blank page in NDS, and see this error.
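Since the merge step does no checking, a small pre-merge check could flag such entries before they reach NDS. The following is only a sketch: it uses the standard library's xml.etree.ElementTree rather than anything from giella-core, the find_empty_nodes helper is hypothetical, and the <r> root element and the sample entries are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample dictionary: one entry with an empty <t>, one empty <e>.
SAMPLE = """<r>
  <e>
    <lg><l pos="N">варежка</l></lg>
    <mg><tg xml:lang="sjd"><t>воа̄ххц</t></tg></mg>
    <mg><tg xml:lang="sjd"><t></t></tg></mg>
  </e>
  <e></e>
</r>"""


def find_empty_nodes(root):
    """Return (tag, lemma) pairs for empty <e> elements and empty <t> nodes."""
    problems = []
    for e in root.iter("e"):
        # Use "?" when the entry has no lemma to report.
        lemma = e.findtext("lg/l", default="?")
        if len(e) == 0:
            # An <e> with no child elements at all.
            problems.append(("e", lemma))
            continue
        for t in e.iter("t"):
            # t.text is None for <t></t>, so normalise before testing.
            if not (t.text or "").strip():
                problems.append(("t", lemma))
    return problems


root = ET.fromstring(SAMPLE)
problems = find_empty_nodes(root)
for tag, lemma in problems:
    print(f"empty <{tag}> in entry {lemma!r}")
```

A check like this only reports; whether an empty <t> should block a build is a separate decision, since empty t-nodes are sometimes intentional.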

Phaqui commented 1 month ago

The error actually stems from the debugging code. There is a call to etree.tostring(e, pretty_print=True, encoding="utf-8"), which returns bytes rather than a Python str. Hence the error about not being able to concatenate bytes to a str.

Fixing the issue (by doing a .decode("utf-8") to turn it into a string), makes the code work as intended. The error message is still printed about something being wrong in the .xml, which is okay - it really is missing text in the node.

It looks like this:

screenshot-20241003-120131

I don't really know if that is preferable to just showing a blank screen, or blank entry... This should have been fixed by the dictionary author(s).

Phaqui commented 1 month ago

Commit 4a90e6d3cf0a91d1623f35316fc272748b6dcf27 fixes the issue in the error reporting, but does not address how empty <t> nodes are displayed in any way. Again, I think this should be on the dictionary authors.
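For anyone without the commit at hand, the bytes/str mismatch can be reproduced in isolation. This is a minimal sketch, not the code from the commit: it uses the standard library's xml.etree.ElementTree instead of lxml (so there is no pretty_print argument), and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

entry = ET.fromstring('<e><mg><tg xml:lang="sjd"><t></t></tg></mg></e>')
prefix = "Potential XML formatting problem while processing <tg /> nodes.\n\n"

# With a byte encoding such as "utf-8", tostring() returns bytes, not str.
dumped = ET.tostring(entry, encoding="utf-8")

failure = None
try:
    message = prefix + dumped  # str + bytes raises TypeError
except TypeError as err:
    failure = str(err)

# Decoding first gives a str, so the concatenation succeeds.
message = prefix + dumped.decode("utf-8")
print(failure)
print(message)
```

An alternative to decoding afterwards is to pass encoding="unicode" to tostring(), which returns a str directly.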

trondtynnol commented 1 month ago

> sanj.gtdict-02.uit.no (the new server) looked like your screenshot with the empty 2. for me. sanj.oahpa.no looked good. Locally, it also looks good.

Yeah, I've fixed the source file for rus-sjd and updated on the old server, so that's why it's working now.

> The giella-core/dicts/scripts/merge_giella_dicts.py script just merges the <e> elements, without doing any other checks. The nds compile project command just runs merge_giella_dicts.

Yeah, this was a custom script (xlsx to xml), so that was the source of this error.

trondtynnol commented 1 month ago

> The error actually stems from the debugging code. There is a line etree.tostring(e, pretty_print=True, encoding="utf-8"), which is not a python str, but a bytes. Hence the error about not being able to concatenate bytes to strings.
>
> Fixing the issue (by doing a .decode("utf-8") to turn it into a string), makes the code work as intended. The error message is still printed about something being wrong in the .xml, which is okay - it really is missing text in the node.

That sounds great.

> I don't really know if that is preferable to just showing a blank screen, or blank entry... This should have been fixed by the dictionary author(s).

That would of course be preferable, but we have so many dictionaries with various small mistakes and no maintainer, so it would be best if NDS accepts minor "mistakes". Having an empty t-node is also sometimes needed if you want to display a paradigm for a word for which there is yet no translation due to the way NDS works.
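One way to accept an empty t-node is to substitute a placeholder when collecting translations, so the entry still renders. The sketch below is an illustration only, not how NDS's formatters work: the translations helper and the PLACEHOLDER text are hypothetical, and it uses the standard library's xml.etree.ElementTree rather than lxml.

```python
import xml.etree.ElementTree as ET

PLACEHOLDER = "(no translation yet)"  # hypothetical display text


def translations(tg, placeholder=PLACEHOLDER):
    """Collect the text of each <t> in a <tg>, using a placeholder for empty ones."""
    out = []
    for t in tg.findall("t"):
        # t.text is None for <t></t>; treat whitespace-only text as empty too.
        text = (t.text or "").strip()
        out.append(text if text else placeholder)
    return out


entry = ET.fromstring(
    "<e>"
    "<mg><tg><t>воа̄ххц</t></tg></mg>"
    "<mg><tg><t></t></tg></mg>"
    "</e>"
)
results = [translations(tg) for tg in entry.iter("tg")]
print(results)
```

With this approach the second meaning group shows the placeholder instead of breaking the whole entry, which would still allow a paradigm to be displayed for a word that has no translation yet.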

trondtynnol commented 1 month ago

Seems that this could be closed now if you agree, @Phaqui