tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Parsing of the Wiktionary page "seis" fails for some languages #113

Closed Vuizur closed 2 years ago

Vuizur commented 2 years ago

For some reason, all data that should be in the Spanish section is added to Scots instead, as can be seen in the kaikki data: https://kaikki.org/dictionary/All%20languages%20combined/meaning/s/se/seis.html I looked a bit at the Wiktionary source, but couldn't find the reason.

PS: This project is really amazing 👍

tatuylonen commented 2 years ago

Parsing seems to fail because {{sco-third-person|...}} under Scots expands to invalid HTML, with an unterminated <span>. The parser had special code for handling subtitles that did not close open HTML tags (probably because I've seen subtitles inside HTML somewhere). I changed that code so that <span> tags are now closed at subtitles. With that change, this page seems to parse correctly. This will probably fix many (thousands?) similar errors on other pages: probably every page using that template is affected, and possibly pages using other templates too. The fix should be reflected on https://kaikki.org in a couple of days.
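
For readers who want to see the idea in isolation, here is a minimal sketch of closing unterminated <span> tags at subtitles. It is not the actual wiktextract or parser code; the function name `close_spans_at_subtitles`, the regular expression, and the sample text are all assumptions made up for illustration, and a real parser works on a token tree rather than on raw text.

```python
import re

# Hypothetical token pattern (an assumption, not taken from the project):
# an opening <span ...>, a closing </span>, or a wikitext-style subtitle
# line such as "==Spanish==".
TOKEN_RE = re.compile(
    r"(?P<open><span\b[^>]*>)"
    r"|(?P<close></span\s*>)"
    r"|(?P<heading>^=+[^=\n]+=+[ \t]*$)",
    re.MULTILINE,
)


def close_spans_at_subtitles(text: str) -> str:
    """Insert </span> before each subtitle for every span still open."""
    out = []
    open_spans = 0  # number of <span> tags opened but not yet closed
    pos = 0
    for m in TOKEN_RE.finditer(text):
        out.append(text[pos:m.start()])  # copy text before this token
        pos = m.end()
        if m.lastgroup == "open":
            # A self-closing <span/> does not leave anything open.
            if not m.group().endswith("/>"):
                open_spans += 1
            out.append(m.group())
        elif m.lastgroup == "close":
            # Keep a closing tag only if it matches an open span;
            # stray closing tags are dropped in this sketch.
            if open_spans > 0:
                open_spans -= 1
                out.append(m.group())
        else:
            # Subtitle: close everything still open so the next
            # language section does not end up nested inside the span.
            out.append("</span>" * open_spans)
            open_spans = 0
            out.append(m.group())
    out.append(text[pos:])
    out.append("</span>" * open_spans)  # close anything left at the end
    return "".join(out)


sample = '==Scots==\n<span class="x">seis\n==Spanish==\nsix\n'
print(close_spans_at_subtitles(sample))
```

In the sample, the span opened under the Scots subtitle is closed just before the Spanish subtitle, so the Spanish content is no longer swallowed by the preceding section.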

Vuizur commented 2 years ago

It works great now 👍 Thanks for fixing it!