tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Removing newlines from around HTML comments breaks some things #342

Closed: kristian-clausal closed this 2 hours ago

kristian-clausal commented 2 hours ago

Kaikki has been down while I was out sick, because of a bunch of exceptions caused by comments in Catalan entries on the English Wiktionary:

==Catalan==

===Pronunciation===
* {{ca-IPA|ë}}<!-- per GDLC, DNV; not in DCVB but as a recent borrowing we would expect ë in Balearic -->
* {{audio|ca|LL-Q7026 (cat)-Unjoanqualsevol-euro.wav}}

The comment and the newline after it are removed, and the parser sees:

* {{ca-IPA|ë}}* {{audio|ca|LL-Q7026 (cat)-Unjoanqualsevol-euro.wav}}

This messes with some silly code in /en/pronunciation.py. It can be tested with euro/Catalan.
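To make the failure mode concrete, here is a minimal sketch. It is not wikitextprocessor's actual code: the two regexes and the helper names are hypothetical, and the comment text is shortened. It only shows how also eating the newline after a comment glues two list items into one line, while removing just the comment keeps them apart.

```python
import re

# Hypothetical patterns for illustration only; NOT the library's real code.
# Variant A also swallows one newline on either side of the comment.
COMMENT_WITH_NEWLINES = re.compile(r"\n?<!--.*?-->\n?", re.DOTALL)
# Variant B removes only the comment itself.
COMMENT_ONLY = re.compile(r"<!--.*?-->", re.DOTALL)

wikitext = (
    "* {{ca-IPA|ë}}<!-- per GDLC, DNV -->\n"
    "* {{audio|ca|LL-Q7026 (cat)-Unjoanqualsevol-euro.wav}}\n"
)


def strip_comments_and_newlines(text: str) -> str:
    """Remove comments together with adjacent newlines (the breaking variant)."""
    return COMMENT_WITH_NEWLINES.sub("", text)


def strip_comments_only(text: str) -> str:
    """Remove comments but leave the surrounding newlines alone."""
    return COMMENT_ONLY.sub("", text)


merged = strip_comments_and_newlines(wikitext)
kept = strip_comments_only(wikitext)

# Variant A leaves a single line containing both list items.
print(len(merged.splitlines()))
# Variant B leaves two separate list items.
print(len(kept.splitlines()))
```

With variant A the second `*` is no longer at the start of a line, so a parser would not treat it as a new list item.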

Do we need to remove newlines before comments? I'm not going to touch it yet in case you want to, @xxyzz.

kristian-clausal commented 2 hours ago

The fix was actually trivial. It resolved the issues I had with euro/Catalan, and all tests pass on wikitextprocessor and wiktextract, so I'm going to merge this just to get kaikki going again.