tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Change external links `[...]` regex #285

Closed (by kristian-clausal, 2 months ago)

kristian-clausal commented 2 months ago

`[test]]` is not a valid external link, so prevent it from matching with a negative lookahead. This caused problems in cases like:

{{quote-book|en|author=[[w:Theodore Beza|Theodore de Beza]] |tlr=[[w:Robert Fills|R[obert] F[ills]]]......

Here the open `[` bracket interferes with the last `]]` token, which made the later LINKS_RE pattern call take extremely long.
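A minimal sketch of the negative-lookahead idea follows. The two patterns are hypothetical and much simpler than the real external-link regex in wikitextprocessor (they do not check for a URL, for instance). They only show how `(?!\])` stops a bracketed span from consuming the first `]` of a `]]` token.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the regexes
# used in wikitextprocessor.
# Naive: any [...] span that contains no nested brackets.
NAIVE_EXT_LINK_RE = re.compile(r"\[([^\[\]]+)\]")
# Guarded: same, but refuse to match if the closing ] is directly
# followed by another ], so a ]] token is never split.
GUARDED_EXT_LINK_RE = re.compile(r"\[([^\[\]]+)\](?!\])")


def bracket_spans(pattern, text):
    """Return every substring of text matched by pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


sample = "[[w:Robert Fills|R[obert] F[ills]]]"

# The naive pattern also grabs "[ills]", taking the first of the three ].
print(bracket_spans(NAIVE_EXT_LINK_RE, sample))    # ['[obert]', '[ills]']
# The guarded pattern leaves "[ills]]]" alone.
print(bracket_spans(GUARDED_EXT_LINK_RE, sample))  # ['[obert]']
```

In this simplified model `[obert]` still matches because no URL check is done; the point is only that `[ills]` no longer does.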

xxyzz commented 2 months ago

More information: the original wikitext is `[[w:Robert Fills|R[obert] F[ills]]]` on the page "earnestlier", and many author links in example templates are written in this format.

I don't know why they're not using `[[w:Robert Fills|Robert Fills]]` (or `tlr=w:Robert Fills`), but I guess the use of `]]]` may be because of a MediaWiki bug: `[[w:Robert Fills|R[obert] F[ills]]]` can't render correctly in MediaWiki.

kristian-clausal commented 2 months ago

I think `[[w: RF|...F[ills]]]` would render as Robert Fills]. The `]]` is eaten by the link and the last `]` is orphaned, because the [brackets] aren't parsed as external links (they contain no URL). That is the expected behavior. Ours was the one that was wrong, because our external link regex ate `[ills]]`, breaking up the `]]` token.
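As a rough illustration of the behavior described above, here is a toy model in which a link is closed by the first `]]` after its opening `[[`. The helper `close_at_first_double_bracket` is hypothetical and is not how MediaWiki or wikitextprocessor actually parse links. It only shows where the orphaned `]` comes from.

```python
def close_at_first_double_bracket(text):
    """Toy model: split text at the first ]] after the first [[.

    Returns (link_body, leftover), or None if no complete [[...]] is found.
    Hypothetical helper, for illustration only.
    """
    start = text.find("[[")
    if start == -1:
        return None
    end = text.find("]]", start + 2)
    if end == -1:
        return None
    return text[start + 2:end], text[end + 2:]


link_body, leftover = close_at_first_double_bracket(
    "[[w:Robert Fills|R[obert] F[ills]]]"
)
print(link_body)  # w:Robert Fills|R[obert] F[ills
print(leftover)   # ]
```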

EDIT:

The page now parses quickly, so the bug is fixed for this kind of formatting.