tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Change to link detection regex #267

Closed kristian-clausal closed 3 months ago

kristian-clausal commented 3 months ago

Should fix #266

The previous version did not allow for links with newlines in the text portion of the link:

[[this link|

should be

accepted]]

Newlines are not allowed in the link data portion ('this link').

Because of the negative lookahead stuff to detect [[, ]] and [, ] pairs inside the links, it's a real monster of a regex that's hard to read.

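I also found a minor bug that effectively disabled part of the regex, so I just removed that part: the character class [\n] was missing the ^ needed to negate \n, so it matched only newlines instead of everything except newlines.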
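To illustrate the kind of bug described above, here is a minimal sketch of how a character class behaves with and without the `^`. The pattern names and the sample string are hypothetical and are not taken from the actual regex in this PR.

```python
import re

# Hypothetical patterns for illustration only; not the regex from the PR.
without_caret = re.compile(r"[\n]+")   # matches runs of newline characters only
with_caret = re.compile(r"[^\n]+")     # matches runs of anything except newlines

sample = "this link\nmore"

newline_only = without_caret.findall(sample)
non_newline = with_caret.findall(sample)

print(newline_only)
print(non_newline)
```

Without the `^`, the class can only ever consume newlines, so a branch that was meant to match "any text up to a newline" would almost never apply.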

Also added a negative lookbehind: a link starting with [[[ is parsed as text.

xxyzz commented 3 months ago

This feels like something a parser would do: look at neighboring tokens to decide how to parse a token. I wonder which is more painful: using a better parser (maybe mwparserfromhell?) or adding more regex...