tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Ignore contents of whitespace-only lines #338

Closed kristian-clausal closed 1 week ago

kristian-clausal commented 1 week ago

Issue #336

A line containing only whitespace, such as \s\t, should not trigger a PREFORMATTED block. The easiest way to handle this is to ignore the contents of a line entirely when it consists only of whitespace.
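As a rough illustration of the idea (not the actual wikitextprocessor code), a whitespace-only check can gate the preformatted-block detection. The function names `is_blank` and `starts_preformatted` are hypothetical, and the leading-space rule is a simplified stand-in for the real tokenizer logic:

```python
def is_blank(line: str) -> bool:
    """Return True if the line is empty or contains only whitespace."""
    return line.strip() == ""


def starts_preformatted(line: str) -> bool:
    """Simplified, hypothetical check: a leading space starts a
    preformatted block unless the whole line is whitespace."""
    return line.startswith(" ") and not is_blank(line)
```

With this sketch, `starts_preformatted(" \t")` is false, while `starts_preformatted(" code")` is true.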

Because the lines are split with a regex on "(\n+)", no 'line' content ever contains a newline character. We therefore need to look at every tokenizing regex that has \n in it; those newlines never match anything.

The newlines are tokenized because re.split, when given a capture group, alternates between the split text and the separator text: ["text", "\n", "text", "\n\n"].
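A small standalone sketch of that `re.split` behaviour, using only the standard library. The helper name `split_lines` and the sample input are made up for illustration and are not taken from the project:

```python
import re


def split_lines(text: str) -> list[str]:
    """Split on runs of newlines, keeping the separators because the
    pattern contains a capture group."""
    return re.split(r"(\n+)", text)


parts = split_lines("text\nmore\n\nend")
# Even indices hold line contents, odd indices hold the newline runs.
contents = parts[::2]
separators = parts[1::2]
```

Here `parts` is `['text', '\n', 'more', '\n\n', 'end']`, so none of the items in `contents` can contain a newline, which is why a `\n` inside a per-line tokenizing regex has nothing to match.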

kristian-clausal commented 1 week ago

This only fixes the issue in the first post with tabs.