tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Determine a HTML tag is self-closing if it ends with "/>" #252

Closed: xxyzz closed this 4 months ago

xxyzz commented 4 months ago

Fixes tatuylonen/wiktextract#535

It works for now, but we'll hit more errors with this regex in the future; only a real HTML parser could fix them.
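To illustrate the trade-off, here is a minimal sketch that is not the code from this PR. `TAG_RE`, `is_self_closing_regex`, `SelfClosingDetector`, and `self_closing_tags` are hypothetical names made up for the example. It compares a simple "ends with `/>`" regex check against Python's standard-library `html.parser`, which reports XHTML-style empty tags through `handle_startendtag`.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern for illustration only; NOT the regex used in
# wikitextprocessor. It assumes a tag never contains "<" or ">" inside.
TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>")


def is_self_closing_regex(tag_text: str) -> bool:
    """Treat a tag as self-closing if it looks like a tag and ends with '/>'."""
    return TAG_RE.fullmatch(tag_text) is not None and tag_text.endswith("/>")


class SelfClosingDetector(HTMLParser):
    """Collect the names of tags written in the '<tag ... />' form."""

    def __init__(self) -> None:
        super().__init__()
        self.self_closing: list[str] = []

    def handle_startendtag(self, tag, attrs):
        # Called by HTMLParser for XHTML-style empty tags such as <br/>.
        self.self_closing.append(tag)


def self_closing_tags(text: str) -> list[str]:
    """Return the names of all '<tag ... />' tags found in text."""
    parser = SelfClosingDetector()
    parser.feed(text)
    parser.close()
    return parser.self_closing
```

The regex handles the simple cases (`<br/>`, `<ref name="a"/>`) but gives up on a tag such as `<a title="x > y"/>`, where a quoted attribute value contains `>`; the parser-based version tokenizes quoted values and still recognizes the tag.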

kristian-clausal commented 4 months ago

This seems like a slam dunk.

xxyzz commented 4 months ago

Any performance impact is probably negligible for simple patterns. Speaking of performance, the extraction time for the fr edition has dropped back to 40 minutes; maybe some commits after #238 improved the speed.