tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Add `//` to allowed URL prefixes #254

Closed kristian-clausal closed 4 months ago

kristian-clausal commented 4 months ago

// is a wiki-internal URL prefix that means 'take the scheme of the current page (usually https:// or http://) and substitute it for this //'. We don't resolve URLs like this ourselves, and it wouldn't make sense for us to do so without a website that a user is accessing, so for now let's just accept // links as valid links.
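
To illustrate the idea, here is a minimal sketch, not the actual code in this PR. The tuple `EXAMPLE_URL_PREFIXES` and the helpers `has_allowed_prefix` and `resolve_against` are hypothetical names made up for this example; the real `URL_STARTS` list may contain different entries and be checked differently. The sketch shows a plain prefix check that accepts `//` links, and, for comparison, how a browser-like consumer would resolve such a link against a page URL using the standard library's `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin

# Hypothetical prefix list for illustration only; the real URL_STARTS
# in wikitextprocessor may differ.
EXAMPLE_URL_PREFIXES = ("http://", "https://", "ftp://", "mailto:", "//")


def has_allowed_prefix(url: str) -> bool:
    """Return True if url starts with one of the example prefixes."""
    # str.startswith accepts a tuple and checks each entry in turn.
    return url.startswith(EXAMPLE_URL_PREFIXES)


def resolve_against(page_url: str, link: str) -> str:
    """Resolve a link the way a browser would, relative to page_url.

    A link starting with // inherits the scheme of page_url.
    """
    return urljoin(page_url, link)
```

With this sketch, `has_allowed_prefix("//example.org/page")` is accepted without being resolved, while `resolve_against("https://en.wiktionary.org/wiki/word", "//example.org/page")` gives `https://example.org/page`. Offline processing has no page URL to resolve against, which is why accepting the link as-is is the simpler choice here.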

There is a parallel change on the Wiktextract side, in clean_values, where we also start using URL_STARTS to check for valid URLs.