mediawiki-utilities / python-mediawiki-utilities

A set of utilities for accessing and processing MediaWiki data.
http://pythonhosted.org/mediawiki-utilities/
MIT License

wikitext_split breaks up urls #35

Open legoktm opened 9 years ago

legoktm commented 9 years ago

Hi!

I was trying to use the persistence code to identify when a specific URL was added to an article, but ran into an issue with the wikitext_split function breaking up URLs:

>>> from mw.lib.persistence.tokenization import wikitext_split
>>> wikitext_split('Something blah blah http://foobar.com')
['Something', ' ', 'blah', ' ', 'blah', ' ', 'http', ':', '/', '/', 'foobar', '.', 'com']

It would be nice if URLs were special-cased and kept together.

halfak commented 9 years ago

It seems like this would be possible. We have some good options for a URL regex. See https://mathiasbynens.be/demo/url-regex
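As a rough illustration of the idea (not the actual wikitext_split implementation), a regex-based tokenizer can keep URLs whole by listing a URL alternative before the word and punctuation alternatives. The pattern, the group names, and the `split_keeping_urls` helper below are all hypothetical, and the URL pattern is deliberately simplistic:

```python
import re

# Hypothetical token patterns. The URL alternative comes first so that,
# at any position, it is tried before the word/punctuation alternatives.
TOKEN_RE = re.compile(
    r"(?P<url>https?://[^\s<>\[\]\"]+)"   # simplistic http(s) URL
    r"|(?P<word>\w+)"                     # run of word characters
    r"|(?P<whitespace>\s+)"               # run of whitespace
    r"|(?P<other>.)",                     # any other single character
    re.DOTALL,
)


def split_keeping_urls(text):
    """Split text into tokens, keeping each matched URL as one token."""
    return [m.group(0) for m in TOKEN_RE.finditer(text)]


print(split_keeping_urls('Something blah blah http://foobar.com'))
# ['Something', ' ', 'blah', ' ', 'blah', ' ', 'http://foobar.com']
```

Because every character falls into one of the alternatives, joining the tokens reproduces the input, which matters for diff-based persistence tracking.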

halfak commented 9 years ago

@legoktm do you know if there is a MediaWiki URL regex we can use?

legoktm commented 9 years ago

Looking through Parser::replaceExternalLinks(), it appears to use:

> var_dump($wgParser->mExtLinkBracketedRegex);
string(342) "/\[(((?i)bitcoin\:|ftp\:\/\/|ftps\:\/\/|geo\:|git\:\/\/|gopher\:\/\/|http\:\/\/|https\:\/\/|irc\:\/\/|ircs\:\/\/|magnet\:|mailto\:|mms\:\/\/|news\:|nntp\:\/\/|redis\:\/\/|sftp\:\/\/|sip\:|sips\:|sms\:|ssh\:\/\/|svn\:\/\/|tel\:|telnet\:\/\/|urn\:|worldwind\:\/\/|xmpp\:|\/\/)[^][<>"\x00-\x20\x7F\p{Zs}]+)\p{Zs}*([^\]\x00-\x08\x0a-\x1F]*?)\]/Su"
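
For use on the Python side, the protocol list in that pattern could be carried over roughly as follows. This is only an approximation under stated assumptions: Python's `re` module does not support `\p{Zs}`, so `\s` stands in for it here, and the sketch matches bare URLs rather than the full bracketed link syntax. The names `PROTOCOLS` and `URL_RE` are made up for this example.

```python
import re

# Protocol prefixes copied from the MediaWiki pattern above.
PROTOCOLS = [
    "bitcoin:", "ftp://", "ftps://", "geo:", "git://", "gopher://",
    "http://", "https://", "irc://", "ircs://", "magnet:", "mailto:",
    "mms://", "news:", "nntp://", "redis://", "sftp://", "sip:", "sips:",
    "sms:", "ssh://", "svn://", "tel:", "telnet://", "urn:",
    "worldwind://", "xmpp:", "//",
]

# A protocol prefix followed by one or more characters that are not
# brackets, angle brackets, double quotes, control characters or whitespace.
# Assumption: \s approximates PCRE's \p{Zs}, which `re` lacks.
URL_RE = re.compile(
    r"(?:" + "|".join(re.escape(p) for p in PROTOCOLS) + r")"
    r"[^\]\[<>\"\x00-\x20\x7F\s]+",
    re.IGNORECASE,
)

print(URL_RE.findall('See [http://foobar.com Foo] and mailto:a@b.org'))
# ['http://foobar.com', 'mailto:a@b.org']
```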

halfak commented 9 years ago

I've added the URL symbol to the wikitext split lexicon in deltas. See https://github.com/halfak/Deltas/commit/40d984d2bcceb5fc4f36b42c350c07810fe1971b

I'll need to do a follow-up change here to pull in wikitext_split from deltas.