soerenmeier / parse-wiki-text-2

MIT No Attribution

Parser seems to get stuck in an infinite loop (or take unreasonably long) for certain random sets of Wikitext #3

Closed emmalexandria closed 7 months ago

emmalexandria commented 7 months ago

I'm currently writing an application to parse a Wikipedia dump, and so far I've identified a set of articles that cause the parser to either take an unreasonably long time or get stuck in an infinite loop. I don't know enough about this parser or Wikitext to theorise about the cause, but I'm hoping that a set of articles with the same issue will allow a commonality to be found. The articles I've found so far are:

The configuration I'm using is the following (generated with this, but the issue was present with the default config too):

ConfigurationSource {
    category_namespaces: &["category"],
    extension_tags: &["categorytree", "ce", "charinsert", "chem", "gallery", "graph", "hiero", "imagemap", "indicator", "inputbox", "langconvert", "mapframe", "maplink", "math", "nowiki", "phonos", "poem", "pre", "ref", "references", "score", "section", "source", "syntaxhighlight", "templatedata", "templatestyles", "timeline"],
    file_namespaces: &["file", "image"],
    link_trail: "abcdefghijklmnopqrstuvwxyz",
    magic_words: &["archivedtalk", "disambig", "expected_unconnected_page", "expectunusedcategory", "forcetoc", "hiddencat", "index", "newsectionlink", "nocc", "nocontentconvert", "noeditsection", "nogallery", "noglobal", "noindex", "nonewsectionlink", "notalk", "notc", "notitleconvert", "notoc", "staticredirect", "toc"],
    protocols: &["//", "bitcoin:", "ftp://", "ftps://", "geo:", "git://", "gopher://", "http://", "https://", "irc://", "ircs://", "magnet:", "mailto:", "matrix:", "mms://", "news:", "nntp://", "redis://", "sftp://", "sip:", "sips:", "sms:", "ssh://", "svn://", "tel:", "telnet://", "urn:", "worldwind://", "xmpp:"],
    redirect_magic_words: &["redirect"],
}
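
For anyone trying to narrow down which articles trigger the slow path, one option is to run each parse on its own thread with a time limit and record the titles that exceed it. The following is only a minimal sketch using the standard library. `run_with_timeout`, `slow_articles`, and `parse_article` are hypothetical names, and `parse_article` is a stand-in that would need to be replaced with the real parser call. A timed-out worker thread is not killed, so a genuinely infinite loop keeps consuming a core until the process exits.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `work` on a separate thread and waits at most `limit` for its result.
/// Returns `None` if the limit elapses or the worker panics.
/// The worker is detached, not cancelled: it keeps running after a timeout.
fn run_with_timeout<T, F>(limit: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // A send error only means the receiver already gave up waiting.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).ok()
}

/// Hypothetical stand-in for the real parser call (assumption, not the crate's API).
fn parse_article(wikitext: &str) -> usize {
    wikitext.len()
}

/// Returns the titles of all articles whose parse did not finish within `limit`.
fn slow_articles(articles: Vec<(String, String)>, limit: Duration) -> Vec<String> {
    let mut slow = Vec::new();
    for (title, text) in articles {
        // Each article text is moved into its own worker thread.
        if run_with_timeout(limit, move || parse_article(&text)).is_none() {
            slow.push(title);
        }
    }
    slow
}
```

Calling `slow_articles` with `(title, wikitext)` pairs and a limit of a few seconds would give a list of candidate articles to attach to a report like this one.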