tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Can't parse link nodes that contain a newline character #266

Closed xxyzz closed 3 months ago

xxyzz commented 3 months ago

Page: https://en.wiktionary.org/wiki/forswat

Simplified Wikitext: [[a|b\nc]]

Error message: https://kaikki.org/dictionary/All%20languages%20combined/errors/details--2--is-an-alias-of--year---cannot-spec-Q6~yILHj.html

The link regex at https://github.com/tatuylonen/wikitextprocessor/blob/cdd76b208685d2e040e03a95a9ecde8e89390c68/src/wikitextprocessor/core.py#L160 can't match the [[a|b\nc]] link. @kristian-clausal, could you please take a look at the regex? I don't dare change this pattern...

kristian-clausal commented 3 months ago

I'll take a look at it. Regex... :sob:

xxyzz commented 3 months ago

The interesting part is that when I put this regex pattern into https://regex101.com, it matches [[a|b\nc]], but it doesn't match in Python code. Not sure what's going on...

kristian-clausal commented 3 months ago

Can you paste what you tested on regex101? There are a couple of ? that need to be removed after the MAGICAL characters.

I tested some variations on the link syntax in the Wiktionary sandbox:

[[test|testing this 1 ok]]

[[test|testing
this 2 ok]]

[[test|

testing

this 3 ok

]]

[[test
|testing this4 fails
]]

[[
test|
testing this5 fails
]]

You can't have newlines in the [[...name....| part of the url, but otherwise you can seemingly have as many newlines as you like in the text portion.
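As a rough model of the sandbox results above, here is a minimal Python sketch. It assumes nothing beyond those five cases; it is not how MediaWiki or wikitextprocessor actually parses links, and the helper name `link_renders` is hypothetical.

```python
def link_renders(wikitext: str) -> bool:
    """Toy model: a [[target|text]] link works only if the target has no newline."""
    if not (wikitext.startswith("[[") and wikitext.endswith("]]")):
        return False
    inner = wikitext[2:-2]
    # Everything before the first | is the target; the rest is the display text.
    target, _sep, _text = inner.partition("|")
    return bool(target) and "\n" not in target


# The five sandbox variations, paired with what was observed on Wiktionary.
SANDBOX_CASES = [
    ("[[test|testing this 1 ok]]", True),
    ("[[test|testing\nthis 2 ok]]", True),
    ("[[test|\n\ntesting\n\nthis 3 ok\n\n]]", True),
    ("[[test\n|testing this4 fails\n]]", False),
    ("[[\ntest|\ntesting this5 fails\n]]", False),
]

for text, observed in SANDBOX_CASES:
    print(repr(text), link_renders(text) == observed)
```

The model only encodes the one observed rule (no newline before the first |); anything else about link syntax is out of scope here.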

kristian-clausal commented 3 months ago

I just took a deeper look at the regex and remembered that I wrote this horrible, horrible thing... Oh no.

xxyzz commented 3 months ago

I removed the nowiki magic number from the pattern; it doesn't affect the result. Here is the pattern I tested on regex101: \[\[(((?!\]\])[^[\n])*(?!\[[\n]+\])((?!\[\[)[^]\n])+)\]\]. It's basically the same pattern as in our code. It also works with the PHP flavor, but doesn't match when using the Python re library.

kristian-clausal commented 3 months ago

I get this result (same with the Python option):

[Image: Screenshot at 2024-04-10 09-03-44]

kristian-clausal commented 3 months ago

I think I have another, separate error in the regex:

    + r"((?!\]\])[^[\n])*(?!\[[\n]+\])((?!\[\[)[^]\n])+"
    #   ( no ]] ) ( no [ ) ( no [...] )( no [[) (no ])

should probably have been

    + r"((?!\]\])[^[\n])*(?!\[[^\n]+\])((?!\[\[)[^]\n])+"
    #   ( no ]] ) ( no [ ) ( no [...] )( no [[) (no ])

But this is unrelated to the current problem...
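To illustrate the difference between the two lookaheads in isolation, here is a small sketch. The reduced patterns below are hypothetical: they keep only the lookahead in question, followed by a literal [, and are not the patterns used in core.py.

```python
import re

# Hypothetical reduced patterns: just the lookahead, then a literal "[".
OLD_LOOKAHEAD = re.compile(r"(?!\[[\n]+\])\[")   # rejects "[" + newlines + "]"
NEW_LOOKAHEAD = re.compile(r"(?!\[[^\n]+\])\[")  # rejects "[...]" on a single line

bracketed = "[abc]"

# [\n]+ needs at least one newline after "[", so the old lookahead
# does not fire on "[abc]" and the "[" is accepted.
old_accepts = OLD_LOOKAHEAD.match(bracketed) is not None

# [^\n]+ matches "abc", so the new lookahead fires and the "[" is rejected.
new_accepts = NEW_LOOKAHEAD.match(bracketed) is not None

print(old_accepts, new_accepts)
```

With the original character class the "no [...]" check only triggers on brackets that contain nothing but newlines, which is presumably not what the comment under the pattern intended.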

xxyzz commented 3 months ago

I used "[[a|b\nc]]" as the test text on regex101. I guess the test string on regex101 doesn't turn "\n" into a newline character...

Sorry for the distraction... I thought maybe the pattern worked in general and only failed in Python's re library.
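The mix-up can be reproduced in Python with the pattern quoted earlier in this thread. This is only a sketch, and it assumes that the web form treated the typed \n as two literal characters (a backslash and an n) rather than as a newline:

```python
import re

# Pattern as pasted earlier in this thread (nowiki magic character removed).
PATTERN = re.compile(r"\[\[(((?!\]\])[^[\n])*(?!\[[\n]+\])((?!\[\[)[^]\n])+)\]\]")

typed_literally = r"[[a|b\nc]]"  # backslash + "n", 10 characters, no newline
real_newline = "[[a|b\nc]]"      # an actual newline character, 9 characters

literal_matches = PATTERN.search(typed_literally) is not None
newline_matches = PATTERN.search(real_newline) is not None

print(len(typed_literally), literal_matches)
print(len(real_newline), newline_matches)
```

The literal backslash-n text matches because neither character is excluded by the pattern; the string with a real newline does not, which is the behaviour reported in this issue.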

kristian-clausal commented 3 months ago

I think I've got something...

(?<!\[)    # negative lookbehind, [[[ breaks the link completely, the whole thing is not parsed as a link or url
\[\[      # start brackets
(
  (
    (?!\]\])    # negative lookahead, no ]] allowed
    [^[\n]
  )*    # no [ or newlines allowed
  (
     (?!\[\[)   # no [[ allowed
     [^]\n]    # no ] or newlines
  )+
)
(\|  # after a |, newlines are allowed, the below is the same as above
  (((?!\]\])[^[])*((?!\[\[)[^]])+)
)?
\]\]
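A quick way to try this candidate against the sandbox variations from earlier in the thread is to compile it with re.VERBOSE. This is a sketch only: the name `LINK_RE` is made up for the example, the comments are shortened, and it checks only whether the pattern matches, not how the groups split the target from the text.

```python
import re

# Candidate pattern from the comment above, compiled in verbose mode.
LINK_RE = re.compile(
    r"""
    (?<!\[)                 # not preceded by another [
    \[\[                    # opening brackets
    (
      ((?!\]\])[^[\n])*     # target: no ]], no [, no newline
      ((?!\[\[)[^]\n])+     # target: no [[, no ], no newline
    )
    (\|                     # after a |, newlines are allowed
      (((?!\]\])[^[])*((?!\[\[)[^]])+)
    )?
    \]\]                    # closing brackets
    """,
    re.VERBOSE,
)

# Sandbox variations paired with what was observed on Wiktionary.
SANDBOX_CASES = [
    ("[[test|testing this 1 ok]]", True),
    ("[[test|testing\nthis 2 ok]]", True),
    ("[[test|\n\ntesting\n\nthis 3 ok\n\n]]", True),
    ("[[test\n|testing this4 fails\n]]", False),
    ("[[\ntest|\ntesting this5 fails\n]]", False),
    ("[[a|b\nc]]", True),  # the simplified wikitext from this issue
]

results = [(text, LINK_RE.search(text) is not None) for text, _ in SANDBOX_CASES]
for (text, matched), (_, expected) in zip(results, SANDBOX_CASES):
    print(repr(text), matched == expected)
```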