gnosygnu / xowa

xowa offline wiki application

Parser: Handle invalid wikitext such as `<xml>[[</xml>]]` (possible parsing problem) #740

Open desb42 opened 4 years ago

desb42 commented 4 years ago

Looking at de.wikisource.org/wiki/Vorlage:Hauptseite_Box_Aktuell (which is part of the Main page in dewikisource; data from 2020-05-01) gives the rendering in screenshot mainpage1. Note the double square brackets [[ .. ]]

This ought to have been processed. In case this changes, the actual wikitext is:

{{BRU|Keynote address by Sue Gardner, Wikimania 2013 2.JPG|center|400|<big>'''Herzlichen Glückwunsch zur Fertigstellung der [[Allgemeine Deutsche Biographie|ADB!'''</big>]]||center}}

This is actually syntactically incorrect. If it is 'corrected' by moving the '''</big> to the other side of the ]], the anticipated behaviour occurs (screenshot mainpage2).

How slavish should xowa be to incorrect wikitext? My inclination is to go edit the wikitext on the mediawiki site.
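For reference, the manual correction described above can be sketched in a few lines of Python. This is only an illustration: the function name `move_close_outside_link` and the regular expression are hypothetical, and nothing here is taken from XOWA or MediaWiki.

```python
import re

# Closing bold/italic quotes plus a closing tag, immediately followed by "]]".
# Hypothetical pattern written for this one example.
MISPLACED_CLOSE = re.compile(r"('{2,5}</[a-z]+>)\]\]")

def move_close_outside_link(wikitext):
    """Move a `'''</tag>` that sits just before `]]` to just after it."""
    return MISPLACED_CLOSE.sub(r"]]\1", wikitext)

broken = "<big>'''Herzlichen ... [[Allgemeine Deutsche Biographie|ADB!'''</big>]]"
fixed = move_close_outside_link(broken)
```

The pattern is deliberately naive: it would also rewrite correctly nested text such as `[[A|<big>'''B'''</big>]]`, so it is suited to a known-bad string rather than a bulk cleanup.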

gnosygnu commented 4 years ago

Thanks as always for the detail. If it helps any, screenshots aren't necessary as your breakdown is more than helpful (just trying to save you any possible work).

> How slavish should xowa be to incorrect wikitext?

Yeah, XOWA tries to handle incorrect wikitext, but the XOWA parser is brittle, especially around templates, and also around XML nodes. I haven't looked at the code, but in this case I'm guessing XOWA gives priority to closing XML tags (pulling the </big> tag) before trying any corrective action. The XML priority is needed to handle "extension" tags like <ref>, <poem>, etc., which are like their own "mini-DOM".
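To make that guess concrete, here is a toy Python model of a stack-based tokenizer in which a closing XML tag takes priority over anything opened after its matching open tag. It is a sketch under that assumption only: `trace`, `TOKEN_RE`, and the event names are invented for illustration and do not reflect XOWA's actual code.

```python
import re

# Only the tokens relevant to this issue: XML-style tags and link brackets.
TOKEN_RE = re.compile(r"</?[a-z]+>|\[\[|\]\]")

def trace(wikitext):
    """Return the list of parse events for a wikitext string (toy model)."""
    stack = []   # currently open constructs: tag names or "[["
    events = []
    for token in TOKEN_RE.findall(wikitext):
        if token == "[[":
            stack.append(token)
        elif token == "]]":
            if stack and stack[-1] == "[[":
                stack.pop()
                events.append("link")
            else:
                # No link is open any more, so the brackets stay as text.
                events.append("literal ]]")
        elif token.startswith("</"):
            name = token[2:-1]
            if name not in stack:
                events.append("literal " + token)
                continue
            # XML priority: abandon everything opened after the matching tag.
            while stack[-1] != name:
                events.append("abandoned " + stack.pop())
            stack.pop()
            events.append("element " + name)
        else:
            stack.append(token[1:-1])
    # Anything still open at the end was never closed.
    events.extend("abandoned " + item for item in stack)
    return events

bad = trace("<big>[[A|B</big>]]")
good = trace("<big>[[A|B]]</big>")
```

In this model the mis-nested input abandons the open `[[` when `</big>` arrives, and the later `]]` has nothing left to close, which would match the literal double brackets seen in the rendered page. The correctly nested input yields a link inside the element.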

> My inclination is to go edit the wikitext on the mediawiki site

If it's a one-off, then that's probably best. If you're seeing this often (for example, because it's generated by a Template / Module), then I'll look at a longer-term fix.


Also, just sharing some other background

I'm working on version 3 of the XOWA parser

I'm not sure how Version 3 will turn out, as it's ambitious in scope (one-step transpiling of MediaWiki PHP code to Java). I've manually transpiled enough PHP code in Version 2 that I think this is feasible, but it could be a very deep rabbit-hole. I'll know by the end of this month what a possible timeline is.

I'm bringing this up b/c at this point you're actually an (if not "the") authority on all the bugs in the Version 1 parser. Depending on the bug's severity, please feel free to bump / nudge, and I'll prioritize.

My prioritization order has been:

Feel free to add other guidelines above, or just let me know if there's a specific issue that needs fixing.

Thanks!