gnosygnu / xowa

xowa offline wiki application

Broken templates in non-Wikimedia wikis #851

Open vgambier opened 3 years ago

vgambier commented 3 years ago

Hello,

I've noticed that when importing a non-Wikimedia MediaWiki wiki (damn that's a mouthful) using the "Build a wiki while offline" method, some HTML tags and some templates used by the wiki don't seem to work properly, and as a result, there's a lot of broken code visible when viewing the articles. I tested this on both the UESPWiki and the Muppets Fandom/Wikia.

[Screenshot: UESP viewed in XOWA]

[Screenshot: Muppet Wiki viewed in XOWA]

I know this project is on hiatus so I don't expect a fix, but I'm wondering - do you know what is responsible for this? Is it because the database dumps I've used are bad? Or is the issue more on XOWA's side?

Note that I've tried to import the same .xml dumps in similar software: WikiTaxi does the same thing, while BzReader has a similar but not identical problem (I think BzReader is not as "smart" so it ignores a lot of HTML - it may not be a great comparison but I'll include it anyway). So I'm guessing the issue runs a little deeper than I thought.

[Screenshot: UESP viewed in WikiTaxi]

[Screenshot: UESP viewed in BzReader]

I've also noticed the following: if I download Simple Wikipedia using the "Build a wiki while online" method, the articles display just fine. And then, if I import the .xml.bz2 dump (the one from wiki\#dump\done), it works fine in XOWA but breaks in WikiTaxi and BzReader - so it seems it's neither the dump's fault nor XOWA's fault, but a combination of both in some circumstances?

[Screenshot: Re-imported Simple Wikipedia viewed in XOWA - no issues]

[Screenshot: Re-imported Simple Wikipedia viewed in WikiTaxi]

[Screenshot: Re-imported Simple Wikipedia viewed in BzReader]

Possible mentions of this issue online I've found:

I'm potentially interested in helping solve this issue with a pull request, but I have no idea where to start. I'm not even sure if this is unintended behavior. Please do let me know!

desb42 commented 3 years ago

Hi,

It's fascinating to see how XOWA is used. My own use is strictly with the Wikimedia wikis.

However, I have been looking at the source to Xowa for some time.

The main thing that jumps out at me is the fact that there are many 'non-standard' Parser Extension Tags and Parser Function Hooks - try looking at the bottom of https://en.uesp.net/wiki/Special:Version and https://en.wikipedia.org/wiki/Special:Version

XOWA, in essence, hard-codes these tags and hooks: each additional tag or hook would require more Java code (based on what the tag or hook is intended to do).

I am unfamiliar with the other emulators you mention, but I would guess that their inability to display the pages correctly is related to the same problem.

BTW - I can see how to get hold of a dump of Muppet Wiki (via https://muppet.fandom.com/wiki/Special:Statistics) but I do not see how to do the same for UESPWiki

vgambier commented 3 years ago

Yes, I was interested in getting an offline copy of the UESPWiki! After noticing the issue, I wanted to test it using a Wikia/Fandom wiki as I noticed the XOWA documentation mentions those - I figured there was a chance they had better support.

You can find the UESP dumps here (I found it via this page). I used the latest one, uespwiki-2021-04-21-current.xml.bz2

Oh, I see! That would explain it. So for instance, at the top of the Morrowind:Morrowind article, there is the {{protection|semi|move}} template. And this template uses the tag <cleanspace>, which is used by the UESPWiki and not Wikipedia according to the Special:Version pages. Interesting.

I wonder, could a quick workaround be to ignore unknown tags?

desb42 commented 3 years ago

Another crude hack would be to edit the templates and see whether removing these tags produces viewable results.

vgambier commented 3 years ago

Ok, I just tried to manually edit some templates.

My goal was to modify Template:Protection and Template:Gameinfo so that the top of the Morrowind:Morrowind article looked clean. Unfortunately, since templates use other templates, this took quite some time. All in all, I modified the following pages:

...by removing tags and hooks which included cleantable, cleanspace, define, preview, splitargs, etc.

Frustratingly, I had to replace one instance of cleantable with a linebreak (rather than just deleting it). Otherwise, the infobox would not show up as a table (you would just see the raw table code). I'm not entirely sure why and it took some time to track down. The cleantable tag in question separated the end of a table and a closing div tag:

|}</cleantable></div>

This makes me worry that an automatic deletion of all unimplemented tags/hooks would not necessarily produce correct results.
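
To make that concern concrete, here is a minimal Java sketch of the "ignore unknown tags" workaround. It is not XOWA code: the class name `UnknownTagStripper`, the method `strip`, and the short tag list are assumptions made for illustration, and only the tags named in this thread are included. The `replacement` parameter lets you compare plain deletion against substituting a linebreak, which is the difference that mattered for the `|}</cleantable></div>` case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of XOWA: removes opening and closing
// occurrences of extension tags the parser does not implement.
final class UnknownTagStripper {
    // Assumed tag list; only the two tags explicitly named in this thread.
    private static final List<String> UNKNOWN_TAGS = Arrays.asList("cleantable", "cleanspace");

    // Matches <tag ...> and </tag> for any name in UNKNOWN_TAGS.
    private static final Pattern TAG = Pattern.compile(
            "</?(?:" + String.join("|", UNKNOWN_TAGS) + ")\\b[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Replaces every matched tag with the given text ("" to delete, "\n" for a linebreak).
    static String strip(String wikitext, String replacement) {
        return TAG.matcher(wikitext).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String src = "|}</cleantable></div>";
        System.out.println(strip(src, ""));   // table end and </div> end up on one line
        System.out.println(strip(src, "\n")); // table end stays on its own line
    }
}
```

Whether deletion or a linebreak is the right substitute probably depends on what each tag does on the original wiki, so a blanket rule like this would still need checking page by page.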

Anyway, I eventually got it working:

[Screenshot: morrowind]

The one remaining problem is that optional arguments which should not produce lines still do: as you can see, release dates 5 through 9 are mistakenly included. I haven't looked into it. It looks as if the if hook doesn't work, but it's supposed to work on Wikipedia, so I don't know.

vgambier commented 3 years ago

Oh, also, because of the way templates work on MediaWiki/XOWA, if you modify a template and go to a page that uses the template, you won't see the changes reflected immediately. To "refresh" the page, you can just edit it. This is why there's a weird number (344413) in the opening paragraph: I just made an arbitrary edit whenever I wanted to see if my template edits worked. Adding a linebreak at the very end of the article works too, and is automatically ignored, so you can do that if you want to refresh the article without modifying it.

vgambier commented 3 years ago

> Xowa, in essence, hard codes these Tags and Hooks - each additional Tag or Hook would require some more java code (based on what the intent of these Tags/Hooks desired)

Could you tell me where in the code this is located? A quick search shows most (but not all - noexternallanglinks is absent, for one) function hooks are present in a big switch case in XomwParser.java, but the code is all commented out, so that can't be it.

desb42 commented 3 years ago

The files under 400_xowa\src\gplx\xowa\mediawiki are part of a rework of the parser by @gnosygnu. That is not the active code.

noexternallanglinks is 'declared' in 400_xowa\src\gplx\xowa\langs\Xol_langitm.java, both as an internal id and as text (for matching). The inverse lookup (given an id, find the name) is defined in 400_xowa\src\gplx\xowa\langs\kwds\Xol_kwdgrp.java.

Once the id is defined, 400_xowa\src\gplx\xowa\xtns\pfuncs\Pffunc.java adds the keyword for the parser, along with the action to take when this keyword is correctly identified.

In this case the class Wdata_pf_noExternalLangLinks is instantiated.

At 'evaluation' time, the overridden function Func_evaluate is called. The code in this function then takes the action appropriate for this keyword.

I hope this makes some sense.
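
As a rough illustration of that flow (declare a keyword, attach a handler, call the handler at evaluation time), here is a minimal Java sketch. None of these names come from XOWA: `KeywordRegistrySketch`, `Handler`, `register`, and `evaluate` are hypothetical stand-ins, and the real classes are certainly more involved. The sketch also assumes that an unregistered keyword is passed through as raw text, which would be consistent with the broken markup visible on non-Wikimedia wikis.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a hard-coded keyword-to-handler table; not XOWA's real API.
final class KeywordRegistrySketch {
    // Stand-in for a parser-function class whose evaluate method is overridden per keyword.
    interface Handler {
        String evaluate(String[] args);
    }

    private final Map<String, Handler> handlersByKeyword = new HashMap<>();

    // Declares a keyword and the action to run when the parser identifies it.
    void register(String keyword, Handler handler) {
        handlersByKeyword.put(keyword, handler);
    }

    // Runs the handler for a known keyword; assumed fallback is to emit the raw text unchanged.
    String evaluate(String keyword, String[] args, String rawText) {
        Handler handler = handlersByKeyword.get(keyword);
        return handler == null ? rawText : handler.evaluate(args);
    }

    public static void main(String[] args) {
        KeywordRegistrySketch registry = new KeywordRegistrySketch();
        // A "mostly empty" implementation: the keyword is recognized and produces no output.
        registry.register("noexternallanglinks", a -> "");
        System.out.println("[" + registry.evaluate("noexternallanglinks", new String[0], "{{noexternallanglinks}}") + "]");
        System.out.println("[" + registry.evaluate("someunknownhook", new String[0], "{{#someunknownhook:x}}") + "]");
    }
}
```

Under this model, supporting a wiki-specific hook means adding one more `register` call plus a handler, which is why each new tag or hook needs its own Java code.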

desb42 commented 3 years ago

> Frustratingly, I had to replace one instance of cleantable with a linebreak (rather than just deleting it). Otherwise, the infobox would not show up as a table (you would just see the raw table code). I'm not entirely sure why and it took some time to track down. The cleantable tag in question separated the end of a table and a closing div tag:

The current parser identifies a table start by looking for three characters, '\n{|' (linefeed, open curly brace, pipe). There is a special case for a table at the start of a page.

I suspect this is what you tripped up on.
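
The rule can be sketched in a few lines of Java. This is an illustration only, not the parser's actual code: `TableStartSketch` and `isTableStart` are hypothetical names, and the start-of-page special case is assumed here to mean "the `{|` sits at index 0".

```java
// Hypothetical sketch of the table-start rule described above; not XOWA's real parser code.
final class TableStartSketch {
    // True if "{|" at index pos is preceded by a linefeed, or sits at the very start of the text.
    static boolean isTableStart(String src, int pos) {
        if (!src.startsWith("{|", pos)) {
            return false;
        }
        return pos == 0 || src.charAt(pos - 1) == '\n';
    }

    public static void main(String[] args) {
        System.out.println(isTableStart("{| class=\"wikitable\"", 0)); // start of page
        System.out.println(isTableStart("intro\n{|", 6));              // preceded by a linefeed
        System.out.println(isTableStart("<div>{|", 5));                // preceded by a tag, not a linefeed
    }
}
```

If removing a tag leaves table markup directly against other text with no linefeed in between, a line-based check like this no longer recognizes it, which fits the need to substitute a linebreak rather than delete the tag outright.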

vgambier commented 3 years ago

Thank you for your help! I'll try to see if I can create new keywords. I'm not sure how I'll go about finding the right implementation, but I hope I'll figure it out. At worst, a mostly empty implementation could produce OK results, as we've seen earlier.