vgambier opened this issue 3 years ago
Hi,
It's fascinating to see how XOWA is used. My own use is strictly with the mediawikis.
However, I have been looking at the source to Xowa for some time.
The main thing that jumps out at me is the fact that there are many 'non-standard' Parser Extension Tags and Parser Function Hooks - try looking at the bottom of https://en.uesp.net/wiki/Special:Version and https://en.wikipedia.org/wiki/Special:Version
Xowa, in essence, hard-codes these Tags and Hooks - each additional Tag or Hook would require some more Java code (depending on what that Tag/Hook is intended to do)
I am unfamiliar with the other emulators you mention, but I would guess that their inability to display the pages correctly is related to the same problem.
BTW - I can see how to get hold of a dump of Muppet Wiki (via https://muppet.fandom.com/wiki/Special:Statistics) but I do not see how to do the same for UESPWiki
Yes, I was interested in getting an offline copy of the UESPWiki! After noticing the issue, I wanted to test it using a Wikia/Fandom wiki as I noticed the XOWA documentation mentions those - I figured there was a chance they had better support.
You can find the UESP dumps here (I found it via this page). I used the latest one, uespwiki-2021-04-21-current.xml.bz2
Oh, I see! That would explain it. So for instance, at the top of the Morrowind:Morrowind article, there is the {{protection|semi|move}} template. And this template uses the <cleanspace> tag, which is used by the UESPWiki but not Wikipedia, according to the Special:Version pages. Interesting.
I wonder, could a quick workaround be to ignore unknown tags?
Another crude hack would be to edit the Templates and see if removing these tags produces viewable results.
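As a rough illustration of what such a preprocessing pass might look like - stripping any extension tag the parser does not recognize before parsing. This is a hypothetical sketch, not XOWA code; the `KNOWN` set here is just an illustrative subset:

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagStripper {
    // Illustrative subset of tags the parser is assumed to implement.
    static final Set<String> KNOWN = Set.of("ref", "nowiki", "gallery");
    // Matches opening, closing, and self-closing XML-style tags.
    static final Pattern TAG = Pattern.compile("</?([A-Za-z]+)[^>]*>");

    static String stripUnknown(String wikitext) {
        Matcher m = TAG.matcher(wikitext);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Keep known tags verbatim; drop unknown ones entirely.
            String replacement = KNOWN.contains(m.group(1).toLowerCase())
                ? Matcher.quoteReplacement(m.group()) : "";
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripUnknown("a<cleanspace>b</cleanspace>c"));  // abc
        System.out.println(stripUnknown("x<ref>y</ref>z"));                // x<ref>y</ref>z
    }
}
```

As discussed further down, blindly deleting a tag is not always safe - sometimes the deletion has to leave a linebreak behind for the surrounding wikitext to parse correctly.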
Ok, I just tried to manually edit some templates.
My goal was to modify Template:Protection and Template:Gameinfo so that the top of the Morrowind:Morrowind article looked clean. Unfortunately, since templates use other templates, this took quite some time. All in all, I modified the following pages:
...by removing tags and hooks, including cleantable, cleanspace, define, preview, splitargs, etc.
Frustratingly, I had to replace one instance of cleantable with a linebreak (rather than just deleting it). Otherwise, the infobox would not show up as a table (you would just see the raw table code). I'm not entirely sure why, and it took some time to track down. The cleantable tag in question separated the end of a table and a closing div tag:
|}</cleantable></div>
This makes me worry that an automatic deletion of all unimplemented tags/hooks would not necessarily produce correct results.
Anyway, I eventually got it working:
Except for the optional arguments, which should not produce lines - as you can see, release dates 5 through 9 are mistakenly included. I haven't looked into it. It looks as if the if hook doesn't work, but it's supposed to work on Wikipedia, so I don't know.
Oh, also, because of the way templates work on MediaWiki/XOWA, if you modify a template and go to a page that uses the template, you won't see the reflected changes immediately. To "refresh" the page, you can just edit it. This is why there's a weird number (344413) in the opening paragraph: I just made an arbitrary edit whenever I wanted to see if my template edits worked. Adding a linebreak at the very end of the article works too, and is automatically ignored, so you can do that if you want to refresh the article without modifying it.
> Xowa, in essence, hard-codes these Tags and Hooks - each additional Tag or Hook would require some more Java code (depending on what that Tag/Hook is intended to do)
Could you tell me where in the code this is located? A quick search shows most (but not all - noexternallanglinks is absent, for one) function hooks are present in a big switch case in XomwParser.java, but the code is all commented out, so that can't be it.
the files under 400_xowa\src\gplx\xowa\mediawiki are part of a rework of the parser by @gnosygnu - that is not the active code
noexternallanglinks is 'declared' in 400_xowa\src\gplx\xowa\langs\Xol_langitm.java as an internal id and as text (for matching)
an inverse (given id find name) is defined in 400_xowa\src\gplx\xowa\langs\kwds\Xol_kwdgrp.java
having defined an id, 400_xowa\src\gplx\xowa\xtns\pfuncs\Pffunc.java adds the keyword for the parser and the action to take place when this keyword is correctly identified
In this case the class Wdata_pf_noExternalLangLinks is instantiated.
At 'evaluation' time the overridden function Func_evaluate is called. The code in this function then takes action appropriate for this keyword.
I hope this makes some sense
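The flow described above - declare an id for the keyword's text, keep an inverse id-to-name lookup, then bind the id to a handler whose evaluate method is invoked when the keyword is matched - could be sketched like this. All class and method names below are hypothetical simplifications, standing in for the real Xol_langitm, Xol_kwdgrp, Pffunc, and Wdata_pf_noExternalLangLinks classes:

```java
import java.util.HashMap;
import java.util.Map;

// Stands in for the overridden Func_evaluate described above.
interface ParserFunction {
    String evaluate(String[] args);
}

class KeywordRegistry {
    private final Map<String, Integer> nameToId = new HashMap<>();  // text -> internal id (Xol_langitm-style)
    private final Map<Integer, String> idToName = new HashMap<>();  // inverse lookup (Xol_kwdgrp-style)
    private final Map<Integer, ParserFunction> handlers = new HashMap<>();  // id -> action (Pffunc-style)

    void register(int id, String name, ParserFunction fn) {
        nameToId.put(name, id);
        idToName.put(id, name);
        handlers.put(id, fn);
    }

    // Called when the parser has matched a keyword in the wikitext.
    String evaluate(String name, String[] args) {
        Integer id = nameToId.get(name);
        if (id == null) return null;  // unknown keyword: nothing registered
        return handlers.get(id).evaluate(args);
    }
}

public class KeywordDemo {
    public static void main(String[] args) {
        KeywordRegistry reg = new KeywordRegistry();
        // A handler that renders to nothing, analogous in spirit to noexternallanglinks.
        reg.register(1, "noexternallanglinks", a -> "");
        System.out.println(reg.evaluate("noexternallanglinks", new String[0]));  // ""
        System.out.println(reg.evaluate("some_unknown_keyword", new String[0])); // null
    }
}
```

This also suggests why a mostly empty handler can still improve output: registering the keyword at all stops the raw wikitext from leaking into the rendered page.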
> Frustratingly, I had to replace one instance of cleantable with a linebreak (rather than just deleting it). Otherwise, the infobox would not show up as a table (you would just see the raw table code). I'm not entirely sure why and it took some time to track down. The cleantable tag in question separated the end of a table and a closing div tag:
The way the current parser works is that, to identify a table start, it looks for THREE characters: '\n{|' (linefeed, open curly, pipe). [There is a special case for a table at the start of a page.]
I suspect this is what you tripped up on
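That three-character test could be sketched as follows - a hypothetical simplification, not the actual XOWA code:

```java
public class TableScan {
    // A table opens only at "\n{|", i.e. "{|" immediately preceded by a
    // linefeed, with a special case for "{|" at the very start of the page.
    static boolean opensTableAt(String src, int pos) {
        boolean atLineStart = pos == 0 || src.charAt(pos - 1) == '\n';
        return atLineStart && src.startsWith("{|", pos);
    }

    public static void main(String[] args) {
        System.out.println(opensTableAt("{| class=\"wikitable\"", 0)); // true: page-start special case
        System.out.println(opensTableAt("text\n{| ...", 5));           // true: preceded by a linefeed
        System.out.println(opensTableAt("text {| ...", 5));            // false: no linefeed before it
    }
}
```

Under this rule, deleting an inline tag can fuse the surrounding wikitext onto one line and remove the linefeed the parser needs, which would explain why replacing cleantable with a linebreak worked where plain deletion did not.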
Thank you for your help! I'll try to see if I can create new keywords. I'm not sure how I'll go about finding the right implementation, but I hope I'll figure it out. At worst, a mostly empty implementation could produce ok results, as we've seen earlier.
Hello,
I've noticed that when importing a non-Wikimedia MediaWiki wiki (damn that's a mouthful) using the "Build a wiki while offline" method, some HTML tags and some templates used by the wiki don't seem to work properly, and as a result, there's a lot of broken code visible when viewing the articles. I tested this on both the UESPWiki and the Muppets Fandom/Wikia.
UESP viewed in XOWA
Muppet Wiki viewed in XOWA
I know this project is on hiatus so I don't expect a fix, but I'm wondering - do you know what is responsible for this? Is it because the database dumps I've used are bad? Or is the issue more on XOWA's side?
Note that I've tried to import the same .xml dumps in similar software: WikiTaxi does the same thing, while BzReader has a similar but not identical problem (I think BzReader is not as "smart" so it ignores a lot of HTML - it may not be a great comparison but I'll include it anyway). So I'm guessing the issue runs a little deeper than I thought.
UESP viewed in WikiTaxi
UESP viewed in BzReader
I've also noticed the following: if I download Simple Wikipedia using the "Build a wiki while online" method, the articles display just fine. And then, if I import the .xml.bz2 dump (the one from wiki\#dump\done), it works fine in XOWA but breaks in WikiTaxi and BzReader - so it seems it's neither the dump's fault nor XOWA's fault, but a combination of both in some circumstances?
Re-imported Simple Wikipedia viewed in XOWA - no issues
Re-imported Simple Wikipedia viewed in WikiTaxi
Re-imported Simple Wikipedia viewed in BzReader
Possible mentions of this issue I've found online:
I'm potentially interested in helping solve this issue with a pull request, but I have no idea where to start. I'm not even sure if this is unintended behavior. Please do let me know!