desb42 opened this issue 5 years ago
Ugh... This sounds like a parser issue. Is there an existing page where you're seeing this error? (I just want to get an idea of how widespread this issue is)
As for the actual fix, I'll have to look at the nowiki implementation. This is a particularly complicated piece of code that I wrote early in the XOWA parser implementation. It's possible that either my impersonation wasn't good enough, or MediaWiki changed something recently.
I'll look again at the code later this week, but depending on how widespread the above is, this may be my highest priority.
Thanks!
I stumbled across it when looking at Template:Reflist/doc - that is, I found it when looking at the documentation for Template:Reflist. I cannot tell how widespread it is; however, I suspect it is an edge case
I have just scanned all the enwiki html databases (18 of them) and the only one with \<nowiki> in it seems to be 1965–66_TSV_1860_Munich_season
Thanks for the follow-up.
I found the issue. It's related to the <tag> function. The simplified example wikitext would be the following:
{{#tag:pre|<nowiki>A<b>B</b></nowiki>}}
... which outputs nowiki tags
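For reference, MediaWiki's expected output here is the inner text with its markup HTML-escaped and no literal nowiki tags in sight. A toy sketch of that escaping step (illustrative Python, not XOWA's actual code):

```python
import html
import re

def render_nowiki(wikitext: str) -> str:
    """Replace each <nowiki>...</nowiki> span with its HTML-escaped
    contents, so the inner markup is shown literally but the nowiki
    tags themselves never reach the output."""
    def escape_span(match: re.Match) -> str:
        return html.escape(match.group(1), quote=False)
    return re.sub(r"<nowiki>(.*?)</nowiki>", escape_span, wikitext,
                  flags=re.DOTALL)

# Inner markup is escaped; the tags are gone.
print(render_nowiki("<nowiki>A<b>B</b></nowiki>"))  # A&lt;b&gt;B&lt;/b&gt;
```

The bug described above is that the literal `<nowiki>` tags survive into the HTML instead of being consumed at this stage.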
This behavior is caused by the tag function wrapping the original contents in a UNIQ block and unwrapping it later. I'll have to look at the MediaWiki code later to see what the proper fix is. A sloppy proof-of-concept hack would be to make the following change to https://github.com/gnosygnu/xowa/blob/master/400_xowa/src/gplx/xowa/xtns/pfuncs/strings/Pfunc_tag.java#L47
if (args_len > 0) { // handle no args; EX: "{{#tag:ref}}" -> "<ref></ref>"
    byte[] temp = Pf_func_.Eval_arg_or_empty(ctx, src, caller, self, args_len, 0);
    temp = ctx.Wiki().Parser_mgr().Main().Parse_text_to_html(Xop_ctx.New__sub__reuse_page(ctx), temp);
    tmp_bfr.Add(temp);
}
However, this won't work on a permanent basis because the Main() parser should not be invoked in nested calls
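For context, MediaWiki's own parser handles extension tags with strip markers: tag bodies are swapped out for unique `UNIQ...QINU` placeholders before the main parse and swapped back in afterwards, so nested parsing never touches them. A toy sketch of the mechanism (the marker format here is illustrative, not MediaWiki's exact one):

```python
import re

def strip_nowiki(src: str, store: dict) -> str:
    """Replace each <nowiki> span with a UNIQ marker, stashing the body
    so the main parse never sees it."""
    def stash(m: re.Match) -> str:
        key = f"\x7fUNIQ-nowiki-{len(store):08d}-QINU\x7f"
        store[key] = m.group(1)
        return key
    return re.sub(r"<nowiki>(.*?)</nowiki>", stash, src, flags=re.DOTALL)

def unstrip(src: str, store: dict) -> str:
    """Restore stashed bodies after the main parse has run."""
    for key, body in store.items():
        src = src.replace(key, body)
    return src

store = {}
marked = strip_nowiki("{{#tag:pre|<nowiki>==not a heading==</nowiki>}}", store)
# ... the main parser runs over `marked` and never sees the == tokens ...
restored = unstrip(marked, store)
```

The double-wrap failure described above would correspond to unstripping at the wrong stage, so the markers get re-wrapped and the literal tags leak into the output.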
I'll comment again here when I have a more robust fix.
On another note, how do you scan the html databases? I assume you have some ad hoc code that un-hzips each html page and then scans the full text? If so, how long does that take? I'd imagine it would take at least 2+ hours per scan (unless you're saving the un-hzipped content as files somewhere)
To scan the html, I have a simple python script that does essentially as you describe
see the gist checkhtml.py
On the machine I use, this takes about 30 mins. This produces 6059 entries
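For anyone without the gist at hand, a scan of that sort might look roughly like this (the table and column names are guesses for illustration, not XOWA's actual schema):

```python
import sqlite3

SEARCH = b"<nowiki>"

def scan_db(path: str) -> list:
    """Return page ids whose stored HTML still contains a literal
    <nowiki> tag. Table/column names here are illustrative."""
    hits = []
    conn = sqlite3.connect(path)
    try:
        for page_id, body in conn.execute("SELECT page_id, html FROM page"):
            # Leaked tags like these tend to sit in the stored text as-is,
            # so a raw substring scan is enough to find them.
            if isinstance(body, str):
                body = body.encode("utf-8", "replace")
            if SEARCH in body:
                hits.append(page_id)
    finally:
        conn.close()
    return hits

# hits = scan_db("enwiki-html.000.sqlite3")  # filename is a guess
```

A per-database loop over the 18 enwiki files would just call scan_db on each path and print the matches.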
Cool. This should pick up most of the errors, since they aren't hzipped.
I'll give the python script a try when I get home later. It's interesting that your script is relatively concise yet powerful. One day, when I get rid of hzip, it'll be pretty useful in scanning through all the html pages
I have just found an instance of this \<nowiki> problem:
<th scope="row" class="navbox-group" style="background: white;
-moz-box-shadow: inset 2px 2px 0 <nowiki>#F0001C</nowiki>, inset -2px -2px 0 <nowiki>#F0001C</nowiki>;
-webkit-box-shadow: inset 2px 2px 0 <nowiki>#F0001C</nowiki>, inset -2px -2px 0 <nowiki>#F0001C</nowiki>;
box-shadow: inset 2px 2px 0 <nowiki>#F0001C</nowiki>, inset -2px -2px 0 <nowiki>#F0001C</nowiki>;;width:1%">
Note the presence of many \<nowiki> tags
Looking at the wikitext the area under discussion is {{Party of European Socialists}}
This in turn contains three {{Party of European Socialists/meta/color}} entries
And that template contains the \<nowiki> markup
I think it needs a little boost in priority
Thanks for the example. Will take a look at it this weekend, but nowiki debugging always gives me a headache.
And here's another \<nowiki> example
The wikitext of this section is:
== CFI and vandalism ==
Now this is a section CFI could do well without:
<div style="border-left: 1px solid #C00; border-left-width: 3px; padding-left: .5em; margin-left: 2em;">
<nowiki>==Vandalism==</nowiki>
From time to time, various parties will insert material into Wiktionary which clearly has nothing
XOWA is treating the ==Vandalism== as a header; MediaWiki treats it just as text
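A toy sketch of why pass order matters here: if the nowiki tags are unwrapped before heading detection, the bare `==...==` line gets promoted to a heading, but if the body is hidden behind a placeholder first, it survives as plain text (illustrative Python, not either parser's real code):

```python
import re

NOWIKI = re.compile(r"<nowiki>(.*?)</nowiki>", re.DOTALL)
HEADING = re.compile(r"^==\s*(.+?)\s*==$", re.MULTILINE)

def parse(src: str, unwrap_nowiki_early: bool) -> str:
    """Toy two-pass parser: nowiki handling plus heading detection.
    The pass order is the whole point of the bug."""
    stash = {}
    if unwrap_nowiki_early:
        # Buggy order: dropping the nowiki tags first leaves a bare
        # ==...== line, which the heading pass then promotes to <h2>.
        src = NOWIKI.sub(lambda m: m.group(1), src)
    else:
        # MediaWiki-like order: hide nowiki bodies behind placeholders
        # so the heading pass never sees the '==' tokens.
        def hide(m):
            key = f"\x01{len(stash)}\x01"
            stash[key] = m.group(1)
            return key
        src = NOWIKI.sub(hide, src)
    src = HEADING.sub(lambda m: f"<h2>{m.group(1)}</h2>", src)
    for key, body in stash.items():
        src = src.replace(key, body)
    return src

text = "<nowiki>==Vandalism==</nowiki>"
print(parse(text, unwrap_nowiki_early=False))  # ==Vandalism==
print(parse(text, unwrap_nowiki_early=True))   # <h2>Vandalism</h2>
```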
I thought I would take a look at this and have noticed quite a lot of commented-out code regarding UNIQ, so I reinstated it to see what happens
The example I was specifically tracking down was en.wikipedia.org/wiki/Template:Party of European Socialists/meta/color
It does seem to work with the current code (this is due to the nowiki text being 'escaped')
I tracked things to Xop_tblw_wkr.java Atrs_make
This routine essentially finds all the tokens associated with the attributes of the table element, works out where they start and end, and then throws them away.
For \<nowiki> tokens, this loses the information needed to render them correctly.
Instead, I took the tokens identified and effectively passed them through Xot_tmpl_wtr.Write
This seemed to work in the short term
However, I believe there is an underlying issue with the table tokens - they all assume that they refer to the original source. Using the above approach, I think the object prv_tblw should be adjusted not only for range but also for the potentially new, different-sized source (Or am I just rambling?)
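A small sketch of that bookkeeping concern: tokens that carry only (start, end) offsets become stale as soon as the source they point into is rewritten to a different length, so every later token (and presumably prv_tblw) needs shifting by the size delta (hypothetical names, not XOWA's classes):

```python
from dataclasses import dataclass

@dataclass
class Token:
    """A toy source-range token: it carries no text of its own, only
    offsets into whatever source string it was lexed from."""
    start: int
    end: int
    def text(self, src: str) -> str:
        return src[self.start:self.end]

def splice(src: str, tok: Token, replacement: str, others: list) -> str:
    """Replace tok's range with new text and shift every later token's
    offsets by the length difference, so they stay valid against the
    rewritten source."""
    delta = len(replacement) - (tok.end - tok.start)
    out = src[:tok.start] + replacement + src[tok.end:]
    for t in others:
        if t.start >= tok.end:
            t.start += delta
            t.end += delta
    return out

src = 'class="navbox" style="width:1%"'
attrs = Token(0, 14)    # covers: class="navbox"
style = Token(15, 31)   # covers: style="width:1%"
src2 = splice(src, attrs, 'class="navbox-group"', [style])
# style's range now points at the right text in the rewritten source
```

Without the shifting step, `style` would still point at offsets 15..31 of the new string and read garbage, which is the kind of mismatch the comment above is worried about.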
> I thought I would take a look at this and have noticed quite a lot of commented-out code regarding UNIQ, so I reinstated it to see what happens
Yeah, I added this a while ago. I forget why I left it commented (probably did not want to risk changing behavior)
Let me put it on tab for this weekend. Thanks.
The following wikitext:
In the Wikipedia sandbox this produces:
XOWA produces:
Ignoring the red error, note the presence of the text '\<nowiki>' and '\</nowiki>' in the left-hand column
The handling of \<nowiki> does not seem quite correct.