desb42 opened this issue 5 years ago
Ugh... This sounds like a parser issue. Is there an existing page where you're seeing this error? (I just want to get an idea of how widespread this issue is)
As for the actual fix, I'll have to look at the nowiki implementation. This is a particularly complicated piece of code that I wrote early in the XOWA parser implementation. It's possible that either my impersonation wasn't good enough, or MediaWiki changed something recently.
I'll look again at the code later this week, but depending on how widespread the above is, this may be my highest priority.
Thanks!
I stumbled across it when looking at Template:Reflist/doc - that is, I found it when looking at the documentation for Template:Reflist. I cannot tell how widespread it is; however, I suspect it is an edge case
I have just scanned all the enwiki html databases (18 of them) and the only one with \<nowiki> in it seems to be 1965–66_TSV_1860_Munich_season
Thanks for the follow-up.
I found the issue. It's related to the <tag> function. The simplified example wikitext would be the following:
{{#tag:pre|<nowiki>A<b>B</b></nowiki>}}
... which outputs nowiki tags
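For reference, MediaWiki's expected output here is the inner text with its markup HTML-escaped and no literal nowiki tags in sight. A toy sketch of that escaping step (illustrative Python, not XOWA's actual code):

```python
import html
import re

def render_nowiki(wikitext: str) -> str:
    """Replace each <nowiki>...</nowiki> span with its HTML-escaped
    contents, so the inner markup is shown literally but the nowiki
    tags themselves never reach the output."""
    def escape_span(match: re.Match) -> str:
        return html.escape(match.group(1), quote=False)
    return re.sub(r"<nowiki>(.*?)</nowiki>", escape_span, wikitext,
                  flags=re.DOTALL)

# Inner markup is escaped; the tags are gone.
print(render_nowiki("<nowiki>A<b>B</b></nowiki>"))  # A&lt;b&gt;B&lt;/b&gt;
```

The bug described above is that the literal `<nowiki>` tags survive into the HTML instead of being consumed at this stage.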
This behavior is caused by the tag function wrapping the original contents in a UNIQ block and unwrapping it later. I'll have to look at the MediaWiki code later to see what the proper fix is. A sloppy proof-of-concept hack would be to make the following change to https://github.com/gnosygnu/xowa/blob/master/400_xowa/src/gplx/xowa/xtns/pfuncs/strings/Pfunc_tag.java#L47
if (args_len > 0) { // handle no args; EX: "{{#tag:ref}}" -> "<ref></ref>"
    byte[] temp = Pf_func_.Eval_arg_or_empty(ctx, src, caller, self, args_len, 0);
    temp = ctx.Wiki().Parser_mgr().Main().Parse_text_to_html(Xop_ctx.New__sub__reuse_page(ctx), temp);
    tmp_bfr.Add(temp);
}
However, this won't work on a permanent basis because the Main() parser should not be invoked in nested calls
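For context, MediaWiki's own parser handles extension tags with strip markers: tag bodies are swapped out for unique `UNIQ...QINU` placeholders before the main parse and swapped back in afterwards, so nested parsing never touches them. A toy sketch of the mechanism (the marker format here is illustrative, not MediaWiki's exact one):

```python
import re

def strip_nowiki(src: str, store: dict) -> str:
    """Replace each <nowiki> span with a UNIQ marker, stashing the body
    so the main parse never sees it."""
    def stash(m: re.Match) -> str:
        key = f"\x7fUNIQ-nowiki-{len(store):08d}-QINU\x7f"
        store[key] = m.group(1)
        return key
    return re.sub(r"<nowiki>(.*?)</nowiki>", stash, src, flags=re.DOTALL)

def unstrip(src: str, store: dict) -> str:
    """Restore stashed bodies after the main parse has run."""
    for key, body in store.items():
        src = src.replace(key, body)
    return src

store = {}
marked = strip_nowiki("{{#tag:pre|<nowiki>==not a heading==</nowiki>}}", store)
# ... the main parser runs over `marked` and never sees the == tokens ...
restored = unstrip(marked, store)
```

The double-wrap failure described above would correspond to unstripping at the wrong stage, so the markers get re-wrapped and the literal tags leak into the output.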
I'll comment again here when I have a more robust fix.
On another note, how do you scan the html databases? I assume you have some ad hoc code that un-hzips each html page and then scans the full text? If so, how long does that take? I'd imagine it would take at least 2+ hours per scan (unless you're saving the un-hzipped content as files somewhere)
To scan the html, I have a simple python script that does essentially as you describe
see the gist checkhtml.py
On the machine I use, this takes about 30 mins. This produces 6059 entries
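For anyone without the gist at hand, a scan of that sort might look roughly like this (the table and column names are guesses for illustration, not XOWA's actual schema):

```python
import sqlite3

SEARCH = b"<nowiki>"

def scan_db(path: str) -> list:
    """Return page ids whose stored HTML still contains a literal
    <nowiki> tag. Table/column names here are illustrative."""
    hits = []
    conn = sqlite3.connect(path)
    try:
        for page_id, body in conn.execute("SELECT page_id, html FROM page"):
            # Leaked tags like these tend to sit in the stored text as-is,
            # so a raw substring scan is enough to find them.
            if isinstance(body, str):
                body = body.encode("utf-8", "replace")
            if SEARCH in body:
                hits.append(page_id)
    finally:
        conn.close()
    return hits

# hits = scan_db("enwiki-html.000.sqlite3")  # filename is a guess
```

A per-database loop over the 18 enwiki files would just call scan_db on each path and print the matches.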
Cool. This should pick up most of the errors, since they aren't hzipped.
I'll give the python script a try when I get home later. It's interesting that your script is relatively concise yet powerful. One day, when I get rid of hzip, it'll be pretty useful in scanning through all the html pages
I have just found an instance of this \<nowiki> problem:
<th scope="row" class="navbox-group" style="background: white;
-moz-box-shadow: inset 2px 2px 0 <nowiki>#F0001C</nowiki>, inset -2px -2px 0 <nowiki>#F0001C</nowiki>;
-webkit-box-shadow: inset 2px 2px 0 <nowiki>#F0001C</nowiki>, inset -2px -2px 0 <nowiki>#F0001C</nowiki>;
box-shadow: inset 2px 2px 0 <nowiki>#F0001C</nowiki>, inset -2px -2px 0 <nowiki>#F0001C</nowiki>;;width:1%">
Note the presence of many \<nowiki> tags
Looking at the wikitext the area under discussion is {{Party of European Socialists}}
This in turn contains three {{Party of European Socialists/meta/color}} entries
And that template contains the \<nowiki> markup
I think it needs a little boost in priority
Thanks for the example. Will take a look at it this weekend, but nowiki debugging always gives me a headache.
And here's another \<nowiki> example
The wikitext of this section is:
== CFI and vandalism ==
Now this is a section CFI could do well without:
<div style="border-left: 1px solid #C00; border-left-width: 3px; padding-left: .5em; margin-left: 2em;">
<nowiki>==Vandalism==</nowiki>
From time to time, various parties will insert material into Wiktionary which clearly has nothing
XOWA is treating the ==Vandalism== as a header; MediaWiki treats it just as text
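A toy sketch of why pass order matters here: if the nowiki tags are unwrapped before heading detection, the bare `==...==` line gets promoted to a heading, but if the body is hidden behind a placeholder first, it survives as plain text (illustrative Python, not either parser's real code):

```python
import re

NOWIKI = re.compile(r"<nowiki>(.*?)</nowiki>", re.DOTALL)
HEADING = re.compile(r"^==\s*(.+?)\s*==$", re.MULTILINE)

def parse(src: str, unwrap_nowiki_early: bool) -> str:
    """Toy two-pass parser: nowiki handling plus heading detection.
    The pass order is the whole point of the bug."""
    stash = {}
    if unwrap_nowiki_early:
        # Buggy order: dropping the nowiki tags first leaves a bare
        # ==...== line, which the heading pass then promotes to <h2>.
        src = NOWIKI.sub(lambda m: m.group(1), src)
    else:
        # MediaWiki-like order: hide nowiki bodies behind placeholders
        # so the heading pass never sees the '==' tokens.
        def hide(m):
            key = f"\x01{len(stash)}\x01"
            stash[key] = m.group(1)
            return key
        src = NOWIKI.sub(hide, src)
    src = HEADING.sub(lambda m: f"<h2>{m.group(1)}</h2>", src)
    for key, body in stash.items():
        src = src.replace(key, body)
    return src

text = "<nowiki>==Vandalism==</nowiki>"
print(parse(text, unwrap_nowiki_early=False))  # ==Vandalism==
print(parse(text, unwrap_nowiki_early=True))   # <h2>Vandalism</h2>
```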
I thought I would take a look at this and have noticed quite a lot of commented-out code regarding UNIQ, so I reinstated it to see what happens
The example I was specifically tracking down was en.wikipedia.org/wiki/Template:Party of European Socialists/meta/color
It does seem to work with the current code (this is due to the nowiki text being 'escaped')
I tracked things to Xop_tblw_wkr.java Atrs_make
This routine essentially finds all the tokens associated with the attributes of the table element, works out where they start and end, and then throws them away.
For \<nowiki> tokens, this loses the information needed to render them correctly.
Instead, I took the tokens identified and effectively passed them through Xot_tmpl_wtr.Write
This seemed to work in the short term
However, I believe there is an underlying issue with the table tokens - they all assume that they refer to the original source. Using the above approach, I think the object prv_tblw should be adjusted not only for range but also for the potentially new, different-sized source (Or am I just rambling?)
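A small sketch of that bookkeeping concern: tokens that carry only (start, end) offsets become stale as soon as the source they point into is rewritten to a different length, so every later token (and presumably prv_tblw) needs shifting by the size delta (hypothetical names, not XOWA's classes):

```python
from dataclasses import dataclass

@dataclass
class Token:
    """A toy source-range token: it carries no text of its own, only
    offsets into whatever source string it was lexed from."""
    start: int
    end: int
    def text(self, src: str) -> str:
        return src[self.start:self.end]

def splice(src: str, tok: Token, replacement: str, others: list) -> str:
    """Replace tok's range with new text and shift every later token's
    offsets by the length difference, so they stay valid against the
    rewritten source."""
    delta = len(replacement) - (tok.end - tok.start)
    out = src[:tok.start] + replacement + src[tok.end:]
    for t in others:
        if t.start >= tok.end:
            t.start += delta
            t.end += delta
    return out

src = 'class="navbox" style="width:1%"'
attrs = Token(0, 14)    # covers: class="navbox"
style = Token(15, 31)   # covers: style="width:1%"
src2 = splice(src, attrs, 'class="navbox-group"', [style])
# style's range now points at the right text in the rewritten source
```

Without the shifting step, `style` would still point at offsets 15..31 of the new string and read garbage, which is the kind of mismatch the comment above is worried about.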
> I thought I would take a look at this and have noticed quite a lot of commented-out code regarding UNIQ, so I reinstated it to see what happens
Yeah, I added this a while ago. I forget why I left it commented (probably did not want to risk changing behavior)
Let me put it on tab for this weekend. Thanks.
The following wikitext:
In the Wikipedia sandbox this produces:
XOWA produces:
Ignoring the red error, note the presence of the text '\<nowiki>' and '\</nowiki>' in the left-hand column
The handling of \<nowiki> does not seem quite correct.