gnosygnu / xowa

xowa offline wiki application
Other
375 stars 41 forks source link

replace jtidy with vnu #417

Open desb42 opened 5 years ago

desb42 commented 5 years ago

This is a proposed replacement to jtidy

Using the Nu Html Checker at https://github.com/validator/validator/releases

I have added the libraries vnu.jar and sax.jar to the project I have also borrowed some logic from https://github.com/wikimedia/html5depurate

by changing Xoh_page_wtr_wkr.java I have been able to test this.

Most of it works (from a tidy point of view); however ...

my coding does not cope well with Unicode I suspect I am not using the correct encoding - I dont know how to do that

Please take a look at my branch depurate

desb42 commented 5 years ago

found a fix for the encoding (have updated my branch)

gnosygnu commented 5 years ago

Wow! That's quite a bit of work at https://github.com/desb42/xowa/commits/depurate

I parked my work from #413. Let me start working on this now.

Thanks!

gnosygnu commented 5 years ago

Hey, so I incorporated your change in the commit above. Some notes:

Finally, I'm going to start tagging issues with code submission as misc - commit submitted. This will give them higher priority, which is the least I can do after effort was spent putting together a patch. Feel free to add this tag to new issues (or old ones I missed).

Thanks again for all the work!


How to check

desb42 commented 5 years ago

I took the wikitext of home/wiki/Diagnostics/HTML_Tidy and chucked it at en.wikipedia.org/wiki/Wikipedia:Sandbox and got results surprisingly close to vnu tidy

That is, some of the anticipated behaviours have changed (I think)

I think the 'bold and DT / DD ' should be resolved by coding changes to the 'state-machine' of Xop_tblw_wkr (rather than a tidy)

Also in reviewing the code I have come across an intriguing issue, multiple \ entries in the generated html (but only with some combinations)

This cut down version produces the effect

== merge nested tables and take 1st table's attributes (http://en.wiktionary.org/wiki/Kazakhstan#Declension) ==
<code>fails if "text does not line up on left"</code><br/>

{| width="40%" class="wikitable"
|-
|
{|
|-
{| style="float: right;"
|
|- 
|-
| text should line up on left
|}
|}

== table inside span should be reparented correctly (http://ca.wikipedia.org/wiki/Evolució -- Infobox) ==
<code>fails if "text should be visible" is missing</code><br/>

<span style="z-index:-1; position:absolute;">
{|
|-
|text should be visible
|-
|}
</span>

== div not inside td should be reparented correctly (http://fr.wikivoyage.org/wiki/Marrakech -- Infobox) ==
<code>fails if "should appear above table" is inside table</code><br/>

{|class=wikitable 
|-
| cell_1
|-
<div>
{|
| should appear above table
|}
</div>
|-   
| cell_2
|}

A sample of the generated html

</p><p><span style="z-index:-1; position:absolute;"><table>
        <tbody><tr>
          <td>text should be visible
</td>
</tr>
</tbody></table></span></p><span style="z-index:-1; position:absolute;"><p class="mw-empty-elt">
</p></span><span style="z-index:-1; position:absolute;"><p><br />
</p></span><span style="z-index:-1; position:absolute;"><h2><span class="mw-headline" 

weird

desb42 commented 5 years ago

I have just been comparing the jtidy and vnu versions of home/wiki/Diagnostics/JTidy and can see no visual difference Did you?

gnosygnu commented 5 years ago

I took the wikitext of home/wiki/Diagnostics/HTML_Tidy and chucked it at en.wikipedia.org/wiki/Wikipedia:Sandbox and got results surprisingly close to vnu tidy

Wow, I didn't even think that would be the case. (I automatically assumed vnu would be "backward compatible with tidy). Nice thinking on your side

I checked it now, and most of the items are the same. However, the following items look different

Also in reviewing the code I have come across an intriguing issue, multiple entries in the generated html (but only with some combinations)

Ok. But that's a vnu issue, right?

I have just been comparing the jtidy and vnu versions of home/wiki/Diagnostics/JTidy and can see no visual difference

Yeah, I should have mentioned it above. JTidy is surprisingly similar.

I'll try to look at this more later. I think having vnu as a new tidy engine is a good start. Once we resolve the issues above, we can make it the default

desb42 commented 5 years ago

bold and DT / DD

I believe this to be a parser error (not a tidy issue)

;'''bold'''
:should not be bold

produces the correct dl/dt/dd combo whereas

;'''bold'''
:*should not be bold

Does not

I think it comes down to the confusion over the colon/asterisk.

The one line

:*should not be bold

produces the right dl/dd html

desb42 commented 5 years ago

empty list items should be removed

vnu and mediawiki add an extra class to empty items (and additional css is used to hide them) the class is class="mw-empty-elt" (an extra class is also added to \)

desb42 commented 5 years ago

table inside span should be reparented correctly

the wikitext on its own

<span style="z-index:-1; position:absolute;">
{|
|-
|text should be visible
|-
|}
</span>

produces the correct (reparented?) html with vnu

I think the major issue to resolve is the one I termed intriguing and you rightly indicated, somehow vnu is adding extra absolute \s

desb42 commented 5 years ago

Having done some more investigation, I have found what is causing this particular intriguing situation.

The first table entry merge nested tables and take 1st table's attributes has three open table {| entries and only two closing |}

Adding one more close |} makes all the other fixes work

It seems that vnu's attempt to fix the unclosed table causes internal strife. Although it does fix the table in a very similar fashion to mediawiki, subsequent things are messed up

CompatibilitySerializer.java is the file to fix; however still trying to find how