Open desb42 opened 5 years ago
found a fix for the encoding (have updated my branch)
Wow! That's quite a bit of work at https://github.com/desb42/xowa/commits/depurate
I parked my work from #413. Let me start working on this now.
Thanks!
Hey, so I incorporated your change in the commit above. Some notes:
Finally, I'm going to start tagging issues with code submission as misc - commit submitted
. This will give them higher priority, which is the least I can do after effort was spent putting together a patch. Feel free to add this tag to new issues (or old ones I missed).
Thanks again for all the work!
How to check
vnu
I took the wikitext of home/wiki/Diagnostics/HTML_Tidy and chucked it at en.wikipedia.org/wiki/Wikipedia:Sandbox and got results surprisingly close to vnu tidy
That is, some of the anticipated behaviours have changed (I think)
I think the 'bold and DT / DD ' should be resolved by coding changes to the 'state-machine' of Xop_tblw_wkr (rather than a tidy)
Also in reviewing the code I have come across an intriguing issue, multiple \ entries in the generated html (but only with some combinations)
This cut down version produces the effect
== merge nested tables and take 1st table's attributes (http://en.wiktionary.org/wiki/Kazakhstan#Declension) ==
<code>fails if "text does not line up on left"</code><br/>
{| width="40%" class="wikitable"
|-
|
{|
|-
{| style="float: right;"
|
|-
|-
| text should line up on left
|}
|}
== table inside span should be reparented correctly (http://ca.wikipedia.org/wiki/Evolució -- Infobox) ==
<code>fails if "text should be visible" is missing</code><br/>
<span style="z-index:-1; position:absolute;">
{|
|-
|text should be visible
|-
|}
</span>
== div not inside td should be reparented correctly (http://fr.wikivoyage.org/wiki/Marrakech -- Infobox) ==
<code>fails if "should appear above table" is inside table</code><br/>
{|class=wikitable
|-
| cell_1
|-
<div>
{|
| should appear above table
|}
</div>
|-
| cell_2
|}
A sample of the generated html
</p><p><span style="z-index:-1; position:absolute;"><table>
<tbody><tr>
<td>text should be visible
</td>
</tr>
</tbody></table></span></p><span style="z-index:-1; position:absolute;"><p class="mw-empty-elt">
</p></span><span style="z-index:-1; position:absolute;"><p><br />
</p></span><span style="z-index:-1; position:absolute;"><h2><span class="mw-headline"
weird
I have just been comparing the jtidy and vnu versions of home/wiki/Diagnostics/JTidy and can see no visual difference Did you?
I took the wikitext of home/wiki/Diagnostics/HTML_Tidy and chucked it at en.wikipedia.org/wiki/Wikipedia:Sandbox and got results surprisingly close to vnu tidy
Wow, I didn't even think that would be the case. (I automatically assumed vnu would be "backward compatible with tidy). Nice thinking on your side
I checked it now, and most of the items are the same. However, the following items look different
Also in reviewing the code I have come across an intriguing issue, multiple entries in the generated html (but only with some combinations)
Ok. But that's a vnu issue, right?
I have just been comparing the jtidy and vnu versions of home/wiki/Diagnostics/JTidy and can see no visual difference
Yeah, I should have mentioned it above. JTidy is surprisingly similar.
I'll try to look at this more later. I think having vnu as a new tidy engine is a good start. Once we resolve the issues above, we can make it the default
bold and DT / DD
I believe this to be a parser error (not a tidy issue)
;'''bold'''
:should not be bold
produces the correct dl/dt/dd combo whereas
;'''bold'''
:*should not be bold
Does not
I think it comes down to the confusion over the colon/asterisk.
The one line
:*should not be bold
produces the right dl/dd html
empty list items should be removed
vnu and mediawiki add an extra class to empty items (and additional css is used to hide them)
the class is class="mw-empty-elt"
(an extra class is also added to \
table inside span should be reparented correctly
the wikitext on its own
<span style="z-index:-1; position:absolute;">
{|
|-
|text should be visible
|-
|}
</span>
produces the correct (reparented?) html with vnu
I think the major issue to resolve is the one I termed intriguing and you rightly indicated, somehow vnu is adding extra absolute \s
Having done some more investigation, I have found what is causing this particular intriguing situation.
The first table entry merge nested tables and take 1st table's attributes
has three open table {|
entries and only two closing |}
Adding one more close |}
makes all the other fixes work
It seems that vnu's attempt to fix the unclosed table causes internal strife. Although it does fix the table in a very similar fashion to mediawiki, subsequent things are messed up
CompatibilitySerializer.java is the file to fix; however still trying to find how
This is a proposed replacement to jtidy
Using the Nu Html Checker at https://github.com/validator/validator/releases
I have added the libraries vnu.jar and sax.jar to the project I have also borrowed some logic from https://github.com/wikimedia/html5depurate
by changing Xoh_page_wtr_wkr.java I have been able to test this.
Most of it works (from a tidy point of view); however ...
my coding does not cope well with Unicode I suspect I am not using the correct encoding - I dont know how to do that
Please take a look at my branch depurate