gnosygnu / xowa

xowa offline wiki application

spurious characters on page #377

Closed desb42 closed 5 years ago

desb42 commented 5 years ago

I have just merged the latest changes with my version and started to get some spurious characters appearing. I have now rebuilt a baseline xowa using xowa_get_and_make.sh and the same thing occurs. [image: extrachars] There is a line

</ </ </t

that should not be there

gnosygnu commented 5 years ago

Ugh. Sorry about that. This was related to #366 and this erroneous belief: "Fixed a variety of bugs related to using supplementary characters (codepoints that are 3 bytes and 4 bytes: think Chinese characters)"

I thought that 3-byte UTF-8 sequences were 2 Java chars. Instead, they fit in 1 Java char, b/c UTF-8 defines the last code point for 3 bytes as 0xFFFF. This became a problem for https://en.wikipedia.org/wiki/Template:Infobox_U.S._state because it uses the • character, which is a 3-byte UTF-8 sequence. A lot of weird things ensue from that bad assumption. See sample wikitext below.

I fixed it in the commit above. Thanks for the quick report, as it helped me narrow down the change to the last week or two, and sorry again for the error on my side.


{{Infobox
| data1 = {{Infobox|child=yes
  | label1 = &nbsp;•
  | data1 = ba
}}
}}
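
For anyone following along, the 3-byte vs. 4-byte distinction can be checked with a few lines of plain Java. This is only a minimal sketch, not xowa code: the class `Utf8CharWidthDemo` and the helpers `utf8Len` and `describe` are hypothetical names made up for illustration. It assumes the character in the sample above is the bullet (U+2022) and compares it with a CJK character and a supplementary one.

```java
import java.nio.charset.StandardCharsets;

final class Utf8CharWidthDemo {
    // Number of UTF-8 bytes needed to encode one Unicode code point.
    static int utf8Len(int codePoint) {
        if (codePoint <= 0x7F) return 1;
        if (codePoint <= 0x7FF) return 2;
        if (codePoint <= 0xFFFF) return 3; // everything up to 0xFFFF needs at most 3 bytes
        return 4;                          // supplementary code points (above 0xFFFF)
    }

    // One-line summary for a string that holds a single code point.
    static String describe(String s) {
        int cp = s.codePointAt(0);
        return String.format("U+%04X utf8Bytes=%d javaChars=%d",
            cp,
            s.getBytes(StandardCharsets.UTF_8).length,
            Character.charCount(cp));
    }

    public static void main(String[] args) {
        System.out.println(describe("\u2022"));       // bullet: 3 bytes, 1 char
        System.out.println(describe("\u4E2D"));       // CJK ideograph: 3 bytes, 1 char
        System.out.println(describe("\uD83D\uDE00")); // supplementary: 4 bytes, 2 chars
    }
}
```

Only the 4-byte case takes 2 Java chars (a surrogate pair), so code that skips ahead 2 chars for every 3-byte sequence would overshoot by one char per such character.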

desb42 commented 5 years ago

UTF-8 does my head in. I too had narrowed it down to some info box, but had not identified the utf issue.

Great work

btw on internationalisation, the icu4j package really should be updated; please see my comments in #237

gnosygnu commented 5 years ago

UTF-8 does my head in

Yeah, some of my conventions are also incorrect (Utf16 should be renamed to Unicode and several functions should be moved to Utf8_). I'll refactor that class sometime this weekend.

I too had narrowed it down to some info box, but had not identified the utf issue

Nice. On my side, I knew the results were so weird, it had to be low-level. It was either that or the Luaj change (and the Luaj change I eliminated earlier).

btw on internationalisation, the icu4j package really should be updated

Oops. Sorry. I tagged it now with my new crayon set. I'll have it done this weekend. Feel free to bump anything that's open. If I tag it with [schedule 1 - within days], then there's a higher probability that it might be true. :disappointed: