desb42 closed this issue 5 years ago
Ugh. Sorry about that. This was related to #366 and this erroneous belief: "Fixed a variety of bugs related to using supplementary characters (codepoints that are 3 bytes and 4 bytes: think Chinese characters)"
I thought that 3-byte UTF-8 sequences were 2 Java chars. Instead, they fit in 1 Java char, b/c the last code point that UTF-8 encodes in 3 bytes is 0xFFFF. This became a problem for https://en.wikipedia.org/wiki/Template:Infobox_U.S._state because it uses the character •, which is a 3-byte UTF-8 sequence. A lot of weird things ensue from that bad assumption. See the sample wikitext below.
I fixed it in the commit above. Thanks for the quick report, as it helped me narrow down the change to the last week or two, and sorry again for the error on my side.
{{Infobox
| data1 = {{Infobox|child=yes
| label1 = •
| data1 = ba
}}
}}
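For reference, the byte-count versus char-count distinction described above can be sketched in Java as follows. This is only an illustrative sketch, not the code from the fix: the class name `Utf8Widths` and its two helper methods are hypothetical and are not XOWA's actual functions. It assumes the input is a valid Unicode code point (not a lone surrogate) and relies only on the standard `Character.charCount`, `Character.toChars`, and `String.getBytes` calls.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of XOWA.
final class Utf8Widths {
    // Number of UTF-8 bytes needed to encode one code point.
    // Assumes codePoint is a valid scalar value (0..0x10FFFF, not a surrogate).
    static int utf8ByteLength(int codePoint) {
        if (codePoint < 0x80) return 1;      // ASCII
        if (codePoint < 0x800) return 2;     // up to U+07FF
        if (codePoint < 0x10000) return 3;   // up to U+FFFF: still one Java char
        return 4;                            // supplementary: two Java chars
    }

    // Number of Java chars (UTF-16 code units) needed for one code point.
    static int javaCharLength(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // 'A', e-acute, the bullet from the infobox, a CJK ideograph, an emoji
        int[] samples = {0x41, 0xE9, 0x2022, 0x4E2D, 0x1F600};
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            System.out.printf("U+%04X utf8Bytes=%d javaChars=%d%n",
                cp, s.getBytes(StandardCharsets.UTF_8).length, s.length());
        }
    }
}
```

The point of the sketch is that only 4-byte sequences need a surrogate pair. The bullet (U+2022) and common Chinese characters such as U+4E2D take 3 bytes in UTF-8 but are a single Java char.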
UTF-8 does my head in
I too had narrowed it down to some info box, but had not identified the utf issue
Great work
btw on internationalisation, the icu4j package really should be updated
please see my comments in #237
UTF-8 does my head in
Yeah, some of my conventions are also incorrect (Utf16 should be renamed to Unicode and several functions should be moved to Utf8_). I'll refactor that class sometime this weekend.
I too had narrowed it down to some info box, but had not identified the utf issue
Nice. On my side, I knew the results were so weird, it had to be low-level. It was either that or the Luaj change (and the Luaj change I eliminated earlier)
btw on internationalisation, the icu4j package really should be updated
Oops. Sorry. I tagged it now with my new crayon set. I'll have it done this weekend. Feel free to bump anything that's open. If I tag it with [schedule 1 - within days], then there's a higher probability that it might be true. :disappointed:
I have just merged the latest changes with my version and started to get some spurious characters appearing. I have now rebuilt a baseline xowa using xowa_get_and_make.sh and the same thing occurs. There is a line
that should not be there