spheredev / neosphere

A lightweight game engine and development platform using JavaScript for game coding, based on the original Sphere engine by Chad Austin but with a redesigned, modern API and brand-new command-line development tools.
http://www.spheredev.org/
Other
105 stars 15 forks source link

Unicode/special characters no longer displaying correctly #36

Closed DaVince closed 9 years ago

DaVince commented 9 years ago

While I was messing with my old Pokémon games, I suddenly noticed that the é character is no longer displaying properly. In any game. Regardless of whether I set the character encoding in my JS files to UTF-8 or ISO-8859-1.

At first, I thought maybe the character was just missing from the system font, but nope. On any font. I wrote this test case:

function game() {
  SetFrameRate(10);
  var systemfont = GetSystemFont();
  var tahoma8 = LoadFont("tahoma8.rfn");
  while (!IsKeyPressed(KEY_ESCAPE)) {
    systemfont.drawText(20, 20, "Pokémon, not Pokèmon.");
    tahoma8.drawText(20, 40, "Pokémon, not Pokèmon.");
    FlipScreen();
  }
}

The result is shown in the screenshot below. snapshot-1433781751

On original Sphere 1.5, it only works with ISO-8859-1 encoded text files and it will look exactly the same when using UTF-8 text files.

fatcerberus commented 9 years ago

Interesting that you'd get different results depending on encoding. Duktape won't except ANSI text at all so minisphere has to convert it before passing it to the compiler (this was a big key to getting KR working). So the results should be identical.

In any case, I'm thinking the issue is that duk_require_string() is returning a UTF-8 string, while Sphere RFN fonts are set up for ISO-8859-1, so you get garbage characters.

It shouldn't be too hard to deal with, I'd just have to convert the string in the opposite direction when rendering. It might cause a performance hit though.

DaVince commented 9 years ago

The results are identical for minisphere, but in neither case is it displaying é or è. That's the issue I'm having right now. I probably overexplained a bit.

fatcerberus commented 9 years ago

What I think I'll do is, when running through text to be rendered, convert known characters back to their ANSI equivalents, and if no equivalent exists, treat it as some rarely-used character like 0xFF. This will ensure unknown Unicode characters get rendered as boxes while allowing normal extended ASCII to render as normal, all without incurring a noticeable performance impact.

We really should have a Unicode-capable text rendering facility these days, but the Sphere Font object is not the place to do it. RFN fonts are 8-bit so, thems the breaks.

DaVince commented 9 years ago

Kinda funny. I definitely Sphere having support for thousands of characters in an rfn file, but I simply can't find the option now. But yeah, now that I'm also a bit more knowledgeable about UTF-8 and all that, it seems pretty obvious why the rfn file format wouldn't be able to contain the full Unicode range.

fatcerberus commented 9 years ago

Yeah, I'll look at the code tonight and try to come up with a workable solution. With any luck there will be a fix waiting to be pulled in the morning. :smile:

I love working with these old games, by the way. They tend to flush out bugs that never show up with more modern efforts. Since Sphere was a lot newer back then, proper techniques hadn't been refined and games did a lot of things they really shouldn't have. It's quite rewarding to get them working.

Looks like v1.2.3 is going to have a ton of fixes in it by the time this is over.

fatcerberus commented 9 years ago

Try the latest master. This should now be fixed: c036e787d2db036b2ee7f54395471132c783fb2e

It was quite an adventure fixing this, all the UTF-8 implementations out there (including Duktape's) are horribly complex, which is a dealbreaker for me as I won't add code I can't personally understand. Then I found this:

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

This guy is amazing, he simplified UTF-8 decoding to a single, very simple function and a static lookup table. Every other implementation I've seen uses a huge tower of if/else if statements to do the same thing.

The upshot of all this? As it required a switch to 32-bit characters internally, the engine is fully Unicode-ready under the hood. We'd just need a font format that supports it. :smiley:

DaVince commented 9 years ago

snapshot-1433884135 :smile:

DaVince commented 9 years ago

Okay, maybe this is not entirely fixed: it segfaults when I try to use some symbols, like ©, with the system font.

fatcerberus commented 9 years ago

That's weird, it should replace unknown characters with 1Ah (substitute char, a placeholder box in practice), but the copyright symbol is in the system font... Hm.

DaVince commented 9 years ago

My test case was Crap-RPG. I converted the JS file to UTF-8 and died by touching a wall in order to see the game over message. Then it segfaulted when trying to run either of these lines in the script:

textbox("You can now get the FULL version by donating €1000 to the Utjeputjestraat 86 in Amsterdam, the Netherlands!", 0.8);
textbox("Crap-RPG, ©DaVince 2005.\nFor more good and/or crappy games, go to\nhttp://DaVince.sitesled.com/", 0.75);

with a segfault (Address boundary error).

fatcerberus commented 9 years ago

Found it! Characters are decoded like this:

while (utf8decode(&utf8state, &cp, *text++));

Once a character is decoded, this returns UTF8_ACCEPT (0), which breaks the loop. Apparently for whatever reason it rejects the euro sign €, so the decoder returns UTF8_REJECT (12). Once the decoder is in that state, it remains there indefinitely. Which means it keeps incrementing the pointer forever until it goes out of bounds and crashes.

This fixes the crash:

while (utf8decode(&utf8state, &cp, *text++) > UTF8_REJECT);

However, now it's replacing the copyright symbol with a box, which definitely didn't happen before. I will keep investigating.

fatcerberus commented 9 years ago

Huh, apparently it's just Crap-RPG that it doesn't render the copyright symbol properly. Either that something screwed up in the UTF-8 conversion. It worked in Spectacles when I sent text with those characters to miniConsole. Either way, corrupt UTF-8 text should no longer crash the engine as of commit 53fec04e115c7addde475ba716b6665ef149ea9e.

DaVince commented 9 years ago

Considering that the euro symbol turned into something else after conversion, it wouldn't surprise me if there's another character somewhere in the script that trips it up. But I wouldn't know where.

...I just tested in Pokémon GS/SS (a modified version that properly runs on minisphere) and it's displaying boxes instead of the é character. Yet the little test program above works fine. Note that these files were also converted to UTF-8, though.

DaVince commented 9 years ago

Found it. font.drawText() works correctly, but font.drawTextBox() causes boxes to appear.

fatcerberus commented 9 years ago

Ah, thanks. That was a big help.

The bug is that the internal function word_wrap_text() converts the text back to ISO-8859-1, which results in invalid characters when draw_text() tries to decode it as UTF-8. Oops! I should have known the crash was suspicious.

fatcerberus commented 9 years ago

Okay, this should fix it once and for all: 49d9b25fadd0c11478b53fea04f67a5d61f61f04

DaVince commented 9 years ago

snapshot-1433890954

Almost; the euro sign in the third line is missing.

fatcerberus commented 9 years ago

Here's something weird: If I take a euro sign from Character Map and paste it into a script, I get the box. However, running Crap-RPG as-is, I get the proper symbol. It appears there might be two different codepoints for it?

DaVince commented 9 years ago

I checked, but there is only one (U+20AC). (Though there are two related ones they are obviously different.) I also type my one with AltGr+5, which should just produce a regular euro symbol anyway.

No idea why it's doing this.

DaVince commented 9 years ago

I spotted another broken character: the ™ symbol.

snapshot-1433891787

Now I wonder if there are more characters in there that randomly don't work. Or is the trademark symbol just not possible?

fatcerberus commented 9 years ago

I spotted the issue with the euro sign at least: The codepoint for it in Unicode is out of range of an RFN font. For the most part the first 255 characters in Unicode match ISO-8859-1, but this is one case where it isn't, so minisphere replaces it due to the font not having 8,365 glyphs in it :wink:. I suspect the trademark symbol is the same deal.

The reason Crap-RPG works is that apparently when the script was converted to UTF-8, the conversion encoded the Euro sign as U+0080, which happens to match its CP 1252 value but is a control character in Unicode. As a result Unicode-aware text editors won't even display it properly.

I'd have to look at the Windows-1252 codepage to see which characters map to codepoints > 255 and add special cases for them. This won't block 1.2.3, but keep the issue open because I do want to fix it at some point.

DaVince commented 9 years ago

As a result Unicode-aware text editors won't even display it properly.

Yep, I experienced this; Geany (my text editor) showed a wide box with the text CON in it, I believe. I neglected to mention this because I thought it was an irrelevant conversion error that I manually fixed.

At least the majority of this is fixed for now. Again, nice work. :)

fatcerberus commented 9 years ago

Just a quick update, I plan to fix this for the 1.3 release. :smile:

fatcerberus commented 9 years ago

@DaVince As of commit 7e0bfe4a53f87e1e04237afc1d59c02b6bd1eef3, this should be completely fixed. Try it out for me?

DaVince commented 9 years ago

Well, there's compilation errors again. Dunno how to handle them this time.

obj/spherefs.c:439:1: error: static declaration of 'resolve_path' follows non-static declaration
 resolve_path(sandbox_t* fs, const char* filename, const char* base_dir, ALLEGRO_PATH* *out_path, fs_type_t *out_fs_type)
 ^
In file included from obj/minisphere.h:31:0,
                 from obj/spherefs.c:1:
obj/spherefs.h:22:13: note: previous declaration of 'resolve_path' was here
 extern bool resolve_path    (sandbox_t* fs, const char* filename, const char* base_dir, ALLEGRO_PATH* *out_path, fs_type_t *out_fs_type);
             ^
scons: *** [obj/spherefs.o] Error 1
scons: building terminated because of errors.
fatcerberus commented 9 years ago

I think I need to raise the MSVC warning level, it really should be catching these. It was an attempted refactoring aborted halfway through because I realized I didn't need to expose the path resolver to outside code.

Removing the static declarator from resolve_path() in spherefs.c should do it. Sorry for the inconvenience.

DaVince commented 9 years ago

Well, this bug is fixed. Nice work! I've encountered other issues though with the latest release.

fatcerberus commented 9 years ago

le sign

The cycle never ends...