Closed JaydenSu closed 1 year ago
You need to provide a font containing the required glyphs and load it. Please check the text_unicode example.
Please also note that this is not a forum for questions; follow the required issues template before publishing new issues, or ask in the raylib Discord, where the community can probably help with this kind of question.
Thank you.
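For reference, here is a minimal sketch of that approach, following the pattern of the text_unicode example. The font path is a placeholder; any TTF/OTF that actually contains the required CJK glyphs will do:

```c
#include "raylib.h"

int main(void)
{
    InitWindow(800, 450, "CJK text sketch");

    // Extract the code points actually used by the string, then load
    // only those glyphs from a font known to contain them.
    const char *msg = u8"东东东";
    int count = 0;
    int *codepoints = LoadCodepoints(msg, &count);

    // "resources/NotoSansSC-Regular.ttf" is a placeholder path;
    // substitute any font file that ships the required glyphs.
    Font font = LoadFontEx("resources/NotoSansSC-Regular.ttf", 32, codepoints, count);
    UnloadCodepoints(codepoints);

    while (!WindowShouldClose())
    {
        BeginDrawing();
        ClearBackground(RAYWHITE);
        DrawTextEx(font, msg, (Vector2){ 20, 20 }, 32, 2, BLACK);
        EndDrawing();
    }

    UnloadFont(font);
    CloseWindow();
    return 0;
}
```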
By the way, the example from "text_unicode.c" doesn't work if I pass the string: u8"东东东"
PS: That is, in fact, we can say that raylib has no support for Unicode, and for UTF-8 encoding in particular.
You cannot say that. It depends on the availability of the fonts. The UTF8 encodings should be fine.
If you display those strings in hexadecimal, they should reveal that correct UTF8 is employed in both cases.
raylib is not responsible for the support of CJK glyphs in a chosen font. Font providers handle that or not. I suspect the default font used in raylib DrawText() does not support any glyphs except those graphical ones in u0000 - u00ff, and that's what you have demonstrated.
As Ramon suggests, ask for font solutions on Discord, although I would expect that your computer has one, because you can enter and see them.
You cannot say that. It depends on the availability of the fonts. The UTF8 encodings should be fine.
I can, I already said.
As Ramon suggests, ask for font solutions on Discord, although I would expect that your computer has one, because you can enter and see them.
The fonts that are in the examples work. But firstly, they do not work by default. And secondly, my apologies for the phrasing, but they work terribly.
Therefore, we can say that raylib does not, in fact, support the output of Unicode text, or it does, but only with very complex manipulations.
That is, you cannot, for example, receive an array of UTF-8-encoded bytes from a server and display it in the window. And this is a fact.
I'm sorry it didn't work for you. Please note raylib is a free and open source project, you can contribute with features, improvements or clearer examples. Contributions are welcome.
@JaydenSu
I can see the calligraphy in
u8"东东东"
because I do have a font to do it (or GitHub is creating tiny images, which I do not believe is the case). That does not mean that everyone on GitHub can see it.
In email, the situation is worse. There are many characters I cannot use because different readers (or the list servers that forward to them) do not have the fonts or do not recognize/preserve messages in UTF-8.
That I have the fonts on my machine already is happenstance, although perhaps likely these days.
There are other situations such as running a console application, where it is more difficult to get Unicode rendering even though I know the application is using UTF8.
raylib default fonts in examples is another such case. What raylib does not do is selective font fall-back and substitution when a Unicode code point is not satisfied in the current font and there are other (known) fonts that have glyphs for that code point. That is much to expect from raylib. Also from the underlying graphical libraries that are used for text display.
So would you prefer that it be declared raylib does not handle UTF8 at all? Then we'd have to select code pages and double-byte encoding schemes. Is that an improvement?
With UTF8 recognized and handled, at least there is a direction for expanding the support of Unicode code points in text graphics at some point. Are you prepared to dig into word-processing software, see how this is managed there, and find a comparable solution for raylib?
PS: I did paste u8"东东东" successfully into Microsoft Word, LibreOffice Writer, and Apache OpenOffice Writer documents. The last two are open source and all three use an open format (OOXML and ODF) that can be inspected to see how that is accomplished in document parameters.
PPS: You can compile UTF-8 into a program. How well are you able to output it in a shell/terminal application?
I understand. So, apparently, I'm too used to simply passing text in UTF-16 encoding to WinAPI functions, in particular GDI+ functions, and they themselves figure out what and how to draw "under the hood". Such functions can be called "easy to use", but in raylib this is, unfortunately, hell.
@JaydenSu "So, apparently, I'm too used to simply passing text in UTF-16 encoding to WinAPI functions, in particular GDI+ functions, and they themselves figured out what and how to draw 'under the hood'."
Yes, GDI is very high level in that respect, and there is an entire operating system behind it.
Raylib is not at that level; it is much simpler in terms of what it makes easy to accomplish. That's very different.
I decided to look at something like DirectX to see what it affords. This article is pretty scary compared to what raylib provides: https://learn.microsoft.com/en-us/archive/msdn-magazine/2013/october/directx-factor-text-formatting-and-scrolling-with-directwrite
The interesting thing is at the beginning. A font was named as part of setting up text. From the examples, it appears that font metrics are handled, as are a number of other features around types of fonts. This all works by using platform-specific facilities.
I am reminded that raylib's inspiration is work by Borland in the 1980s. None of these complexities and capabilities were prevalent in those times.
I am disappointed that your expectations are not satisfied. It would be useful to explain and demonstrate the successful use of Unicode across a variety of languages and glyph forms. That has not been done here. It is probably best not to suggest much about the ability to pass UTF-8 (or wide-character UTF-16) around.
I am still curious about this, and I found out some things about attempting to present text with OpenGL. I found this report; it provides much useful information.
https://learnopengl.com/In-Practice/Text-Rendering
It doesn't get anywhere close to dealing with UTF-8 and UTF-16, but it reveals the rather limited handling. There is emphasis on the creation of bitmaps for glyphs, and also on the use of TrueType for scaling glyphs without the awful pixelation that happens when a bitmap is enlarged. FreeType is suggested as a library of choice.
I am not certain why the ability to use that on native Windows seems to be suppressed, possibly for ideological reasons. There is a tolerable (BSD-like) license choice.
I have not examined raylib's rtext.c file very closely. The use of TrueType (TTF) is a compilation option. Also, the built-in default font is strictly raster (as one can tell by struggling with different sizes) and only supports the UTF-8 code points for graphic glyphs in the interval U+0000 to U+00FF. That accounts for the lack of support for additional code points by default, especially the CJK families, etc.
For handling TrueType, external/stb_rect_pack.h and external/stb_truetype.h are used. I notice that type int is used for code points, and I suspect that won't do well for Unicode code points beyond U+FFFF except when compiled for x64 (wild guess).
None of this is very encouraging.
https://en.wikipedia.org/wiki/2,147,483,647 Using a TTF font with the code points that you want to use is trivial and works; you just need to put some (minor) effort in. raylib is intended to be a LOW-level C library, not an all-in-one high-level library...
@chriscamacho If those libraries were explicit about uint32 rather than whatever the compiled-for model makes (signed) int, I would not have mentioned it. (C99 requires that int be at least 16 bits; it is long that is at least 32, by the way.) Of course, the largest used Unicode code point in UTF-8 is U+10FFFF as of Unicode 7.0. I surmise the scheme can go to U+1FFFFF without adding an additional first-byte rule to have a 5-byte UTF-8 sequence.
That's an incidental concern. If modification of rtext.c for this purpose is trivial, it would be useful to have it done in raylib, with or without any extension to the API. My impression is that compiling with TTF enabled is not the default.
It appears that the API has all the interfaces needed although I have some concern for what it would take to work with something like CJK with a ton of code points. A code-point vector would be quite lengthy and then there is the generation of different font sizes for rasterizing. (There's a related consideration with regard to IME and input of codes for such characters.) I think these concerns apply for other non-alphabetic languages.
Is there advice about all this somewhere? I am unfamiliar with the wiki. My neglect. Sorry.
Finally, I agree about the level at which raylib sits in terms of graphics, GUIs, and game-engine support. I admire the inspiration from BGI.
@JaydenSu There are examples of non-trivial Unicode usage in the raylib/examples/text/ folder. The *.png files there are informative. The only use of CJK glyphs seems to be from a raster font, unfortunately (or the TTF-generated rasters are up-scaled). See the text/resources/noto_cjk.png.
I can confirm it does not work. In my case
int codepoints[] = { 0x0391, 0x0101 };
int utf8Size = 2;
const char *text = CodepointToUTF8(codepoints[1], &utf8Size);
printf("%s", text); // prints the letter just fine
utf8Size makes no sense to me, because the standard notation is either
\U00001234
or
\u1234
Font boldnumber = LoadFontEx(otfpath2, 48, codepoints, 1); // <-- crash
will crash.
Font boldnumber = LoadFontEx(otfpath2, 48, NULL, 1024*8);
works just fine, despite a lot of memory being eaten.
@designerfuzzi
int codepoints[] = { 0x0391, 0x0101}; int utf8Size = 2; const char *text = CodepointToUTF8(codepoints[1], &utf8Size); printf("%s",text); //prints just fine the letter.
utf8Size makes not even sense ...
Please notice that utf8Size is given as an output parameter where the derived UTF-8 size is returned, not provided.
In your example, codepoints[1] of 0x0101 is Unicode Latin Small Letter A with Macron (ā), and its UTF-8 is the two-octet sequence C4 81, so the UTF-8 size for that one code point is indeed 2 bytes, and that is what you should find in text[0] and text[1].
In LoadFontEx(otfpath2, 48, codepoints, 1); you are loading U+0391, not U+0101. Is that a problem there? I don't see how that has anything to do with UTF-8.
PS: I don't see anything in ā or in the capital Greek letter Α that would cause any difficulties with respect to rasterization of the font.
&utf8Size is an inout, to be precise. Now that I have read rtext.c, I know what it does.
Let's pinpoint what is happening.
LoadFontEx(fontpath, 24, NULL, 0);
reads up to 95 glyphs, as written in rtext.c.
The majority of Unicode code points are outside the range of 0..95, aka 0x00..0x5F, which means any text containing code points higher than \u005F (and that is basically all of Unicode) only loads when explicit codepoints are given; that makes perfect sense only after reading rtext.c.
As I know now, Unicode glyph loading works only when the codepoints are known: we either give the codepoints table as an argument upfront, or parse the codepoints out of the text with LoadCodepoints().
So I found the following works fine,
const char *text = "\u0391\u0101"; // <-- 2-byte UTF-8 unicode notation
int count = 0;
int *codepoints = LoadCodepoints(text, &count);
Font font = LoadFontEx(fontpath, 24, codepoints, count);
which was proof that the font is not broken and that the source has an undetected vulnerability.
I first tried it with just two manually given codepoints, without parsing them from text, as written in the example above. This way I crashed raylib with a freeze: not just an error, the entire computer froze. So I went on seeking the error, and of course I did not exclude the possibility that the font could be broken. But knowing it is not broken, since I can load it perfectly in an iOS app, a Mac app, or on the web, that scenario is highly unlikely.
My earlier attempt with just one or two codepoints crashed at LoadFontEx(), and then froze without any report. So I might have found something that should not be excluded as a scenario.
Because if that is not closed, it is, frankly speaking, a vulnerability that can be exploited by simple text, which can and will contain Unicode letters. I was rather expecting failure to result in a ? glyph, or even "", but it froze.
crashing code:
int codepoints[] = { 0x0391, 0x0101 };
Font font = LoadFontEx(fontpath, 24, codepoints, 1);
So I tried
//int codepoints[] = { 0x0391, 0x0101 };
Font font = LoadFontEx(fontpath, 24, NULL, 8000); // forced reading
which does not crash but eats memory, of course.
Now that I have figured out a way to handle Unicode, the following test code ends in faulty malloc checksums. Here is the test code.
Font LoadFontWithCodepointsFromText(const char *path, int fontSize, const char *text) {
int count = 0;
int * codepoints = LoadCodepoints(text, &count);
Font result = LoadFontEx(path, fontSize, codepoints, count);
UnloadCodepoints(codepoints);
return result;
}
int main(int argc, const char * argv[]) {
float screenWidth = 800.0f;
float screenHeight = 450.0f;
//SetTraceLogLevel(LOG_NONE);
//SUPPORT_PARTIALBUSY_WAIT_LOOP OFF, fix 4% cpu load in idle
//SUPPORT_EVENTS_WAITING OFF, fix 2% cpu load in idle
SetConfigFlags(FLAG_WINDOW_RESIZABLE);
InitWindow(screenWidth, screenHeight, "Button Example");
const char* otfpath1 = "/Users/Name/Font-Regular.otf";
Font regular = LoadFontEx(otfpath1, 24, NULL, 0);
const char* otfpath2 = "/Users/Name/Font-BoldNumber-Regular.otf";
const char * text = "\u0101\u0391";
int fontSize = 24;
Font boldnumber = LoadFontWithCodepointsFromText(otfpath2,fontSize,text);
Image image = GenImageColor(512, 250, BLANK);
//ImageDrawTextEx(&image, boldnumber, text, (Vector2){ 1, 1 }, 12, 0, WHITE);
ImageDrawTextEx(&image, regular, "Hello World", (Vector2){ 0, 0 }, 24, 0, WHITE); // <---- Crash
ImageFlipVertical(&image);
Texture2D texture = LoadTextureFromImage(image);
UnloadImage(image);
RenderTexture2D backing = LoadRenderTexture(512, 250);
BeginTextureMode(backing);
ClearBackground(BLACK);
DrawTexture(texture, 0, 0, WHITE);
EndTextureMode();
//EnableEventWaiting(); //TODO: usleep() 4%cpu vs mach_wait_until 0%cpu
SetTargetFPS(24);
while (!WindowShouldClose()) {
BeginDrawing();
ClearBackground(BLANK);
DrawTexture(backing.texture, 0, 0, WHITE);
EndDrawing();
}
UnloadRenderTexture(backing);
UnloadTexture(texture);
UnloadFont(regular);
UnloadFont(boldnumber);
CloseWindow();
return 0;
}
Gives me the following:
...
INFO: FONT: Default font loaded successfully (224 glyphs)
INFO: FILEIO: [/Users/Name/Font-Regular.otf] File loaded successfully
INFO: TEXTURE: [ID 3] Texture loaded successfully (256x256 | GRAY_ALPHA | 1 mipmaps)
INFO: FONT: Data loaded successfully (24 pixel size | 95 glyphs)
INFO: FILEIO: [/Users/Name/Font-BoldNumber-Regular.otf] File loaded successfully
INFO: TEXTURE: [ID 4] Texture loaded successfully (256x128 | GRAY_ALPHA | 1 mipmaps)
INFO: FONT: Data loaded successfully (24 pixel size | 2 glyphs)
OctaRay(4293,0x1001865c0) malloc: Incorrect checksum for freed object 0x106058200: probably modified after being freed.
Corrupt value: 0xaaffff00000000
OctaRay(4293,0x1001865c0) malloc: *** set a breakpoint in malloc_error_break to debug
And now I think this might be related to the former unexpected behaviour. The fonts are fine, 100% for sure. It doesn't matter which of the two font variants I try to draw to the image with ImageDrawTextEx; the malloc fails.
Specifically weird because it works, then it doesn't. Just for proof, here are screenshots.
Second compile attempt, right after:
In particular, look at the memory footprint in the upper left; the second attempt reports only 17 MB, which I guess means the Unicode font loading leaves memory faulty or inaccessible. When I leave the Unicode variant out of the entire process, it works fine. The difference between them is that the Unicode glyph variant is larger than the line height, from the top down, and does not show up completely with raylib; only the line height is visible. Not directly unexpected when fonts get placed on textures. That's the only clue that could explain this.
Feedback: found the problem.
In rtext.c, function GenImageFontAtlas(), the fontSize is taken for granted; therefore rendering into an insufficiently sized image can result in memory corruption.
This means multiple functions did not account for the possibility that a font glyph can have an image height that does not fit into its atlas, so packing fails on such glyphs, and writing into memory that was not allocated likely corrupts the process. That was the reason why I got the malloc checksum errors.
Here is my quick fix, for sure not ideal as it utilises fontSize.
#define WORKAROUND
// Calculate image size based on total glyph width and glyph row count
int totalWidth = 0;
int maxGlyphWidth = 0;
WORKAROUND int maxGlyphHeight = 0;
for (int i = 0; i < glyphCount; i++)
{
if (glyphs[i].image.width > maxGlyphWidth) maxGlyphWidth = glyphs[i].image.width;
totalWidth += glyphs[i].image.width + 2*padding;
WORKAROUND if (glyphs[i].image.height > maxGlyphHeight) maxGlyphHeight = glyphs[i].image.height;
}
WORKAROUND if (fontSize<maxGlyphHeight) fontSize = maxGlyphHeight;
Accordingly, function MeasureTextEx(), which is used when ImageDrawTextEx() is processed, also always returns only fontSize as the evaluated height, which is likely easy to code with but not correct, because the function makes one expect to get the actual size of the text, rather than just the evaluated width and a non-evaluated fontSize as the height.
This means MeasureTextEx() could use an improvement as well.
Like so, perhaps:
if (letter != '\n')
{
if (font.glyphs[index].advanceX != 0) textWidth += font.glyphs[index].advanceX;
else textWidth += (font.recs[index].width + font.glyphs[index].offsetX);
WORKAROUND if (textHeight < font.glyphs[index].image.height) textHeight = font.glyphs[index].image.height;
}
Last issue: handing out a const char * string consisting solely of Unicode escapes can lead to a missing null terminator, depending on the IDE. When I forced my const char *text = "\u0391"; to have a null terminator no matter what, it never complained about its value. See where the problem comes from? Unicode can express control characters that trick the debugger and precompiler into weird states. The strings must be null-terminated (actually, one would expect they are by default via the IDE).
So I wrote a simplification for my case, which looks like:
char *nullTerminate(const char *inp) {
    if (inp == NULL) return NULL;      // check before dereferencing in strnlen
    size_t len = strnlen(inp, 255);
    char *txt = (char *)malloc(len + 1);
    if (txt == NULL) return NULL;
    txt[len] = '\0';
    while (len--) txt[len] = inp[len];
    return txt;
}
Font LoadFontWithCodepointsFromText(const char *path, int fontSize, const char *text) {
char *check = nullTerminate(text);
if (check!=NULL) {
int count = 0;
int * codepoints = LoadCodepoints(check, &count);
Font result = LoadFontEx(path, fontSize, codepoints, count);
UnloadCodepoints(codepoints);
free(check);
return result;
}
return GetFontDefault();
}
Now everything works: MeasureTextEx(), ImageDrawTextEx(), DrawTextEx(), as expected.
It is still visible that the texture size for the atlas before packing is not ideal, as it just takes the proportional attempt. But at least it is a starting point.
@designerfuzzi Wow! This is very interesting, I assumed the glyphs image height could not be bigger than requested font size... I'm opening a separate issue to review this: https://github.com/raysan5/raylib/issues/3858
In continuation of the issue: https://github.com/raysan5/raylib/issues/2022
If I understood correctly from this "issue", then the code:
should be displayed in the raylib window without any problems, and it is: everything is displayed perfectly.
BUT, this code does not want to be displayed correctly:
Please tell me, why?