rougier / freetype-gl

OpenGL text using one vertex buffer, one texture and FreeType

texture_font_get_glyph should either take a uint32_t or return how many characters it "ate" #107

Open ec1oud opened 8 years ago

ec1oud commented 8 years ago

The examples which go through a char * UTF-8 string byte-by-byte (like lcd.c) are wrong, and the API is not conducive to fixing it. If you start with a UTF-8 string, you need to iterate by Unicode codepoints, not by bytes, but there is no function which tells you where the next one is. So, replacing wchar with char * wasn't a great move: this function's name tells you that its job is to get one glyph, not to decode the prefix of a UTF-8 string. And an application which wants to step through the Unicode characters will decode it before calling this function, so it doesn't make sense for the function to turn around and decode it again. It should take a pre-decoded Unicode character.
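To illustrate, a sketch of the failure mode (with placeholder `font` and `text`; this is not code from the examples):

/* Bytewise iteration, like the loops in the examples, only works while
   every character is a single byte. For non-ASCII text the second pass
   of this loop hands texture_font_get_glyph a pointer into the middle
   of a multi-byte sequence. */
const char *text = "é";   /* two bytes in UTF-8: 0xC3 0xA9 */
size_t i;
for( i = 0; i < strlen( text ); ++i )
{
    texture_font_get_glyph( font, text + i );   /* text + 1 is a continuation byte */
}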

My reason is that I want to look for formatting characters like tabs and newlines as I go. But that's not the only reason.

ec1oud commented 8 years ago

Also the API change broke my existing code, because this is C - the compiler is perfectly happy to pass an int (the decoded character) as a char *, and then it's an invalid pointer. I was using a version from a year or so ago, tried to upgrade, and boom.
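To illustrate, a hedged sketch of the kind of call that silently broke (`font` and the codepoint value are placeholders):

/* Against the old wchar_t API this passed the decoded character; against
   the new char * API the same call typically still compiles, with at most
   an "integer to pointer conversion" warning, and the value is then used
   as a bogus pointer at runtime. */
wchar_t decoded = L'\u3042';   /* あ */
texture_font_get_glyph( font, decoded );   /* undefined behaviour: invalid pointer */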

rougier commented 8 years ago

What do you mean when you say lcd.c is wrong?

adrianbroher commented 8 years ago

The examples which go through a char * UTF-8 string byte-by-byte (like lcd.c) are wrong, and the API is not conducive to fixing it. If you start with a UTF-8 string, you need to iterate by Unicode codepoints, not by bytes, but there is no function which tells you where the next one is.

The examples themselves are correct, as they don't contain Unicode codepoints above U+007F, so there is no need for special UTF-8 handling there. If you need the length of a single UTF-8 codepoint in bytes you can use utf8_surrogate_len, declared in utf8-utils.h.
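For what it's worth, a tiny sketch of that helper in use (values shown in the comments):

/* utf8_surrogate_len returns the byte length of the UTF-8 sequence that
   starts at the given pointer. */
size_t ascii_len = utf8_surrogate_len( "A" );    /* 1 */
size_t kanji_len = utf8_surrogate_len( "私" );   /* 3: U+79C1 takes three bytes in UTF-8 */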

So, replacing wchar with char * wasn't a great move […]

wchar_t isn't able to hold Unicode codepoints beyond the BMP on some targeted platforms, so it is unsuited to the task of passing the codepoint identity. Even uint32_t is unsuited if you take codepoint variants into account. So passing a UTF-8 encoded char pointer is the only suitable way of passing enough information into the function while keeping the interface simple and not breaking the API again in the future.
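As a small illustration of the width problem (assuming a Windows-style 16-bit wchar_t):

/* U+1F600 lies outside the BMP: as UTF-8 it is four bytes behind an
   ordinary char pointer, but with a 16-bit wchar_t it has to be split
   into a surrogate pair, so a single wchar_t cannot carry it. */
const char    *utf8 = "\xF0\x9F\x98\x80";   /* U+1F600 as UTF-8 */
const wchar_t *wide = L"\U0001F600";        /* two UTF-16 code units on such platforms */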

this function's name tells you that its job is to get one glyph, not to decode the prefix of a UTF-8 string.

Please be precise: what function are you talking about?

And an application which wants to step through the Unicode characters will decode it before calling this function, so it doesn't make sense for the function to turn around and decode it again. It should take a pre-decoded Unicode character.

Why should decoding be required in the first place? Unicode processing can be done without decoding the codepoints at all.
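Since the original motivation was spotting tabs and newlines, here is a sketch of what processing without decoding can look like (the `text` variable is assumed):

/* UTF-8 never reuses ASCII byte values inside multi-byte sequences, so
   control characters can be found with a plain byte scan and no decoding. */
const char *p;
for( p = text; *p != '\0'; ++p )
{
    if( *p == '\n' || *p == '\t' )
    {
        /* handle the line break or tab here */
    }
}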

I was using a version from a year or so ago, tried to upgrade, and boom.

I'm sorry that the code change caused trouble for you.

As you can see, this project doesn't have any versioning or changelog. The reasons for that are:

  1. As far as I understood @rougier from earlier discussions, the code wasn't intended to be a library but more of a research playground that some people happened to contribute to.
  2. I didn't want to introduce any versioning as long as the API is unsuited to be a library API. Currently it
    • uses function names that are too generic and could cause name collisions with other projects.
    • bleeds implementation internals through what could become the future public API.
    • lacks a uniform API design (or any design at all) for concepts like text shaping or the selection of the glyph renderer.
    • has some other problems, which aren't deal breakers but should still be fixed before declaring a future API final.
  3. When writing a changelog you need reference points to describe a change set against. As we don't have any versions, the only reference points available are commits or dates, and using those would result in nothing more than a copy of the commit history.

So introducing a versioning scheme in the current state of the code would force us to guarantee API or even ABI stability that we can't provide yet.

adrianbroher commented 8 years ago

Please be precise: what function are you talking about?

Scratch that, I ignored the title. Well, texture_font_get_glyph does exactly what it advertises: it returns the glyph for a Unicode codepoint (encoded as UTF-8), nothing more. It just happens that you can conveniently pass a pointer to the start of a UTF-8 encoded Unicode codepoint without doing any conversion beforehand (i.e. what you call decoding the prefix of a UTF-8 string).
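Assuming a texture_font_t * named `font`, that convenience looks like this:

/* The pointer can come straight from a string literal or from an offset
   into a larger UTF-8 string; no separate decoding step is needed. */
texture_glyph_t *glyph = texture_font_get_glyph( font, "é" );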

ec1oud commented 8 years ago

I think 32 bits ought to be enough for a "character" type, at least most of the time, but OK... edge cases are endless. Lemme guess, 64 bits isn't enough either?

How about a way to find out how many bytes it consumed in the process of rendering the next glyph then? E.g. take a char * argument and modify it to point to the next byte past the end of the glyph that it found, or take an int * and return the count of bytes consumed through it. That way a simple loop could get all glyphs in a string of unknown length, deal with control characters, and prepare the vertex buffer all at the same time.
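As a hedged sketch (the helper name is invented for illustration), the requested shape could be layered on top of the current API with utf8_surrogate_len:

/* Hypothetical wrapper, not part of freetype-gl: returns the glyph for the
   codepoint at `codepoint` and reports how many bytes it occupies, so a
   caller can step through a string of unknown length. */
texture_glyph_t *
get_glyph_and_advance( texture_font_t *font, const char *codepoint, size_t *consumed )
{
    *consumed = utf8_surrogate_len( codepoint );
    return texture_font_get_glyph( font, codepoint );
}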

lcd.c has only an English string so far, but bytewise iteration is not a good example of how to actually use the library.

adrianbroher commented 8 years ago

I think 32 bits ought to be enough for a "character" type, at least most of the time, but OK... edge cases are endless. Lemme guess, 64 bits isn't enough either?

The problem with using a uint32_t parameter is the missing context (you can't read the following character, which may modify the representation of the current one). Using a pointer to a uint32_t, however, would work because it provides context. But a plain uint32_t parameter also makes it unnecessarily hard to use string literals.

How about a way to find out how many bytes it consumed in the process of rendering the next glyph then?

You seem to misunderstand. texture_font_get_glyph doesn't consume anything.

The following code snippet probably does what you're trying to achieve.

// #include "freetype-gl.h" and "utf8-utils.h", and create a
// texture_font_t* instance `the_font` before that.
const char* the_string = "私はガラスを食べられます。 それは私を傷つけません";

// Advance by the byte length of each UTF-8 encoded codepoint instead of
// stepping one byte at a time.
for(size_t i = 0; the_string[i] != '\0'; i += utf8_surrogate_len(the_string + i))
{
    texture_glyph_t* the_glyph = texture_font_get_glyph(the_font, the_string + i);
    // further processing here
}