Getting at notdef unicode codepoints

harfbuzz / harfbuzzjs

Providing HarfBuzz shaping library for client/server side JavaScript projects

https://harfbuzz.github.io/harfbuzzjs/


Getting at notdef unicode codepoints #70

Closed talltyler closed 1 year ago

talltyler commented 1 year ago

From my understanding, var result = buffer.json(font); should return an array of https://harfbuzz.github.io/harfbuzz-hb-buffer.html#hb_glyph_info_t but the current implementation returns only the clusters, without the original Unicode codepoints.

There is other unused data inside infos that I thought it might be in:

var infosPtr = exports.hb_buffer_get_glyph_infos(ptr, 0);
var infosPtr32 = infosPtr / 4;
var infos = heapu32.subarray(infosPtr32, infosPtr32 + 5 * length);

but none of this additional unreturned data seems to be what I'm looking for.

I'm looking to use the Unicode codepoints from the original buffer to handle fallbacks for glyphs that are not available in the font. Currently the glyph ID is returned as 0 in all of these cases, which isn't very helpful.

I'm open to alternative ways of handling this if you all have better ideas for ways to connect the original text with the shaped output.
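For reference, a minimal sketch of how the infos words read above map onto hb_glyph_info_t, assuming the usual layout of five 32-bit fields per glyph (codepoint, mask, cluster, var1, var2). The helper name decodeInfos is hypothetical and not part of harfbuzzjs. After shaping, the codepoint field holds the glyph ID rather than the original Unicode value, which is why the extra words do not contain the source characters.

```javascript
// Hypothetical helper, not part of harfbuzzjs.
// infos:  Uint32Array view over the hb_glyph_info_t records
// length: number of glyphs in the buffer
const WORDS_PER_INFO = 5; // assumed: codepoint, mask, cluster, var1, var2

function decodeInfos(infos, length) {
  const out = [];
  for (let i = 0; i < length; i++) {
    const base = i * WORDS_PER_INFO;
    out.push({
      codepoint: infos[base + 0], // glyph ID after shaping, Unicode code point before
      mask: infos[base + 1],
      cluster: infos[base + 2],
      // base + 3 and base + 4 (var1, var2) are private to HarfBuzz and skipped here
    });
  }
  return out;
}
```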

chearon commented 1 year ago

In the result array, each object has cl for the UTF-16 offset into the string and g for the glyph ID in the font. If the glyph ID is 0, you need to do fallback on the Unicode cluster it corresponds to. Maybe you were expecting cl to be a full code point value rather than an offset? If you post a full example I could be of better help.
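A minimal sketch of that fallback lookup, assuming cl values are UTF-16 offsets into the same string that was added to the buffer, and that a cluster runs from its own offset to the next distinct cluster offset (or to the end of the string). The function name missingClusters and the shape of its return value are hypothetical.

```javascript
// Hypothetical helper, not part of harfbuzzjs.
// text:   the original JavaScript string that was shaped
// glyphs: array like the one from buffer.json(), each item having g and cl
function missingClusters(text, glyphs) {
  // Distinct cluster starts in ascending order, so each cluster ends where
  // the next one begins. Sorting also covers output in right-to-left order.
  const starts = [...new Set(glyphs.map((glyph) => glyph.cl))].sort((a, b) => a - b);
  // Clusters that contain at least one .notdef glyph (glyph ID 0).
  const missing = new Set(glyphs.filter((glyph) => glyph.g === 0).map((glyph) => glyph.cl));

  const result = [];
  starts.forEach((start, i) => {
    if (!missing.has(start)) return;
    const end = i + 1 < starts.length ? starts[i + 1] : text.length;
    const chunk = text.slice(start, end);
    result.push({
      start,
      end,
      text: chunk,
      // Iterating a string yields whole code points, so surrogate pairs stay intact.
      codePoints: Array.from(chunk, (ch) => ch.codePointAt(0)),
    });
  });
  return result;
}
```

Each returned entry carries the substring and its code points, which can then be reshaped with a fallback font.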

talltyler commented 1 year ago

For now I've just been modifying the provided examples and trying to get strings with emojis to return something I can render. From my understanding, cl is the cluster index, which works fine for single-byte characters, but for characters like 👱🏽‍♂️ it all breaks down, because the cluster indexes become disconnected from the original buffer string once multibyte characters are involved. I've tried a few things to get your suggested approach to work, counting the byte length of different characters and mapping the cl values back to indexes in the original string, but there are a lot of edge cases. This doesn't feel like the correct path to take, and I just want to check whether there is another way to access this data.

chearon commented 1 year ago

If I drag Noto Color Emoji into the harfbuzzjs demo and type "a👱🏽‍♂️b" into the input, it gives me this array:

[
  {"g":0,"cl":0,"ax":1245,"ay":0,"dx":0,"dy":0,"flags":0},
  {"g":2539,"cl":1,"ax":1245,"ay":0,"dx":0,"dy":0,"flags":0},
  {"g":0,"cl":8,"ax":1245,"ay":0,"dx":0,"dy":0,"flags":0}
]

Since '👱🏽‍♂️'.length === 7, all of those indices are correct. What are the disconnected indices you're seeing?
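The arithmetic behind those offsets can be checked with a short sketch. It only uses standard JavaScript string APIs plus the global TextEncoder; the variable names are illustrative. It also shows why counting UTF-8 bytes gives different numbers from the UTF-16 offsets reported in cl.

```javascript
// The emoji written as escapes: 👱 + skin tone + ZWJ + ♂ + variation selector.
const emoji = "\u{1F471}\u{1F3FD}\u200D\u2642\uFE0F";

// One entry per code point, with the number of UTF-16 code units it occupies.
const units = Array.from(emoji, (ch) => ({
  codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  utf16Units: ch.length,
}));

const text = "a" + emoji + "b";

console.log(units.map((u) => u.utf16Units)); // [2, 2, 1, 1, 1]
console.log(emoji.length);                   // 7 UTF-16 code units
console.log(text.indexOf(emoji));            // 1, matches cl of the emoji glyph
console.log(text.indexOf("b"));              // 8, matches cl of the last glyph

// UTF-8 byte counts differ, so byte lengths cannot be used to index the string.
console.log(new TextEncoder().encode(emoji).length); // 17
```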