Unexpected results with characters beyond BMP

smuehlst commented 1 month ago

I'm using the icu-le-hb project with the ParagraphLayout class from ICU 75.1 and HarfBuzz 9.0.0, all built from source with Visual Studio 19.

While this mostly works as expected, I'm now running into a problem when I try to use the Noto Emoji font, where the expected glyph id is not produced for example for the U+1F602 smiley.

I know that HarfBuzz itself works correctly, as I can verify with the hb-shape and hb-view tools that the expected glyph is produced, e.g.:

hb-shape  --no-glyph-names --unicodes="U+01F602" NOTOEMOJI-REGULAR.TTF
[897=0+2600]

I verified that 897 is the expected glyph id for U+01F602. When HarfBuzz shapes the text in the context of my icu-le-hb build, I get glyph id 0.

So my first question is, how should I provide a standalone test case for icu-le-hb? As a modified copy of ICU's letest.cpp program?

behdad commented 1 month ago

If you provide a test program I can try to debug. Thanks

smuehlst commented 1 month ago

@behdad I asked the question because of the complicated multi-step build process to build ICU, icu-le-hb, ICU's layout engine and HarfBuzz according to https://unicode-org.github.io/icu/userguide/layoutengine/paragraph.html.

I have a custom automated build process for all the components and the letest.cpp program, but it won't work outside of my environment. That is why I asked about the letest.cpp program. Will you be able to build the test program if I provide a modified copy of the letest.cpp program and nothing else?

smuehlst commented 1 month ago

Below follows a minimal test case based on ICU's original letest.cpp.

The test uses HarfBuzz 9.0.0 built from source. The following Meson options are used when configuring the HarfBuzz build: -Dicu=disabled -Ddefault_library=static -Dbuildtype=debug.

Download the Noto Emoji font from https://fonts.google.com/noto/specimen/Noto+Emoji
Extract the NotoEmoji-Regular.ttf file from the ZIP file and save it to the directory from where the letest program is executed.
Build and run the letest program.

The output generated in function EmojiTest is this:

glyph id: 0 for U+1f602 (0, 0)
glyph id: 65535 IGNORED
glyph id: 170 for U+2764 (5.85938, 0)
final advance (21.0938, 0)

It is unexpected that glyph id 0 is produced for U+1f602 (provided in UTF-16 as 0xd83d, 0xde02). I would expect that glyph id 897 is produced, identically to what hb-shape generates.

Note that the output for the second test character U+2764 is as expected. Am I doing something wrong regarding Unicode values beyond the BMP?

/*
 * file name:  letest.cpp
 *
 * Derived from ICU's original letest.cpp program.
 * Test case for https://github.com/harfbuzz/icu-le-hb/issues/20
 */

#include "unicode/utypes.h"
#include "unicode/uclean.h"
#include "unicode/uchar.h"
#include "unicode/unistr.h"
#include "unicode/uscript.h"
#include "unicode/putil.h"
#include "unicode/ctest.h"
#include "unicode/utf16.h"

#include "layout/LETypes.h"
#include "layout/LEScripts.h"
#include "layout/LayoutEngine.h"

#include "layout/ParagraphLayout.h"
#include "layout/RunArrays.h"

#include "PortableFontInstance.h"
#include "SimpleFontInstance.h"

#include "letsutil.h"
#include "letest.h"

#include "putilimp.h" // for uprv_getUTCtime()

#include <stdlib.h>
#include <string.h>

#include <iostream>
#include <iomanip>

U_NAMESPACE_USE

U_CDECL_BEGIN
static void U_CALLCONV EmojiTest()
{
    LEErrorCode status = LE_NO_ERROR;

    char const fontfile_name[] = "NotoEmoji-Regular.ttf";

    // U+1F602, U+2764
    LEUnicode16 const emoji[] = {
        0xd83d, 0xde02, 0x2764
    };

    le_int32 const emoji_char_count = LE_ARRAY_SIZE(emoji);

    PortableFontInstance const font(fontfile_name, 12, status);
    if (LE_FAILURE(status))
    {
        std::cerr << "Instantiation of PortableFontInstance failed!" << std::endl;
    }
    else
    {
        float const lineWidth = 100;

        FontRuns fontRuns(0);

        fontRuns.add(&font, emoji_char_count);
        ParagraphLayout paragraphLayout(emoji, emoji_char_count, &fontRuns, nullptr, nullptr, nullptr, 0, false, status);

        if (LE_FAILURE(status))
        {
            std::cerr << "Instantiation of ParagraphLayout failed!" << std::endl;
        }
        else
        {
            le_int32 lineNumber = 1;
            paragraphLayout.reflow();
            const ParagraphLayout::Line *line;
            while ((line = paragraphLayout.nextLine(lineWidth)) != nullptr)
            {
                std::cout << "line " << lineNumber << std::endl;

                le_int32 runCount = line->countRuns();

                for (le_int32 run = 0; run < runCount; run += 1) {
                    const ParagraphLayout::VisualRun * const visualRun = line->getVisualRun(run);
                    le_int32 const glyphCount = visualRun->getGlyphCount();
                    const le_int32 * const glyphToCharMap = visualRun->getGlyphToCharMap();
                    LEGlyphID const * const glyphs = visualRun->getGlyphs();
                    float const * const positions = visualRun->getPositions();

                    for(le_int32 i = 0; i < glyphCount; i += 1) {
                        LEGlyphID const& g = glyphs[i];
                        std::cout << "glyph id: " << g << " ";

                        if (g == 0xFFFE || g == 0xFFFF)
                        {
                            std::cout << "IGNORED" << std::endl;
                        }
                        else
                        {
                            le_int32 ix = glyphToCharMap[i];

                            // Resolve potential surrogate pairs.
                            UChar32 u32_char;
                            U16_NEXT(emoji, ix, emoji_char_count, u32_char);

                            std::cout << "for U+" << std::setw(4) << std::hex << u32_char << std::dec
                                << " (" << positions[i * 2] << ", " << positions[i * 2 + 1] << ")" << std::endl;
                        }
                    }
                    std::cout << "final advance ("
                        << positions[glyphCount * 2] << ", " << positions[glyphCount * 2 + 1] << ")" << std::endl;
                }

                lineNumber += 1;
            }
        }
    }
}
U_CDECL_END

static void addAllTests(TestNode **root)
{
    addTest(root, &EmojiTest, "paragraph/EmojiTest");

#ifndef USING_ICULEHB
    addCTests(root);
#endif
}

/* returns the path to icu/source/data/out */
static const char *ctest_dataOutDir()
{
    static const char *dataOutDir = nullptr;

    if(dataOutDir) {
        return dataOutDir;
    }

    /* U_TOPBUILDDIR is set by the makefiles on UNIXes when building cintltst and intltst
    //              to point to the top of the build hierarchy, which may or
    //              may not be the same as the source directory, depending on
    //              the configure options used.  At any rate,
    //              set the data path to the built data from this directory.
    //              The value is complete with quotes, so it can be used
    //              as-is as a string constant.
    */
#if defined (U_TOPBUILDDIR)
    {
        dataOutDir = U_TOPBUILDDIR "data" U_FILE_SEP_STRING "out" U_FILE_SEP_STRING;
    }
#else

    /* On Windows, the file name obtained from __FILE__ includes a full path.
     *             This file is "wherever\icu\source\test\cintltst\cintltst.c"
     *             Change to    "wherever\icu\source\data"
     */
    {
        static char p[sizeof(__FILE__) + 20];
        char *pBackSlash;
        int i;

        strcpy(p, __FILE__);
        /* We want to back over three '\' chars.                            */
        /*   Only Windows should end up here, so looking for '\' is safe.   */
        for (i=1; i<=3; i++) {
            pBackSlash = strrchr(p, U_FILE_SEP_CHAR);
            if (pBackSlash != nullptr) {
                *pBackSlash = 0;        /* Truncate the string at the '\'   */
            }
        }

        if (pBackSlash != nullptr) {
            /* We found and truncated three names from the path.
             *  Now append "source\data" and set the environment
             */
            strcpy(pBackSlash, U_FILE_SEP_STRING "data" U_FILE_SEP_STRING "out" U_FILE_SEP_STRING);
            dataOutDir = p;
        }
        else {
            /* __FILE__ on MSVC7 does not contain the directory */
            FILE *file = fopen(".." U_FILE_SEP_STRING ".." U_FILE_SEP_STRING "data" U_FILE_SEP_STRING "Makefile.in", "r");
            if (file) {
                fclose(file);
                dataOutDir = ".." U_FILE_SEP_STRING ".." U_FILE_SEP_STRING "data" U_FILE_SEP_STRING "out" U_FILE_SEP_STRING;
            }
            else {
                dataOutDir = ".." U_FILE_SEP_STRING".." U_FILE_SEP_STRING".." U_FILE_SEP_STRING "data" U_FILE_SEP_STRING "out" U_FILE_SEP_STRING;
            }
        }
    }
#endif

    return dataOutDir;
}

/*  ctest_setICU_DATA  - if the ICU_DATA environment variable is not already
 *                       set, try to deduce the directory in which ICU was built,
 *                       and set ICU_DATA to "icu/source/data" in that location.
 *                       The intent is to allow the tests to have a good chance
 *                       of running without requiring that the user manually set
 *                       ICU_DATA.  Common data isn't a problem, since it is
 *                       picked up via a static (build time) reference, but the
 *                       tests dynamically load some data.
 */
static void ctest_setICU_DATA() {

    /* No location for the data dir was identifiable.
     *   Add other fallbacks for the test data location here if the need arises
     */
    if (getenv("ICU_DATA") == nullptr) {
        /* If ICU_DATA isn't set, set it to the usual location */
        u_setDataDirectory(ctest_dataOutDir());
    }
}

int main(int argc, char* argv[])
{
    int32_t nerrors = 0;
    TestNode *root = nullptr;
    UErrorCode errorCode = U_ZERO_ERROR;
    UDate startTime, endTime;
    int32_t diffTime;

    startTime = uprv_getUTCtime();

    if (!initArgs(argc, argv, nullptr, nullptr)) {
        /* Error already displayed. */
        return -1;
    }

    /* Check whether ICU will initialize without forcing the build data directory into
    *  the ICU_DATA path.  Success here means either the data dll contains data, or that
    *  this test program was run with ICU_DATA set externally.  Failure of this check
    *  is normal when ICU data is not packaged into a shared library.
    *
    *  Whether or not this test succeeds, we want to cleanup and reinitialize
    *  with a data path so that data loading from individual files can be tested.
    */
    u_init(&errorCode);

    if (U_FAILURE(errorCode)) {
        fprintf(stderr,
            "#### Note:  ICU Init without build-specific setDataDirectory() failed.\n");
    }

    u_cleanup();
    errorCode = U_ZERO_ERROR;

    if (!initArgs(argc, argv, nullptr, nullptr)) {
        /* Error already displayed. */
        return -1;
    }
/* Initialize ICU */
    ctest_setICU_DATA();    /* u_setDataDirectory() must happen Before u_init() */
    u_init(&errorCode);

    if (U_FAILURE(errorCode)) {
        fprintf(stderr,
            "#### ERROR! %s: u_init() failed with status = \"%s\".\n" 
            "*** Check the ICU_DATA environment variable and \n"
            "*** check that the data files are present.\n", argv[0], u_errorName(errorCode));
        return 1;
    }

    addAllTests(&root);
    nerrors = runTestRequest(root, argc, argv);

    cleanUpTestTree(root);
    u_cleanup();

    endTime = uprv_getUTCtime();
    diffTime = (int32_t)(endTime - startTime);
    printf("Elapsed Time: %02d:%02d:%02d.%03d\n",
        (int)((diffTime%U_MILLIS_PER_DAY)/U_MILLIS_PER_HOUR),
        (int)((diffTime%U_MILLIS_PER_HOUR)/U_MILLIS_PER_MINUTE),
        (int)((diffTime%U_MILLIS_PER_MINUTE)/U_MILLIS_PER_SECOND),
        (int)(diffTime%U_MILLIS_PER_SECOND));

    return nerrors;
}

behdad commented 1 month ago

Thanks. I'll try to look into this next week.

behdad commented 1 month ago

Note that the output for the second test character U+2764 is as expected. Am I doing something wrong regarding Unicode values beyond the BMP?

Probably ours, not yours. But yeah that was gonna be my next question. Does anything beyond BMP work at all?

behdad commented 1 month ago

I tried to build but is too much work as you suggested.

Any chance you can dig into icu-le-hb and see what we get into the hb_buffer_t after the hb_buffer_add_utf16 call?

smuehlst commented 1 month ago

Any chance you can dig into icu-le-hb and see what we get into the hb_buffer_t after the hb_buffer_add_utf16 call?

Sure, I will do that on Monday.

smuehlst commented 1 month ago

Any chance you can dig into icu-le-hb and see what we get into the hb_buffer_t after the hb_buffer_add_utf16 call?

There are two calls to hb_buffer_add_utf16() in LayoutEngine::layoutChars() before the call to hb_shape():

https://github.com/harfbuzz/icu-le-hb/blob/784717d90dd90aa54518169c4db4ed4628705663/src/LayoutEngine.cpp#L272-L273

I don't know what is actually relevant in the hb_buffer_t data structure, therefore this is a copy of the whole data structure dumped from the VS Code debugger after the second call to hb_buffer_add_utf16(). I didn't find a way to preserve the indentation when copying the data, so I hope you can still make sense of it:

*this->fHbBuffer
{header={ref_count={ref_count={...} } writable={v=1 } user_data={v=0x0000000000000000 {...} } } unicode=...}
header: {ref_count={ref_count={v=1 } } writable={v=1 } user_data={v=0x0000000000000000 <NULL> } }
ref_count: {ref_count={v=1 } }
ref_count: {v=1 }
v: 1
writable: {v=1 }
v: 1
user_data: {v=0x0000000000000000 <NULL> }
v: 0x0000000000000000 <NULL>
lock: <struct at NULL>
m: 0x0000000000000000 <NULL>
items: {items={allocated=??? length=??? arrayZ=??? } }
items: {allocated=??? length=??? arrayZ=??? }
allocated: <Unable to read memory>
length: <Unable to read memory>
arrayZ: <Unable to read memory>
unicode: 0x000002e5f1528370 {header={ref_count={ref_count={...} } writable={v=0 } user_data={v=0x0000000000000000 {...} } } ...}
header: {ref_count={ref_count={v=2 } } writable={v=0 } user_data={v=0x0000000000000000 <NULL> } }
ref_count: {ref_count={v=2 } }
ref_count: {v=2 }
v: 2
writable: {v=0 }
v: 0
user_data: {v=0x0000000000000000 <NULL> }
v: 0x0000000000000000 <NULL>
parent: 0x00007ff97c3bc640 {icu-le-hb.dll!hb_unicode_funcs_t _hb_Null_hb_unicode_funcs_t} {header={ref_count=...} ...}
header: {ref_count={ref_count={v=0 } } writable={v=0 } user_data={v=0x0000000000000000 <NULL> } }
ref_count: {ref_count={v=0 } }
ref_count: {v=0 }
v: 0
writable: {v=0 }
v: 0
user_data: {v=0x0000000000000000 <NULL> }
v: 0x0000000000000000 <NULL>
lock: <struct at NULL>
m: 0x0000000000000000 <NULL>
items: {items={allocated=??? length=??? arrayZ=??? } }
parent: 0x0000000000000000 <NULL>
func: {combining_class=0x00007ff97c1a91a0 {icu-le-hb.dll!hb_unicode_combining_class_nil(hb_unicode_funcs_t *, unsigned int, void *)} ...}
combining_class: 0x00007ff97c1a91a0 {icu-le-hb.dll!hb_unicode_combining_class_nil(hb_unicode_funcs_t *, unsigned int, void *)}
eastasian_width: 0x00007ff97c1a91c0 {icu-le-hb.dll!hb_unicode_eastasian_width_nil(hb_unicode_funcs_t *, unsigned int, void *)}
general_category: 0x00007ff97c1a91e0 {icu-le-hb.dll!hb_unicode_general_category_nil(hb_unicode_funcs_t *, unsigned int, void *)}
mirroring: 0x00007ff97c1a9200 {icu-le-hb.dll!hb_unicode_mirroring_nil(hb_unicode_funcs_t *, unsigned int, void *)}
script: 0x00007ff97c1a9220 {icu-le-hb.dll!hb_unicode_script_nil(hb_unicode_funcs_t *, unsigned int, void *)}
compose: 0x00007ff97c1a9240 {icu-le-hb.dll!hb_unicode_compose_nil(hb_unicode_funcs_t *, unsigned int, unsigned int, unsigned int *, void *)}
decompose: 0x00007ff97c1a9260 {icu-le-hb.dll!hb_unicode_decompose_nil(hb_unicode_funcs_t *, unsigned int, unsigned int *, unsigned int *, void *)}
decompose_compatibility: 0x00007ff97c1a9280 {icu-le-hb.dll!hb_unicode_decompose_compatibility_nil(hb_unicode_funcs_t *, unsigned int, unsigned int *, void *)}
user_data: {combining_class=0x0000000000000000 eastasian_width=0x0000000000000000 general_category=0x0000000000000000 ...}
combining_class: 0x0000000000000000
eastasian_width: 0x0000000000000000
general_category: 0x0000000000000000
mirroring: 0x0000000000000000
script: 0x0000000000000000
compose: 0x0000000000000000
decompose: 0x0000000000000000
decompose_compatibility: 0x0000000000000000
destroy: {combining_class=0x0000000000000000 eastasian_width=0x0000000000000000 general_category=0x0000000000000000 ...}
combining_class: 0x0000000000000000
eastasian_width: 0x0000000000000000
general_category: 0x0000000000000000
mirroring: 0x0000000000000000
script: 0x0000000000000000
compose: 0x0000000000000000
decompose: 0x0000000000000000
decompose_compatibility: 0x0000000000000000
func: {combining_class=0x00007ff97c2fb120 {icu-le-hb.dll!hb_ucd_combining_class(hb_unicode_funcs_t *, unsigned int, void *)} ...}
combining_class: 0x00007ff97c2fb120 {icu-le-hb.dll!hb_ucd_combining_class(hb_unicode_funcs_t *, unsigned int, void *)}
eastasian_width: 0x00007ff97c1a91c0 {icu-le-hb.dll!hb_unicode_eastasian_width_nil(hb_unicode_funcs_t *, unsigned int, void *)}
general_category: 0x00007ff97c2fb150 {icu-le-hb.dll!hb_ucd_general_category(hb_unicode_funcs_t *, unsigned int, void *)}
mirroring: 0x00007ff97c2fb180 {icu-le-hb.dll!hb_ucd_mirroring(hb_unicode_funcs_t *, unsigned int, void *)}
script: 0x00007ff97c2fb1b0 {icu-le-hb.dll!hb_ucd_script(hb_unicode_funcs_t *, unsigned int, void *)}
compose: 0x00007ff97c2fb4a0 {icu-le-hb.dll!hb_ucd_compose(hb_unicode_funcs_t *, unsigned int, unsigned int, unsigned int *, void *)}
decompose: 0x00007ff97c2fb770 {icu-le-hb.dll!hb_ucd_decompose(hb_unicode_funcs_t *, unsigned int, unsigned int *, unsigned int *, void *)}
decompose_compatibility: 0x00007ff97c1a9280 {icu-le-hb.dll!hb_unicode_decompose_compatibility_nil(hb_unicode_funcs_t *, unsigned int, unsigned int *, void *)}
user_data: {combining_class=0x0000000000000000 eastasian_width=0x0000000000000000 general_category=0x0000000000000000 ...}
combining_class: 0x0000000000000000
eastasian_width: 0x0000000000000000
general_category: 0x0000000000000000
mirroring: 0x0000000000000000
script: 0x0000000000000000
compose: 0x0000000000000000
decompose: 0x0000000000000000
decompose_compatibility: 0x0000000000000000
destroy: {combining_class=0x0000000000000000 eastasian_width=0x0000000000000000 general_category=0x0000000000000000 ...}
combining_class: 0x0000000000000000
eastasian_width: 0x0000000000000000
general_category: 0x0000000000000000
mirroring: 0x0000000000000000
script: 0x0000000000000000
compose: 0x0000000000000000
decompose: 0x0000000000000000
decompose_compatibility: 0x0000000000000000
flags: 3
cluster_level: HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS (1)
replacement: 65533
invisible: 0
not_found: 0
content_type: HB_BUFFER_CONTENT_TYPE_UNICODE (1)
props: {direction=HB_DIRECTION_LTR (4) script=HB_SCRIPT_COMMON (1517910393) language=0x000002e5f151d100 {s=...} ...}
direction: HB_DIRECTION_LTR (4)
script: HB_SCRIPT_COMMON (1517910393)
language: 0x000002e5f151d100 {s=0x000002e5f151d100 "de" }
s: 0x000002e5f151d100 "de"
reserved1: 0x0000000000000000
reserved2: 0x0000000000000000
successful: true
shaping_failed: false
have_output: false
have_positions: false
idx: 0
len: 2
out_len: 0
allocated: 32
info: 0x000002e5f1544c70 {codepoint=128514 mask=0 cluster=0 ...}
codepoint: 128514
mask: 0
cluster: 0
var1: {u32=0 i32=0 u16=0x000002e5f1544c7c {0, 0} ...}
u32: 0
i32: 0
u16: 0x000002e5f1544c7c {0, 0}
i16: 0x000002e5f1544c7c {0, 0}
u8: 0x000002e5f1544c7c ""
i8: 0x000002e5f1544c7c ""
var2: {u32=0 i32=0 u16=0x000002e5f1544c80 {0, 0} ...}
u32: 0
i32: 0
u16: 0x000002e5f1544c80 {0, 0}
i16: 0x000002e5f1544c80 {0, 0}
u8: 0x000002e5f1544c80 ""
i8: 0x000002e5f1544c80 ""
out_info: 0x000002e5f1544c70 {codepoint=128514 mask=0 cluster=0 ...}
codepoint: 128514
mask: 0
cluster: 0
var1: {u32=0 i32=0 u16=0x000002e5f1544c7c {0, 0} ...}
u32: 0
i32: 0
u16: 0x000002e5f1544c7c {0, 0}
i16: 0x000002e5f1544c7c {0, 0}
u8: 0x000002e5f1544c7c ""
i8: 0x000002e5f1544c7c ""
var2: {u32=0 i32=0 u16=0x000002e5f1544c80 {0, 0} ...}
u32: 0
i32: 0
u16: 0x000002e5f1544c80 {0, 0}
i16: 0x000002e5f1544c80 {0, 0}
u8: 0x000002e5f1544c80 ""
i8: 0x000002e5f1544c80 ""
pos: 0x000002e5f15449b0 {x_advance=-842150451 y_advance=-842150451 x_offset=-842150451 ...}
x_advance: -842150451
y_advance: -842150451
x_offset: -842150451
y_offset: -842150451
var: {u32=3452816845 i32=-842150451 u16=0x000002e5f15449c0 {52685, 52685} ...}
u32: 3452816845
i32: -842150451
u16: 0x000002e5f15449c0 {52685, 52685}
i16: 0x000002e5f15449c0 {-12851, -12851}
u8: 0x000002e5f15449c0 "ÍÍÍÍ...
i8: 0x000002e5f15449c0 "ÍÍÍÍ...
context: 0x000002e5f15241f0 {0x000002e5f15241f0 {0, 0, 0, 0, 0}, 0x000002e5f1524204 {128514, 10084, 0, 0, 0}}
[0]: 0x000002e5f15241f0 {0, 0, 0, 0, 0}
[1]: 0x000002e5f1524204 {128514, 10084, 0, 0, 0}
context_len: 0x000002e5f1524218 {0, 0}
[0]: 0
[1]: 0
allocated_var_bits: 0 '\0'
serial: 0 '\0'
random_state: 1
scratch_flags: HB_BUFFER_SCRATCH_FLAG_DEFAULT (0)
max_len: 1073741823
max_ops: 536870911
message_func: 0x0000000000000000
message_data: 0x0000000000000000
message_destroy: 0x0000000000000000
message_depth: 0

smuehlst commented 1 month ago

Does anything beyond BMP work at all?

As a second test I tried the Noto Sans Linear B font (https://fonts.google.com/noto/specimen/Noto+Sans+Linear+B) with characters U+10000 and U+10001 (0xD800, 0xDC00, 0xD800, 0xDC01 as UTF-16 in the test program). This also results in glyph id 0 being produced for both characters.

So up to now I didn't see anything working as expected beyond the BMP.

smuehlst commented 1 month ago

I think I have found the problem. icu-le-hb calls ICU's PortableFontInstance::mapCharToGlyph() to get the glyph id for a Unicode value, and that calls CMAPFormat4Mapper::unicodeToGlyph(), where characters beyond the BMP are unconditionally mapped to glyph id 0:

https://github.com/unicode-org/icu/blob/94c309d9a646521bbed7b9ee99b768097fcea622/icu4c/source/test/letest/cmaps.cpp#L169-L175

LEGlyphID CMAPFormat4Mapper::unicodeToGlyph(LEUnicode32 unicode32) const
{
    if (unicode32 >= 0x10000) {
        return 0;
    }

    LEUnicode16 unicode = static_cast<LEUnicode16>(unicode32);

Obviously this is not a problem in icu-le-hb, but in ICU's letest application. I'm sorry for suspecting a problem in icu-le-hb initially.

What is unclear to me is why the ICU code does the glyph lookup. Couldn't that be done by HarfBuzz itself, as HarfBuzz standalone of course is able to map the Unicode values to glyphs correctly? But I guess this has historical reasons when icu-le-hb was implemented as a drop-in replacement for ICU's layout engine.

smuehlst commented 1 month ago

Couldn't that be done by HarfBuzz itself, as HarfBuzz standalone of course is able to map the Unicode values to glyphs correctly?

To ask a better question: Could I use the HarfBuzz API to implement the Unicode to glyph id lookup myself and replace the current glyph lookup implementation of the letest application with that? I'm looking for example at the hb-face function hb_face_collect_nominal_glyph_mapping().

smuehlst commented 1 month ago

To ask a better question: Could I use the HarfBuzz API to implement the Unicode to glyph id lookup myself...

This is actually very simple by using the function hb_font_get_nominal_glyph().

I now get the expected results also for non-BMP characters, so the problem is solved. Sorry again for opening this issue against the wrong component, and thanks for looking at the problem.

behdad commented 1 month ago

Thanks for digging into this. So, where does the fix need to go?

behdad commented 1 month ago

What happens if you completely remove this line:

hb_font_set_funcs (fHbFont, icu_le_hb_get_font_funcs (), (void *) fontInstance, NULL);

from LayoutEngine.cpp?

smuehlst commented 1 month ago

So, where does the fix need to go?

In a discussion in the HarfBuzz repository you had earlier advised me that I need to load the font another time if I want to use the hb-draw api:

https://github.com/harfbuzz/harfbuzz/discussions/4819#discussioncomment-10205061

As a consequence of that I extended my copy of the ICU class PortableFontInstance with code to load the font another time, and I have a new member hb_font_t *fHbFont; in that class now.

The fixfor me regarding mapping of the non-BMP characters to glyph ids is a change in the method LEGlyphID PortableFontInstance::mapCharToGlyph(LEUnicode32 ch) const. Originally it looks like this:

https://github.com/unicode-org/icu/blob/9270216cb73faa128a9d77a14175fc4a164747ad/icu4c/source/test/letest/PortableFontInstance.cpp#L453-L456

LEGlyphID PortableFontInstance::mapCharToGlyph(LEUnicode32 ch) const
{
    return fCMAPMapper->unicodeToGlyph(ch);
}

Now it looks like this:

LEGlyphID PortableFontInstance::mapCharToGlyph(LEUnicode32 ch) const
{
    hb_codepoint_t glyph_id = 0;
    LEGlyphID result = 0;
    if (hb_font_get_nominal_glyph(fHbFont, static_cast<hb_codepoint_t>(ch), &glyph_id))
    {
        result = static_cast<LEGlyphID>(glyph_id);
    }

    return result;
}

After that change the ICU class CMAPMapper and its derived classes are not needed anymore. I removed the member CMAPMapper *fCMAPMapper; from my copy of the PortableFontInstance class and I removed the following source files from my build completely:

https://github.com/unicode-org/icu/blob/main/icu4c/source/test/letest/cmaps.h https://github.com/unicode-org/icu/blob/main/icu4c/source/test/letest/cmaps.cpp

I think this change is not a fix that can be applied in general to ICU's letest implementation. ICU's letest implementation is more a demo application for the ParagraphLayout class. It was my mistake that I thought it could be used more or less unchanged as production code. Although ICU strongly recommends to use icu-le-hb as a replacement for the old ICU layout engine, it probably wants to keep the code in the layoutex module agnostic of the actual layout engine.

smuehlst commented 1 month ago

Another question that I have now is whether it is not possible to implement all the virtual methods of the LEFontInstance class by means of HarfBuzz's hb_font_t functions. If that is possible, it would be highly useful if icu-le-hb would bring a corresponding class derived from LEFontInstance with it, e.g. a class HbFontInstance.

Do the hb_font_t functions have all the functionality for the needs of the LEFontInstance interface? If you can give me directions on that, I might be able to implement that and to contribute it.

smuehlst commented 1 month ago

What happens if you completely remove this line:
hb_font_set_funcs (fHbFont, icu_le_hb_get_font_funcs (), (void *) fontInstance, NULL);
from LayoutEngine.cpp?

If I go back to the state without my modification of PortableFontInstance::mapCharToGlyph() and disable the call to hb_font_set_funcs (), then the Unicode character U+1f602 can be mapped successfully to glyph id 897 in the test with the Noto Emoji font.

But other tests in the letest program fail then:

   DataDrivenTest   {
!!   Test Ghita: incorrect x position for glyph 1: expected 8.495850, got 8.496094
!!   Test Arabic: incorrect x position for glyph 1: expected 5.831787, got 5.832031
!!   Test Unicode Arabic: incorrect x position for glyph 1: expected 5.831787, got 5.832031
!!   Test Thai: incorrect x position for glyph 1: expected 7.247803, got 7.248047
!!   Test Arabic Simple: incorrect x position for glyph 1: expected 5.831787, got 5.832031
!!   Test Matra Test: incorrect x position for glyph 3: expected 6.119873, got 6.120117
!!   Test Deva Stress Test: incorrect x position for glyph 2: expected 9.215820, got 9.216064
!!   Test Deva Test: incorrect x position for glyph 3: expected 12.443848, got 12.444092
!!   Test Deva locl Hindi: incorrect x position for glyph 1: expected 7.571777, got 7.572021
!!   Test Deva locl Marathi: incorrect x position for glyph 1: expected 7.391846, got 7.392090
!!   Test Deva ZWJ: incorrect x position for glyph 1: expected 8.603760, got 8.604004
!!   Test Hebrew Mark Test: incorrect x position for glyph 2: expected 6.011719, got 6.011963
!!   Test Not Language Specific: incorrect x position for glyph 1: expected 6.671875, got 6.672119
!!   Test Romanian Language Specific: incorrect x position for glyph 1: expected 6.671875, got 6.672119
!!   Test Nafees Nastaleeq Cursive Positioning Test: incorrect x position for glyph 3: expected 17.999756, got 18.000000
!!   Test Malayalam Crash Test 9948: incorrect x position for glyph 2: expected 21.419678, got 21.419922
!!   Test Malayalam Crash II: incorrect x position for glyph 3: expected 21.263672, got 21.263916
!!   Test Malayalam Samvruthokaram Test: incorrect x position for glyph 1: expected 16.415771, got 16.416016
!!   Test Angsana New Mark Test: incorrect x position for glyph 1: expected 7.247803, got 7.248047
!!   Test Arabic Presentation Forms LRO Test: incorrect x position for glyph 2: expected 8.831787, got 8.832031
!!   Test Arabic Presentation Forms No LRO Test: incorrect x position for glyph 1: expected 3.227783, got 3.228027
   } ---[21 ERRORS in /layout/DataDrivenTest]  (34ms)

behdad commented 1 month ago

What if you just remove hb_font_funcs_set_nominal_glyph_func?

I can change icu-le-hb to try using the HB funcs if the LEFontInstance fails to map the character. That would fix your case.

behdad commented 1 month ago

What if you just remove hb_font_funcs_set_nominal_glyph_func?

I can change icu-le-hb to try using the HB funcs if the LEFontInstance fails to map the character. That would fix your case.

Like this:

diff --git a/src/LayoutEngine.cpp b/src/LayoutEngine.cpp
index 73f717c..83fe5b7 100644
--- a/src/LayoutEngine.cpp
+++ b/src/LayoutEngine.cpp
@@ -65,6 +65,10 @@ icu_le_hb_font_get_glyph (hb_font_t *font,
     const LEFontInstance *fontInstance = (const LEFontInstance *) font_data;

     *glyph = fontInstance->mapCharToGlyph (unicode);
+
+    if (!*glyph)
+      return hb_font_get_nominal_glyph (font, unicode, glyph);
+
     return !!*glyph;
 }

behdad commented 1 month ago

But other tests in the letest program fail then:

These all seem to be noise-level differences in the advance... I wonder if we should just always use HB's own implementation and use LEFontInstance just for loading tables. That would definitely simplify the users' life.

smuehlst commented 1 month ago

What if you just remove hb_font_funcs_set_nominal_glyph_func?

I will try that tomorrow at work.

I wonder if we should just always use HB's own implementation and use LEFontInstance just for loading tables. That would definitely simplify the users' life.

That is what I tried to suggest in my comment above with a class HbFontInstance that implements all the necessary virtual methods of LEFontInstance by means of HarfBuzz: https://github.com/harfbuzz/icu-le-hb/issues/20#issuecomment-2342895619

behdad commented 1 month ago

That is what I tried to suggest in my comment above with a class HbFontInstance that implements all the necessary virtual methods of LEFontInstance by means of HarfBuzz: #20 (comment)

No need for that even. We can just turn to HarfBuzz by removing the set_font_funcs completely...

smuehlst commented 1 month ago

What if you just remove hb_font_funcs_set_nominal_glyph_func?

I went back to the state before my fix described in https://github.com/harfbuzz/icu-le-hb/issues/20#issuecomment-2342865094 and removed the following call from icu_le_hb_get_font_funcs() in LayoutEngine.cpp:

hb_font_funcs_set_nominal_glyph_func (f, icu_le_hb_font_get_glyph, NULL, NULL);

After that mapping Unicode values to glyph ids is consistently broken. All Unicode characters are mapped to glyph id 0.

smuehlst commented 1 month ago

What if you just remove hb_font_funcs_set_nominal_glyph_func?

I can change icu-le-hb to try using the HB funcs if the LEFontInstance fails to map the character. That would fix your case.

When I apply the patch from https://github.com/harfbuzz/icu-le-hb/issues/20#issuecomment-2344539639 in addition to the removal of the call to hb_font_funcs_set_nominal_glyph_func(), the mapping is still broken. Because the call to hb_font_funcs_set_nominal_glyph_func (f, icu_le_hb_font_get_glyph, NULL, NULL); was disabled, icu_le_hb_font_get_glyph() will never be called anyway, so the patch has no effect.

So I assume that you maybe did mean to apply the patch from https://github.com/harfbuzz/icu-le-hb/issues/20#issuecomment-2344539639 without disabling the call to hb_font_funcs_set_nominal_glyph_func (f, icu_le_hb_font_get_glyph, NULL, NULL);?

If I do that it causes a stack overflow because of an endless recursion in one of the test cases in letest.cpp::ParamTest():

https://github.com/unicode-org/icu/blob/752da7303fa2e55a5f358400674bb3c00429b83a/icu4c/source/test/letest/letest.cpp#L185

    glyphCount = engine->layoutChars(chars, 8, 6, 20, true, 0.0, 0.0, status);

smuehlst commented 1 month ago

That is what I tried to suggest in my comment above with a class HbFontInstance that implements all the necessary virtual methods of LEFontInstance by means of HarfBuzz: #20 (comment)

No need for that even. We can just turn to HarfBuzz by removing the set_font_funcs completely...

That would be my preferred solution then. Would that mean that I could get rid of all the code related to font parsing and font table caching in the original letest application?

behdad commented 1 month ago

That is what I tried to suggest in my comment above with a class HbFontInstance that implements all the necessary virtual methods of LEFontInstance by means of HarfBuzz: #20 (comment)

No need for that even. We can just turn to HarfBuzz by removing the set_font_funcs completely...

That would be my preferred solution then. Would that mean that I could get rid of all the code related to font parsing and font table caching in the original letest application?

You still need the font table caching, but all other font parsing code can go. Do you want to give that a try?

smuehlst commented 1 month ago

You still need the font table caching, but all other font parsing code can go. Do you want to give that a try?

Yes.

Are any other changes in icu-le-hb necessary apart from removing the call to hb_font_set_funcs () from LayoutEngine.cpp?

behdad commented 1 month ago

Are any other changes in icu-le-hb necessary apart from removing the call to hb_font_set_funcs () from LayoutEngine.cpp?

I think that should be it.

smuehlst commented 1 month ago

Are any other changes in icu-le-hb necessary apart from removing the call to hb_font_set_funcs () from LayoutEngine.cpp?

I think that should be it.

Ok, I will try that tomorow then, thanks!

smuehlst commented 1 month ago

Are any other changes in icu-le-hb necessary apart from removing the call to hb_font_set_funcs () from LayoutEngine.cpp?

I think that should be it.

I removed the call to hb_font_set_funcs () from LayoutEngine.cpp together with the related functions that are referenced from there.

This is working fine now. As a consequence I was also able to remove most of the font parsing code from PortableFontInstance.

behdad commented 1 month ago

@khaledhosny Any chance you can make a release of this?

khaledhosny commented 1 month ago

@khaledhosny Any chance you can make a release of this?

https://github.com/harfbuzz/icu-le-hb/releases/tag/2.0.0

harfbuzz / icu-le-hb

Unexpected results with characters beyond BMP #20