PushID() performance - Githubissues

wolfpld commented 4 years ago

Function ImHashData() has comment FIXME-OPT: Replace with e.g. FNV1a hash? CRC32 pretty much randomly access 1KB. Need to do proper measurements.. I do happen to have such measurements.

Symbol PushID is constructed using the base function and inlined functions ImHashData(), ImVector<>::back(), ImVector<>::push_back() and ImGuiWindow::GetIDNoKeepAlive(). 91.6% of random PC samples hitting this symbol are in the ImHashData() function.

Calling ImHashData() on 32-bit systems is a bit pointless, as it just shuffles input bits and returns a 32-bit value.

On 64-bit systems the 64-bit input value is reduced to a 32-bit hash. This would not be needed, if ImGuiID would have native pointer size.

Note: PushID() shows rather high in the profiling statistics, but I wouldn't consider it a hot spot.

ocornut commented 4 years ago

Thank you Bartosz.

Note: PushID() shows rather high in the profiling statistics, but I wouldn't consider it a hot spot.

Do you have even ballpark figure for your test case (e.g I assume Tracy itself?). There are three versions of PushID() and the ones working on strings are using ImHashStr() with variable length data, where your comment on hashing small values (32 or 64 bits) may not be applicable the same way. I don't have Metrics so I'd assume a typical app would hash more strings that int/void* sized elements. Even so, I'm not sure what you imply with the "it just shuffles input bits and returns a 32-bit value." comment.

Musing about the possibility of changing ImGuiID storage:

I always intuitively expected there would be value in keeping hashed ID value stables across builds (32-bit vs 64-bit), for debugging, etc. but right now I can't backup this claim very far. I'm expecting that an hypothetical/desirable macro recording feature of automation system may need to save hashed id on disk as a fallback when unable to identify the source of id stack, element but even that feature would severely fail if pointers are involved so can't really be relied upon.

In 2020 it may also be reasonable to have a bias toward 64-bit archs, and say, define ImGuIID as always 64-bit if there is a benefit to be gotten from it.

wolfpld commented 4 years ago

There are three versions of PushID() and the ones working on strings are using ImHashStr() with variable length data, where your comment on hashing small values (32 or 64 bits) may not be applicable the same way. I don't have Metrics so I'd assume a typical app would hash more strings that int/void* sized elements. Even so, I'm not sure what you imply with the "it just shuffles input bits and returns a 32-bit value." comment.

I'm using PushID(const void* ptr_id) here (and the input data is a pointer). On 32-bit systems pointer size is 32-bit, so calculating the hash transforms a 32-bit pointer into a 32-bit int. Looking more closely at the code, this bit shuffling even happens in the PushID(int) case, in ImGuiWindow::GetIDNoKeepAlive(int).

Do you have even ballpark figure for your test case (e.g I assume Tracy itself?).

Yes, this was Tracy self-profile session, targetting slowdown when a relatively large list of messages was displayed. The code looked something like:

for ptr in msgs:
    PushID( ptr )
    Selectable( ptr->time )
    NextColumn()
    TextWrapped( ptr->msg )
    ...

And the statistical profling showed the following (each entry is a symbol, and PushID is also expanded to show individual measurements in its inlined functions):

obraz

(I have since resigned from wrapping, as it prevented clipper usage.)

In 2020 it may also be reasonable to have a bias toward 64-bit archs, and say, define ImGuIID as always 64-bit if there is a benefit to be gotten from it.

Can't really comment on pros and cons about that, but replacing crc32 with e.g. xxh3 seems like a sure bet.

http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html

ocornut commented 4 years ago

Thanks for the details.

On 32-bit systems pointer size is 32-bit, so calculating the hash transforms a 32-bit pointer into a 32-bit int. Looking more closely at the code, this bit shuffling even happens in the PushID(int) case, in ImGuiWindow::GetIDNoKeepAlive(int).

That's correct, but it seems desirable otherwise those functions would have no point? Remember that all those ImHashXXX data are called with top of the stack as hash seed, so I'm not sure what you imply should be changed there.

I have since resigned from wrapping, as it prevented clipper usage

Yes that makes things a bit more complex :( I'd like to eventually work on exploring solutions to facilitate clipping in wider scenarios.

wolfpld commented 4 years ago

Remember that all those ImHashXXX data are called with top of the stack as hash seed, so I'm not sure what you imply should be changed there.

This may be the important bit that I've missed.

Anyways, some quickly hacked benchmarks, doing 1 bln PushID + PopID. Current code:

obraz

Pass pointer by value to ImHashData, don't read byte-by-byte (very slow on all modern uarchs), use logical and and shift instead:

obraz

Using XXH3 resulted in inlining PushID into parent function (showing inline functions here, not grouped into symbols; XXH3_rrmxmx and XXH_xorshift64 are measured correctly, as two different functions, even if the identical time would indicate otherwise).

obraz

ocornut commented 4 years ago

Thanks.

If you have a patch to share (even pasting func as is) for the optimized crc32 hash I’d be interested.

XXH3 sources seems huge (I can see why) so maybe i’ll pass in the short-term considering the relative lightweightness of those calls in the grand scheme of the frame, but maybe we can make it an option imconfig option to allow it and let user provide the sources.

wolfpld commented 4 years ago

Posting with original code for comparison. Note that this is completely untested.

ImU32 ImHashData(const void* data_p, size_t data_size, ImU32 seed)
{
    ImU32 crc = ~seed;
    const unsigned char* data = (const unsigned char*)data_p;
    const ImU32* crc32_lut = GCrc32LookupTable;
    while (data_size-- != 0)
        crc = (crc >> 8) ^ crc32_lut[(crc & 0xFF) ^ *data++];
    return ~crc;
}

ImU32 ImHashDataPtr(const void* data_p, ImU32 seed)
{
    ImU32 crc = ~seed;
    uint64_t data = (uint64_t)data_p;
    const ImU32* crc32_lut = GCrc32LookupTable;
    for (int i=0; i<8; i++)
    {
        crc = (crc >> 8) ^ crc32_lut[(crc & 0xFF) ^ (data & 0xFF)];
        data >>= 8;
    }
    return ~crc;
}

Since this is a special-case implementation, there's no need to pass data size, and the data_p variable interface can be modified to pass by value, and not by pointer.

The general implementation can be modified to have a fast path (reading 8 bytes at a time), then fallback to smaller read widths at the end (doing it in a 4, 2, 1 cascade vs reading the remaining up-to-7-bytes byte-by-byte is a balancing act between performance and code size). Special care needs to be taken to prevent misaligned reads (most systems handle it just fine, but it's UB and it will crash on some). A simple memcpy instead of pointer dereference is enough to prevent this, while producing identical machine code.

Of course, 32-bit implementation should be done a bit different. Maybe templates would be useful here?

I don't know if you care about endianness. Little endian systems will get the same value from both functions. Results on big endian will be different due to reversed byte order.

sergeyn commented 4 years ago

PR#3022 uses hw accelerated crc32 (for intel only for now). It'd be interesting to see how it behaves in your environment.

I also found out that ARM has hw accelerated crc32, I can extend this PR when I get to test on arm hardware.

Regards.

On Tue, Jul 21, 2020 at 12:20 AM Bartosz Taudul notifications@github.com wrote:

Posting with original code for comparison. Note that this is completely untested.

ImU32 ImHashData(const void data_p, size_t data_size, ImU32 seed) { ImU32 crc = ~seed; const unsigned char data = (const unsigned char)data_p; const ImU32 crc32_lut = GCrc32LookupTable; while (data_size-- != 0) crc = (crc >> 8) ^ crc32_lut[(crc & 0xFF) ^ *data++]; return ~crc; }

ImU32 ImHashDataPtr(const void data_p, ImU32 seed) { ImU32 crc = ~seed; uint64_t data = (uint64_t)data_p; const ImU32 crc32_lut = GCrc32LookupTable; for (int i=0; i<8; i++) { crc = (crc >> 8) ^ crc32_lut[(crc & 0xFF) ^ (data & 0xFF)]; data >>= 8; } return ~crc; }

Since this is a special-case implementation, there's no need to pass data size, and the data_p variable interface can be modified to pass by value, and not by pointer.

The general implementation can be modified to have a fast path (reading 8 bytes at a time), then fallback to smaller read widths at the end (doing it in a 4, 2, 1 cascade vs reading the remaining up-to-7-bytes byte-by-byte is a balancing act between performance and code size). Special care needs to be taken to prevent misaligned reads (most systems handle it just fine, but it's UB and it will crash on some). A simple memcpy instead of pointer dereference is enough to prevent this, while producing identical machine code.

Of course, 32-bit implementation should be done a bit different. Maybe templates would be useful here?

I don't know if you care about endianness. Little endian systems will get the same value from both functions. Results on big endian will be different due to reversed byte order.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ocornut/imgui/issues/3358#issuecomment-661378314, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFDDOV7PFATYPO4JCVZMXLR4S7KDANCNFSM4PBS6UOA .

ocornut commented 4 years ago

Right now at this very moment the complexity of both HW-enabled CRC32 and full on XXH3 makes them not super attractive for a quick merge, we've probably going to have to balance complexity for speed a little given hashing is not showing that high. The easy win posted above with regular data ptr however is an easy pick.

Most hash functions are going to be faster than CRC32 (that graph linked above shows XXH3 as magnitude faster, but those are for large amount of data, and we want to focus on small). One thing I worry about all the arch/compiler dependent stuff in #3022 is that it is certainly going to break and we're going to have to maintain that ourselves and those are tricky to maintain build compatibility for. The equivalent stuff in e.g. XXH3 is even more complex but it's going to be maintained by someone else so that's potentially reassuring.

malamanteau commented 4 years ago

With a 32 bit hash, you have a 25% chance of a collision after 50,000 hashes. Is there anything in dear imgui that protects against this? What happens if you do have a collision?

ocornut commented 4 years ago

With a 32 bit hash, you have a 25% chance of a collision after 50,000 hashes. Is there anything in dear imgui that protects against this? What happens if you do have a collision?

A collision may trigger the wrong widgets when you click on the other. You can test this by simply outputting two buttons with the same label.

We could add an extra layer of protection by comparing the window when comparing ActiveId, which would narrow the effect of collisions to submitted widgets within the same window.

wolfpld commented 4 years ago

We could add an extra layer of protection by comparing the window when comparing ActiveId, which would narrow the effect of collisions to submitted widgets within the same window.

Wouldn't this remove the need to use parent's seed for hashing, and hashing altogether for the 32 or 64-bit key case?

ocornut commented 4 years ago

Wouldn't this remove the need to use parent's seed for hashing, and hashing altogether for the 32 or 64-bit key case?

That's unrelated. The ID stack is based on the principle of inheriting seed else little of the logic would make sense.

Technically we could reduce likehood of collision further by comparing even more data of the to-become-active widget (such as bounding box) but at that point adopting 64-bit keys may be simply easier.

malamanteau commented 4 years ago

I think 64 bits will be good, at that size you have a 25% chance of a collision after 3.2 billion hashes. For most tools this should be sufficiently rare.

What are some potential strategies to eliminate collisions in dear imgui entirely?

I'd love to have a compile time option to not use hashing at all and use a stack approach or something like that (would use more memory), but that would require way more work I'd imagine?

metarutaiga commented 4 years ago

I think some methods.

Runtime :

COFF .rdata section mapping

if (InRDATASection(name))
{
return name - rdata + rdataHash;
}
return ImHashStr(name);

C++11 lambda

#define ImHashConstString(name) []() { static int hash = ImHashStr(name); return hash; }()

Compile time:

C++11 template<const char*>

template<const char* name> struct ImHashConstString
{
static constexpr int getHash()
{
    return ImHashStr(name);
}
};
char n[] = name;
int hash =  ImHashConstString<n>::getHash();

ocornut commented 4 years ago

I beliece are misunderstanding the core point that 99% of ImHash calls are seeded based on the value at the top of ID Stack, so this wouldn’t work.

metarutaiga commented 4 years ago

I beliece are misunderstanding the core point that 99% of ImHash calls are seeded based on the value at the top of ID Stack, so this wouldn’t work.

It means the hashed value XOR seed value right?

ocornut commented 4 years ago

That would make the operation commutative and increase risk of collisions. Also we rely on the possibility to append data into the hash, sometimes in a piece-wise manner, and this wouldn’t work.

If we care about perfs we can probably switch to a 64 bits hash function dumber than crc32 to avoid touching 1kb of hot data. I believe we are overthinking this topic.

metarutaiga commented 4 years ago

That would make the operation commutative and increase risk of collisions. Also we rely on the possibility to append data into the hash, sometimes in a piece-wise manner, and this wouldn’t work. If we care about perfs we can probably switch to a 64 bits hash function dumber than crc32 to avoid touching 1kb of hot data. I believe we are overthinking this topic.

So you would like to make a macro like IMGUI_CUSTOM_HASH in user_config.h, let someone want to challenge his hash codes? I think any platform dependency would be make codes complex and large.

ocornut commented 4 years ago

Feel free to add #if 0 around existing implemention and add your own in your branch. Right now from your previous post I think you are misunderstanding the situation and seeing a problem that likely is not really affecting you. On my end this is draining energy away from important problems so I don’t want to be involved with adding extra wrapping/defines.

ocornut commented 4 years ago

@wolfpld To clarify:

Wouldn't this remove the need to use parent's seed for hashing, and hashing altogether for the 32 or 64-bit key case?

Not at all. The "parent seed" only starts as window->ID but it changes with every PushID(), TreeNode() calls and other functions pushing into the id stack.

rodrigorc commented 1 year ago

Feel free to add #if 0 around existing implemention and add your own in your branch.

Sorry for reviving this thread, but it is still marked as "Open", so I took your advice here and added a 64-bit fasthash implementation, adapted from here (MIT licensed). With 64-bit hashes the risk of non-malicious collisions should be astronomically small.

My somewhat artificial benchmarks show that it is as fast as the current CRC32 for short strings (<10 chars) and up to 3 times faster for longer strings. (Maybe in an older non-64-bit machine it would be slower?)

Here is the code in case anyone wishes to try it. I think it works perfectly fine for me:

// imgui.h
typedef ImU64               ImGuiID;// A unique ID used by widgets (typically the result of hashing a stack of string)

// imgui.cpp
static inline ImU64 fasthash_mix(ImU64 h)
{
    h ^= h >> 23;
    h *= 0x2127599bf4325c37ULL;
    h ^= h >> 47;
    return h;
}

static ImU64 fasthash64(const void *buf, size_t len, ImU64 seed)
{
    const ImU64 m = 0x880355f21e6d1965ULL;
    const unsigned char *pos = (const unsigned char *)buf;
    const unsigned char *end = pos + (len / 8) * 8;
    ImU64 h = seed ^ (len * m);
    ImU64 v;

    while (pos != end) {
        // Beware of unaligned data, a good compiler in an architecture that allows 
        // unaligned access should compile this into a single load instruction.
        memcpy(&v, pos, 8);
        pos += 8;
        h ^= fasthash_mix(v);
        h *= m;
    }

    v = 0;

    switch (len & 7) {
    case 7: v ^= (ImU64)pos[6] << 48;
    case 6: v ^= (ImU64)pos[5] << 40;
    case 5: v ^= (ImU64)pos[4] << 32;
    case 4: v ^= (ImU64)pos[3] << 24;
    case 3: v ^= (ImU64)pos[2] << 16;
    case 2: v ^= (ImU64)pos[1] << 8;
    case 1: v ^= (ImU64)pos[0];
            h ^= fasthash_mix(v);
            h *= m;
    }
    return fasthash_mix(h);
}

// Known size hash
// It is ok to call ImHashData on a string with known length but the ### operator won't be supported.
ImGuiID ImHashData(const void* data_p, size_t data_size, ImU64 seed)
{
    return fasthash64(data_p, data_size, seed);
}

// Zero-terminated string hash, with support for ### to reset back to seed value
// We support a syntax of "label###id" where only "###id" is included in the hash, and only "label" gets displayed.
// - If we reach ### in the string we discard the hash so far and reset to the seed.
ImGuiID ImHashStr(const char* data_p, size_t data_size, ImU64 seed)
{
    if (!data_size)
        data_size = strlen(data_p);

    const void *start = memmem(data_p, data_size, "###", 3);
    if (start)
    {
        data_size -= (const char*)start - data_p;
        data_p = (const char*)start;
    }
    return fasthash64(data_p, data_size, seed);
}

ocornut / imgui

PushID() performance #3358