greg7mdp / parallel-hashmap

A family of header-only, very fast and memory-friendly hashmap and btree containers.
https://greg7mdp.github.io/parallel-hashmap/
Apache License 2.0
2.47k stars 234 forks

Slow in a specific case #215

Closed kanonka closed 10 months ago

kanonka commented 10 months ago

Hi, I have a specific case of data when parallel hash map is performing quite slowly.

Type declaration:

typedef std::string SimpleString;

class srwlock
{
    SRWLOCK _lock;

public:
    srwlock() { InitializeSRWLock(&_lock); }
    void lock() { AcquireSRWLockExclusive(&_lock); }
    void unlock() { ReleaseSRWLockExclusive(&_lock); }

    bool try_lock() { return TryAcquireSRWLockExclusive(&_lock) ? true : false; }
    void lock_shared() { AcquireSRWLockShared(&_lock); }
    void unlock_shared() { ReleaseSRWLockShared(&_lock); }
    bool try_lock_shared() { return TryAcquireSRWLockShared(&_lock) ? true : false; }
};

typedef phmap::parallel_flat_hash_map<SimpleString, int, phmap::priv::hash_default_hash, phmap::priv::hash_default_eq, std::allocator<std::pair<const SimpleString, int>>, 8, srwlock> StrIntParallelMap;


Class declaration:

class Test
{
    StrIntParallelMap m_stringsMap;
    volatile long     m_curIdx;

public:
    int add(const SimpleString& str)
    {
        int newIndex = -1;

        auto f = m_stringsMap.find(str);
        if (f != m_stringsMap.end())
            newIndex = f->second; // could be < 0 if just inserted by another thread
        else if (m_stringsMap.insert(std::make_pair(str, newIndex)).second)
        {
            newIndex = InterlockedIncrement(&m_curIdx);
            m_stringsMap[str] = newIndex;
            return newIndex;
        }

        while (newIndex < 0)
        {
            std::this_thread::yield();
            newIndex = m_stringsMap.find(str)->second;
        }
        return newIndex;
    }
};


This works fine on most datasets. But one is weird: there are about 240 million string values that I'm parsing, and I call "add" for each value. About 239 million of them are the very same string of ~150 chars. The others are pretty much random, but shorter. So, when calling "add(value)" in a parallel_for loop, my CPU load drops to ~19%, and the process takes about 233 seconds. If I replace StrIntParallelMap with concurrency::concurrent_unordered_map<SimpleString, int>, there is no CPU load drop, and the process finishes much faster, in ~17 seconds.
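For reference, the skewed input can be approximated at small scale like this (a hypothetical generator, not the original parser; the function name and parameters are illustrative):

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Scaled-down sketch of the dataset described above: one ~150-char "hot"
// string dominates the input, the rest are short random strings.
std::vector<std::string> make_skewed_dataset(std::size_t total, double hot_fraction) {
    std::vector<std::string> data;
    data.reserve(total);
    const std::string hot(150, 'x');      // stand-in for the dominant value
    std::mt19937 rng(42);                 // fixed seed for reproducibility
    std::uniform_int_distribution<int> len(5, 20);
    std::uniform_int_distribution<int> ch('a', 'z');
    for (std::size_t i = 0; i < total; ++i) {
        if (i < static_cast<std::size_t>(total * hot_fraction)) {
            data.push_back(hot);          // ~99.6% identical keys in the real data
        } else {
            std::string s;
            int n = len(rng);
            for (int j = 0; j < n; ++j)
                s.push_back(static_cast<char>(ch(rng)));
            data.push_back(std::move(s));
        }
    }
    return data;
}
```

Feeding such a dataset to add() from a parallel loop reproduces the heavy-contention access pattern without needing the real 240M-value input.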

Any idea?

greg7mdp commented 10 months ago

Hi @kanonka , I already described the right way to implement this in issue #119. Here it is (BTW, -1 will never be inserted, so you don't need the loop at the end):

class Test
{
    StrIntParallelMap m_stringsMap;
    volatile long     m_curIdx;

public:
    int add(const SimpleString& str)
    {
        int newIndex = -1;
        m_stringsMap.lazy_emplace_l(
            str,
            [&](StrIntParallelMap::value_type& p) {
                newIndex = p.second;
            },                                          // called only when the key was already present
            [&](const StrIntParallelMap::constructor& ctor) // construct value_type in place when key not present
            {
                newIndex = InterlockedIncrement(&m_curIdx);
                ctor(str, newIndex);
            });
        return newIndex;
    }
};

Please let me know how this performs.

kanonka commented 10 months ago

I was using your example from #119; the code provided in the question was just for interchangeability with concurrent_unordered_map in the test.

Yes, the very same behavior, no difference. Original profiling of the code in the question showed the bottleneck at map.find(str). Let me see what profiling shows now.

I think the problem is that I'm hitting the very same bucket and the same key all the time (like 99.9% of the time), which causes contention. But then why does concurrent_unordered_map handle this case with no problem?

PS. OK, just tested. The code you provided takes 193 seconds instead of 233 seconds, but that is still a far cry from the 17 sec of concurrent_unordered_map. The hot path (lines 3787 to 3790 in phmap.h):

template <class K>
std::tuple<Inner*, size_t, bool>
find_or_prepare_insert_with_hash(size_t hashval, const K& key,
                                 typename Lockable::UniqueLock& mutexlock)
{
    Inner& inner = sets_[subidx(hashval)];
    auto&  set   = inner.set_;
    mutexlock    = std::move(typename Lockable::UniqueLock(inner));
    auto p = set.find_or_prepare_insert(key, hashval); // std::pair<size_t, bool>
    return std::make_tuple(&inner, p.first, p.second);
}
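To see why one hot key serializes everything: the submap is chosen purely from the key's hash, so identical keys always land on the same submap and therefore the same mutex. A small sketch (with the template parameter N = 8 the map shards into 2^8 = 256 submaps; the bit-mixing below is an assumed stand-in for phmap's subidx(), not the real formula):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// With N = 8, parallel_flat_hash_map shards into 2^8 = 256 submaps, each
// behind its own lock. The mixing here is illustrative, not phmap's code.
constexpr std::size_t kNumSubmaps = 256;

std::size_t submap_index(const std::string& key) {
    std::size_t h = std::hash<std::string>{}(key);
    return (h ^ (h >> 16)) & (kNumSubmaps - 1); // same key -> same submap, always
}
```

Since ~99.6% of the calls carry the same key, ~99.6% of the threads queue up on one of the 256 locks while the other 255 sit idle, which matches the observed ~19% CPU load.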

greg7mdp commented 10 months ago

Yes, you are right that the issue is that you're hitting the same submap (and therefore the same mutex) all the time. Right now the API always takes a write lock, but I could update the code to take a read lock for the find, and a write lock only when the key is not present and the hash map will be modified. That would probably help a lot for this use case.
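The read-then-upgrade pattern described above can be sketched with standard C++ (a minimal illustration using std::shared_mutex and std::unordered_map, not phmap's internals; since std::shared_mutex has no atomic upgrade, the sketch releases the shared lock, re-acquires exclusively, and re-checks):

```cpp
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Sketch of "read lock for the find, write lock only on insert".
// All names here are illustrative; this is not phmap's implementation.
class ConcurrentIndexMap {
    std::unordered_map<std::string, int> map_;
    mutable std::shared_mutex mtx_;
    int next_ = 0;

public:
    int add(const std::string& s) {
        {   // fast path: many readers of the hot key proceed in parallel
            std::shared_lock<std::shared_mutex> rl(mtx_);
            auto it = map_.find(s);
            if (it != map_.end())
                return it->second;
        }
        // slow path: re-acquire exclusively and re-check, because another
        // thread may have inserted the key between the two locks
        std::unique_lock<std::shared_mutex> wl(mtx_);
        auto it = map_.find(s);
        if (it != map_.end())
            return it->second;
        int idx = next_++;
        map_.emplace(s, idx);
        return idx;
    }
};

ConcurrentIndexMap g_index; // shared instance, for the usage example below
```

The re-check under the exclusive lock is what keeps the pattern correct: dropping it would let two threads insert the same key with different indices.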

greg7mdp commented 10 months ago

@kanonka , do you have a complete test program I could use to reproduce this issue?

kanonka commented 10 months ago

I will craft one tomorrow and upload.

greg7mdp commented 10 months ago

Actually, please try your original code (with the m_stringsMap.find(str)), and the following definition for LockableImpl<srwlock>:

class srwlock
{
    SRWLOCK _lock;

public:
    srwlock() { InitializeSRWLock(&_lock); }
    void lock() { AcquireSRWLockExclusive(&_lock); }
    void unlock() { ReleaseSRWLockExclusive(&_lock); }

    bool try_lock() { return TryAcquireSRWLockExclusive(&_lock) ? true : false; }
    void lock_shared() { AcquireSRWLockShared(&_lock); }
    void unlock_shared() { ReleaseSRWLockShared(&_lock); }
    bool try_lock_shared() { return TryAcquireSRWLockShared(&_lock) ? true : false; }
};

namespace phmap {

template<>
class LockableImpl<srwlock> : public srwlock
{
public:
    using mutex_type      = srwlock;
    using Base            = LockableBaseImpl<srwlock>;
    using SharedLock      = typename Base::ReadLock;
    using UpgradeLock     = typename Base::WriteLock;
    using UniqueLock      = typename Base::WriteLock;
    using SharedLocks     = typename Base::ReadLocks;
    using UniqueLocks     = typename Base::WriteLocks;
    using UpgradeToUnique = typename Base::DoNothing; // we already have unique ownership
};

}
kanonka commented 10 months ago

I tried that code. It gives a significant speed improvement (77 seconds now), but introduces a bug: sometimes (like 1-3 times out of ~240 million calls) I get an uninitialized value in f->second, specifically in this part of the code (very beginning):

if (f != m_stringsMap.end())
    newIndex = f->second; // could be < 0 if just inserted by another thread
else
    ...
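The race window behind those occasional bad reads can be re-enacted single-threaded (illustrative names, not phmap code): the original add() publishes the sentinel -1 first and writes the real index in a second step, so a reader that lands between the two steps observes -1, and with the plain find path nothing forces it to wait:

```cpp
#include <string>
#include <unordered_map>

// Single-threaded re-enactment of the two-step insert race in the original
// add(): the sentinel -1 becomes visible before the real index is written.
std::unordered_map<std::string, int> g_map;

int writer_step1(const std::string& key) {            // step 1: publish sentinel
    g_map.emplace(key, -1);
    return g_map.at(key);
}
int reader_find(const std::string& key) {             // a concurrent reader lands here
    auto it = g_map.find(key);
    return it == g_map.end() ? -2 : it->second;       // -2: key not present at all
}
int writer_step2(const std::string& key, int idx) {   // step 2: publish real index
    g_map[key] = idx;
    return g_map.at(key);
}
```

In the real multi-threaded code the reader must either loop until the value turns non-negative (as the original did) or, better, perform find and insert as one atomic operation, which is what lazy_emplace_l provides.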

kanonka commented 10 months ago

Test.zip

Attached is the test case code. Unfortunately, the bug does not manifest itself in this test code for whatever reason, but it does in my real program (on real data).

greg7mdp commented 10 months ago

Thanks @kanonka , I can see what the issue is when you use find. The right solution would be to improve lazy_emplace_l.

Is there a reason why you don't just switch to concurrency::concurrent_unordered_map?

kanonka commented 10 months ago

I have a problem with concurrency::concurrent_unordered_map - it takes forever to destroy when the key is anything but a simple type (int/double, etc.). In my program I'm actively copying/destroying these maps. Even the simplest concurrent_unordered_map with a string key and ~50K entries takes ~10 seconds to destroy, which is totally unacceptable, and I found no way to speed that up. Your map is brilliant in the sense that copy and destroy are almost immediate. I create a map maybe once or twice during the program's lifetime, but copying/destroying the copy happens hundreds of times as the user interacts with the program, so I have to balance which performance matters more - creation or destruction.

If you can improve lazy_emplace_l, it would be great!

greg7mdp commented 10 months ago

Hi @kanonka ,

I have improved lazy_emplace_l in the branch try_emplace_fine_locking. Please try this branch. You will also have to add using ReadWriteLock = typename Base::ReadWriteLock; to the LockableImpl, so it will look like this:

namespace phmap {

    template<>
    class LockableImpl<srwlock> : public srwlock
    {
    public:
        using mutex_type    = srwlock;
        using Base          = LockableBaseImpl<srwlock>;
        using SharedLock    = typename Base::ReadLock;
        using ReadWriteLock = typename Base::ReadWriteLock;
        using UpgradeLock   = typename Base::WriteLock;
        using UniqueLock    = typename Base::WriteLock;
        using SharedLocks   = typename Base::ReadLocks;
        using UniqueLocks   = typename Base::WriteLocks;
        using UpgradeToUnique = typename Base::DoNothing; // we already have unique ownership
    };
}
greg7mdp commented 10 months ago

Your test program is still slower than concurrent_unordered_map, but about 5 times faster than before.

kanonka commented 10 months ago

Thank you! Real-life data is now down to 67 sec (from 193), and no strange goofy values so far. I will let it run overnight for multiple cycles to make sure no bugs were introduced, and will do more testing tomorrow on the rest of the functionality. But so far so good. Man, you are my hero! Thanks again!

greg7mdp commented 10 months ago

Glad to hear and to be of help. Please let me know tomorrow how your testing goes and if it is all good I'll merge the branch. Thanks for pointing out this edge case!

greg7mdp commented 10 months ago

@kanonka I realized there was a flaw in my change from yesterday. Please update to the latest version of the branch. It might be a little bit slower.

kanonka commented 10 months ago

Just tried. Speed is the same (~67 sec). Overnight testing didn't show any problems.

greg7mdp commented 10 months ago

Great. We should be good then!

kanonka commented 10 months ago

I did some testing of the rest of the functionality - seems good. I guess this can now go to the main branch.

kanonka commented 10 months ago

Thank you so much - case resolved!

greg7mdp commented 9 months ago

@kanonka I just added a definition for srwlock in the phmap headers, so you can use phmap::srwlock as a mutex without defining it in your code.