kanonka closed this issue 10 months ago.
Hi @kanonka , I already mentioned the right way to implement this in issue #119. Here it is (BTW, -1 will never be inserted, so you don't need the loop at the end):
```cpp
class Test
{
    StrIntParallelMap m_stringsMap;
    volatile long m_curIdx;
public:
    int add(const SimpleString& str)
    {
        int newIndex = -1;
        m_stringsMap.lazy_emplace_l(
            str,
            [&](Map::value_type& p) {
                newIndex = p.second;
            },                                // called only when key was already present
            [&](const Map::constructor& ctor) // construct value_type in place when key not present
            {
                newIndex = InterlockedIncrement(&m_curIdx);
                ctor(str, newIndex);
            });
        return newIndex;
    }
};
```
Please let me know how this performs.
I was using your example from #119; the code provided in the question was just for interoperability with concurrent_unordered_map in the test.
Yes, the very same behavior, no difference. The original profiling of the code in question showed the bottleneck at map.find(str). Let me see what profiling shows now.
I think the problem is that I'm hitting the very same bucket and same key all the time (like 99.9% of the time), which causes contention. But then why does concurrent_unordered_map handle this case with no problem?
PS. Ok, just tested. The code you provided runs in 193 seconds instead of 233 seconds, but that is still a far cry from the 17 seconds of concurrent_unordered_map. The hot path is at lines 3787 to 3790 in phmap.h:
template
Yes, you are right that the issue is that you're hitting the same submap (and therefore the same mutex) all the time. Right now the API always takes a write lock, but I could update the code to take a read lock for the find, and a write lock only if the key is not present and the hash map will be modified. That would probably help a lot for this use case.
@kanonka , do you have a complete test program I could use to reproduce this issue?
I will craft one tomorrow and upload.
Actually, please try your original code (the one with m_stringsMap.find(str);), and the following definition for LockableImpl&lt;srwlock&gt;:
```cpp
class srwlock
{
    SRWLOCK _lock;
public:
    srwlock()              { InitializeSRWLock(&_lock); }
    void lock()            { AcquireSRWLockExclusive(&_lock); }
    void unlock()          { ReleaseSRWLockExclusive(&_lock); }
    bool try_lock()        { return TryAcquireSRWLockExclusive(&_lock) ? true : false; }
    void lock_shared()     { AcquireSRWLockShared(&_lock); }
    void unlock_shared()   { ReleaseSRWLockShared(&_lock); }
    bool try_lock_shared() { return TryAcquireSRWLockShared(&_lock) ? true : false; }
};
```
```cpp
namespace phmap {
    template<>
    class LockableImpl<srwlock> : public srwlock
    {
    public:
        using mutex_type      = srwlock;
        using Base            = LockableBaseImpl<srwlock>;
        using SharedLock      = typename Base::ReadLock;
        using UpgradeLock     = typename Base::WriteLock;
        using UniqueLock      = typename Base::WriteLock;
        using SharedLocks     = typename Base::ReadLocks;
        using UniqueLocks     = typename Base::WriteLocks;
        using UpgradeToUnique = typename Base::DoNothing; // we already have unique ownership
    };
}
```
I tried that code. It gave a significant speed improvement (77 seconds now), but it introduced a bug: sometimes (like 1-3 times out of ~240 million calls) I'm getting an uninitialized value in f->second, specifically in this part of the code (the very beginning):

```cpp
if (f != m_stringsMap.end())
    newIndex = f->second; // could be < 0 if just inserted by another thread
else
    ...
```
Attached the test case code. Unfortunately, in this test code the bug does not manifest itself for whatever reason, but in my real program (on real data) it does.
Thanks @kanonka , I can see what the issue is when you use find. The right solution would be to improve lazy_emplace_l.
Is there a reason why you don't just switch to concurrency::concurrent_unordered_map?
I have a problem with concurrency::concurrent_unordered_map: it takes practically forever to destroy when the key is anything but a simple type (int/double, etc.). In my program I'm actively copying/destroying these maps. Even the simplest concurrent_unordered_map with a string key and ~50K entries takes ~10 seconds to destroy, which is totally unacceptable, and I found no way to speed that up. Your map is brilliant in the sense that copy and destroy are almost immediate. I create the map maybe once or twice during the program's lifetime, but copying/destroying the copy happens hundreds of times as the user interacts with the program, so I have to balance which performance matters more: creation or destruction.
If you can improve lazy_emplace_l, it would be great!
Hi @kanonka ,
I have improved lazy_emplace_l in the try_emplace_fine_locking branch. Please try this branch.
You will also have to add `using ReadWriteLock = typename Base::ReadWriteLock;` to the LockableImpl, so it will look like this:
```cpp
namespace phmap {
    template<>
    class LockableImpl<srwlock> : public srwlock
    {
    public:
        using mutex_type      = srwlock;
        using Base            = LockableBaseImpl<srwlock>;
        using SharedLock      = typename Base::ReadLock;
        using ReadWriteLock   = typename Base::ReadWriteLock;
        using UpgradeLock     = typename Base::WriteLock;
        using UniqueLock      = typename Base::WriteLock;
        using SharedLocks     = typename Base::ReadLocks;
        using UniqueLocks     = typename Base::WriteLocks;
        using UpgradeToUnique = typename Base::DoNothing; // we already have unique ownership
    };
}
```
Your test program is still slower than concurrent_unordered_map, but about 5 times faster than before.
Thank you! Real-life data is now down to 67 sec (from 193), and no strange goofy values so far. I will let it run overnight for multiple cycles to make sure no bugs were introduced, and will do some more testing tomorrow on the rest of the functionality. But so far so good. Man, you are my hero! Thanks again!
Glad to hear and to be of help. Please let me know tomorrow how your testing goes and if it is all good I'll merge the branch. Thanks for pointing out this edge case!
@kanonka I realized there was a flaw in my change from yesterday. Please update to the latest version of the branch. It might be a little bit slower.
Just tried. Speed is the same (~67 sec). Overnight testing didn't show any problem.
Great. We should be good then!
I did some testing of the rest of the functionality - seems good. I guess this can now go to the main branch.
Thank you so much - case resolved!
@kanonka I just added a definition for srwlock in the phmap headers, so you can use phmap::srwlock as a mutex without defining it in your code.
Hi, I have a specific case of data when parallel hash map is performing quite slowly.
Type declaration:
```cpp
typedef std::string SimpleString;

class srwlock
{
    SRWLOCK _lock;
public:
    srwlock()     { InitializeSRWLock(&_lock); }
    void lock()   { AcquireSRWLockExclusive(&_lock); }
    void unlock() { ReleaseSRWLockExclusive(&_lock); }
};

typedef phmap::parallel_flat_hash_map<SimpleString, int,
    phmap::priv::hash_default_hash<SimpleString>,
    phmap::priv::hash_default_eq<SimpleString>,
    std::allocator<std::pair<const SimpleString, int>>, 8, srwlock> StrIntParallelMap;
```
Class declaration (body truncated in the original post):

```cpp
class Test
{
    StrIntParallelMap m_stringsMap;
    volatile long m_curIdx;
public:
    int add(const SimpleString& str)
    {
        int newIndex = -1;
        ...
    }
};
```
This works fine on most datasets. But one is weird: there are about 240 million string values that I'm parsing, and I call "add" for each value. About 239 million of them are the very same string of ~150 chars. The others are pretty much random, but shorter. So, when calling add(value) in a parallel_for loop, my CPU load drops to ~19%, and the process takes about 233 seconds. If I replace StrIntParallelMap with concurrency::concurrent_unordered_map&lt;SimpleString, int&gt;, there is no CPU load drop, and the process finishes much faster, in ~17 seconds.
Any idea?