ktprime / emhash

Fast and memory efficient c++ flat hash map/set
MIT License
471 stars 35 forks source link

rehash(), insert() etc incorrect "noexcept" modifier. result of malloc can return nullptr #1

Closed hordi closed 4 years ago

hordi commented 4 years ago

After quick-look in the code of hashtable7.hpp... rehash(), insert() etc functions can't have "noexcept" modifier because: 1.Any new-ctor can throw an exception (NEW_KVALUE etc macro). 2.Need to check result of "malloc" and thow bad_alloc exception if it's nullptr 3.For performance reason (may be - need to check) has sense try to replace combination of "free,malloc" to "realloc" 4.Exceptions unsafe (non-consistent state)

In next function (looks like not only this place, but in anyway) if clone throw an exception (any ctor of Key or Value can do it) we have as minimum the memory leak, as maximum - dtor call missing for group of successfully cloned objects. HashMap(const HashMap& other) { _pairs = (PairT)malloc((2 + other._num_buckets) sizeof(PairT) + other._num_buckets / 8 + sizeof(uint64_t)); clone(other); //THROWN EXCEPTION }

  1. in the ::clone function instead of std::is_pod it's better to use std::is_trivially_copyable 6.key_eq() should return reference type (not object). Nobody knows how complex that type can be.
ktprime commented 4 years ago

thanks for your advice, I will fix them. In order to to compiled by c++ 0x on some gcc lower version(ex 4.6.x) it use some not safe way for point 5.

hordi commented 4 years ago

You could use std::is_pod for compiler that doesn't support C++11 and more powerfull function for C++11. But in anyway that code is safe right now.

hordi commented 4 years ago

Also currently I see only ::insert and ::insert_unique functions which supports strongly or"const" or "movable&&" parameters. Properly, for good performance, similar function should also accept any combination of "const and &&".

ktprime commented 4 years ago

You are right. I need some time to fix the api problem.

hordi commented 4 years ago

reserve(uint32_t num_elems) and rehash(uint32_t required_buckets): This code looks suspicious. If input size is big enough, may be, it'll be in "infinitive cycle". while (num_buckets < required_buckets) { num_buckets *= 2; }

ktprime commented 4 years ago

reserve(uint32_t num_elems) and rehash(uint32_t required_buckets): This code looks suspicious. If input size is big enough, may be, it'll be in "infinitive cycle". while (num_buckets < required_buckets) { num_buckets *= 2; }

if hash map the size > 2^31(huge), fix it by change uint64_t num_buckets.

ktprime commented 4 years ago

I have fixed some issuses as your point out and some also left.

now your hash map hrd is added into my benchmark ebench.cpp (emhash\bench\ebench.cpp) which is a very detail/full api test, it's different from other bench code use mutil test case. you can try it. some test result looks like as follow.

C:\WINDOWS\system32\cmd.exe /c (g++ -DET -std=c++17 -mtune=native -mavx2 -mfma -O3 -s -pipe ebench.cpp -o ebench.exe ^& ebench)

erase_find_half 17 phmap flat 100 19 martin flat 91 27 emhash6 63 27 emhash7 63 73 hrd set 23

erase_half 34 martin flat 100 48 emhash7 72 48 emhash6 71 48 phmap flat 71 71 hrd set 48

erase_reinsert 137 martin flat 100 138 phmap flat 99 144 hrd set 95 148 emhash6 92 151 emhash7 91

find_hit_all 35 emhash6 100 36 emhash7 98 47 hrd set 75 50 martin flat 70 55 phmap flat 64

find_hit_half 32 martin flat 100 36 emhash6 88 36 phmap flat 88 38 emhash7 83 63 hrd set 50

find_miss_all 26 phmap flat 100 26 martin flat 98 38 emhash6 68 41 emhash7 62 80 hrd set 32

insert_find_erase 139 hrd set 100 167 emhash7 83 188 phmap flat 74 206 emhash6 67 277 martin flat 50

insert_high_load 105 phmap flat 100 107 martin flat 98 120 emhash6 87 122 emhash7 85 152 hrd set 68

insert_no_reserve 138 hrd set 100 167 phmap flat 82 192 emhash7 71 199 emhash6 69 231 martin flat 59

insert_reserve 77 emhash7 100 83 hrd set 93 86 emhash6 89 138 phmap flat 56 235 martin flat 32

======== hash top1 top2 top3 ======================= emhash6 0.0 1.0 3 emhash7 4.0 1.0 0 hrd set 1.0 0.0 0 phmap flat 0.0 3.0 2 ======== hash score ================================ emhash6 79 emhash7 80 hrd set 68 martin flat 79 phmap flat 83

full test is in https://github.com/ktprime/emhash/blob/master/hash_bench.zip

hordi commented 4 years ago

Could you prepare VS project for this bench-project or add standard CMakefile.txt, I have VS compiler on Windows only.

Some remarks... 1.It is wrong to use __cplusplus >= 201103L macro - VS2010 doesn't have is_trivially* etc functions but has that c++ version on. 2.It is wrong to use (_MSC_VER > 1600) - that is IDE version, not actual compiler. It is possible to use VS2019 but project with Platform Toolset VS2010 or SDK7.1. I think, for MSVS compiler need to use some minimal Windows SDK Version, for example, 8.1 is available only in VS compiler that supports C++11 completely. As option I could propose this:

#ifdef _MSC_VER
#  ifndef _WIN32_WINNT
#    include <sdkddkver.h>
#  endif
#  if _WIN32_WINNT >= 0x0603 //SDK 8.1 and up
#    //define some C++11 types or macro
#  endif
#endif //_MSC_VER

3.Instead of duplicating functions (::insert etc) body with different input modifiers it is better to implement one only function like:

    template<typename K, typename V> 
    uint32_t insert_unique(K&& key, V&& value)
    {
        check_expand_need();
        auto bucket = find_unique_bucket(key);
        NEW_KVALUE(std::forward<K>(key), std::forward<V>(value), bucket);
        return bucket;
    }
ktprime commented 4 years ago

I'll give a CMakefile.txt soon, In fact I also use vs 2017 to bench. as a work around you can drag the ebench.cpp to your vs projcet and set include file.

try use clang++ and it's easy be integrated and used in vs tool. http://releases.llvm.org/9.0.0/LLVM-9.0.0-win64.exe

hordi commented 4 years ago

Take a look this function: ValueT set_get(const KeyT& key, const ValueT& value)

Next code overwrites current value by the NEW, and returns also NEW value. Expected behavior - return OLD value. const ValueT& old_value = GET_VAL(_pairs, bucket); GET_VAL(_pairs, bucket) = value; return old_value;

Next function (and may be some other with input iterator arithmetic) uses "end - begin" that limited type of input iterators (can't use std::list::iterator for example). It needs to be replaced to std::distance.

    template <typename Iter>
    void insert_unique(Iter begin, Iter end)
    {
        reserve(end - begin + _num_filled);
        for (; begin != end; ++begin) {
            insert_unique(*begin);
        }
    }

About ebench.cpp. so many lines use "emplace" function with input "const" object, but really it should be used or "insert" function or need to remove "const" in the loop and use std::move in that emplace-function but actually all these places about "insert"-functional.

Now:

for (const auto v : vList)
            sum += ahash.emplace(v, TO_VAL(0)).second;

Should be:

for (const auto& v : vList)
            sum += ahash.insert(v, TO_VAL(0)).second;
ktprime commented 4 years ago

other hash map do not suppor insert(key,value), so I have to use emplace to replace it. I have fixed some isssue.

ktprime commented 4 years ago

the full test code and CMakeLists.txt is in https://github.com/ktprime/emhash/blob/master/hash_bench.zip

hordi commented 4 years ago

will run it on 2 PC: XEON & Skylake CPUs...

About set_get. Are you sure it is good to copy that value every time? Actually, 1 redundant copy, if Value-type is non-trivial it'll be some perf degradation.

Now

            const ValueT old_value = GET_VAL(_pairs, bucket);
            GET_VAL(_pairs, bucket) = value;
            return old_value;

It is better something like this:

            ValueT old_value(value);
            std::swap( old_value, GET_VAL(_pairs, bucket));
            return old_value;
ktprime commented 4 years ago

fix issue