greg7mdp / parallel-hashmap

A family of header-only, very fast and memory-friendly hashmap and btree containers.
https://greg7mdp.github.io/parallel-hashmap/
Apache License 2.0

phmap_dump saving a lot more than needed #226

Closed iliasaz closed 8 months ago

iliasaz commented 8 months ago

Hi! First of all - this is an awesome library, and thank you for developing and supporting it!! I'm developing a swift binding for it, and I'll open source it as soon as it's ready.

In the meantime, I noticed a weird issue with saving a map. Please see the test case below. Basically, if I create a large map and save it, then clear it, create a much smaller map, and save it again, the second file is as large as the first one. But if I just create a small map and save it, I get an adequate file size. Is there anything I'm not doing correctly? Thanks!

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <parallel_hashmap/phmap.h>
#include <parallel_hashmap/phmap_dump.h>

using namespace std;
using phmap::parallel_flat_hash_map;

// Pass the map by reference to avoid copying 100M entries on each call.
void saveMap(string filePath, parallel_flat_hash_map<uint64_t, uint32_t>& map) {
    cout << "saving to file " << filePath << endl;
    phmap::BinaryOutputArchive ar_out(filePath.c_str());
    map.phmap_dump(ar_out);
    cout << "finished saving" << endl;
}

ifstream::pos_type getFileSize(const char* filename) {
    std::ifstream in(filename, std::ifstream::ate | std::ifstream::binary);
    return in.tellg();
}

void testClearSave() {
    parallel_flat_hash_map<uint64_t, uint32_t> map;
    int n = 100*1000*1000;

    cout << "generating " << n << " map values\n";
    map.reserve(n);
    for (int i = 0; i < n; i++) {
        int v = rand();
        map[uint64_t(v)] = uint32_t(v);
    }
    cout << "map count: " << map.size() << "\n";

    string filePath100m = "/Users/ilia/Downloads/dump_100m.bin";
    saveMap(filePath100m, map);
    cout << "100m file size: " << getFileSize(filePath100m.c_str()) << endl;

    map.clear();
    cout << "cleared map. new map size: " << map.size() << endl;
    cout << "generating " << 1000 << " map values\n";
    map.reserve(1000);
    for (int i = 0; i < 1000; i++) {
        int v = rand();
        map[uint64_t(v)] = uint32_t(v);
    }
    cout << "new map size: " << map.size() << endl;

    string filePath1k = "/Users/ilia/Downloads/dump_100m.bin";
    saveMap(filePath1k, map);
    cout << "1k file size: " << getFileSize(filePath1k.c_str()) << endl;
}

int main(int argc, const char * argv[]) {
    testClearSave();
    return 0;
}

Here is the output:

generating 100000000 map values
map count: 100000000
saving to file /Users/ilia/Downloads/dump_100m.bin
finished saving
100m file size: 2281701768

cleared map. new map size: 0
generating 1000 map values
new map size: 1000
saving to file /Users/ilia/Downloads/dump_100m.bin
finished saving
1k file size: 2281701768
greg7mdp commented 8 months ago

Good question, @iliasaz , and thank you for the kind words.

What you observe is expected, as phmap_dump is a phmap extension that saves the whole memory array as-is to disk, so the file will always be the full reserved size. The benefit is that loading is very fast, because the entries do not have to be inserted into the hash map one by one. It also requires the entries to be trivially copyable.

If you prefer the classic behavior, it is easy to save and load the map by writing a list of entries to disk and then reinserting them into a new map at load time. You may look at serialize.cc in the examples directory, which uses the cereal library to do just that.

iliasaz commented 8 months ago

My map is <uint64,uint32>, which I believe is trivially_copyable. And I definitely need phmap_dump way of saving the entire memory state. But the boundaries of that memory should have changed after clear followed by the second reserve, should they not? Is there anything I can do to shrink or completely deallocate that memory?

greg7mdp commented 8 months ago

My map is <uint64,uint32>, which I believe is trivially_copyable

yes, definitely!

But the boundaries of that memory should have changed after clear followed by the second reserve

no, that would have worked for flat_hash_map, but not for the parallel version. Instead of map.clear();, can you try:

parallel_flat_hash_map<uint64_t, uint32_t>().swap(map);

Thanks!

iliasaz commented 8 months ago

Duh... I forgot about this neat C++ trick with the swap 😀 It does work, and the saved file has a normal size now. Thanks a lot!!