greg7mdp / parallel-hashmap

A family of header-only, very fast and memory-friendly hashmap and btree containers.
https://greg7mdp.github.io/parallel-hashmap/
Apache License 2.0
2.46k stars 232 forks source link

Custom pointer type support ? #87

Open kleunen opened 3 years ago

kleunen commented 3 years ago

It seems custom pointer types are not supported with the parallel hashmap: https://github.com/greg7mdp/parallel-hashmap/blob/master/parallel_hashmap/phmap.h#L843-L846

Would this be difficult to add ? Because I would like to apply the parallel hashmap to a project to process openstreetmap data. This involves loading several billions of geometrical data. We do this by loading the the data into a boost unordered_map backed by a boost interprocess mmap file. This way, the data is hashed but backed by a file on filesystem. This is required, because converting the planet actually involved processing 60G of compressed data.

The loading process is now completely single threaded, but ideally, we would like to load the node/ways/relations store with openstreetmap from the planet osm file in parallel. Parallel hashmap provides an interesting approach for this, but this approach would require using the boost interprocess offset pointer.

https://github.com/systemed/tilemaker

Is there a particular reason why custom pointer types are not allowed ?

greg7mdp commented 3 years ago

This is a restriction I inherited from the original Abseil source code. Can you tell me what kind of pointer type you would like to use?

kleunen commented 3 years ago

Boost Interprocess Offset pointer: https://www.boost.org/doc/libs/1_75_0/doc/html/interprocess/offset_ptr.html

Please observe the following: One of the big problems of offset_ptr is the representation of the null pointer. The null pointer can't be safely represented like an offset, since the absolute address 0 is always outside of the mapped region. Due to the fact that the segment can be mapped in a different base address in each process the distance between the address 0 and offset_ptr is different for every process.

Some implementations choose the offset 0 (that is, an offset_ptr pointing to itself) as the null pointer pointer representation but this is not valid for many use cases since many times structures like linked lists or nodes from STL containers point to themselves (the end node in an empty container, for example) and 0 offset value is needed. An alternative is to store, in addition to the offset, a boolean to indicate if the pointer is null. However, this increments the size of the pointer and hurts performance.

In consequence, offset_ptr defines offset 1 as the null pointer, meaning that this class can't point to the byte after its own this pointer:

greg7mdp commented 3 years ago

Thanks! Maybe it is doable to support this. I'm wondering how I could test it. Would you have a small example with a boost unordered_map/set using a custom pointer type allocator?

kleunen commented 3 years ago

Yes, i can make small example

kleunen commented 3 years ago

This is the easy case: https://wandbox.org/permlink/uFzkywyr2StzUq4i

But we actually need scoped allocation with piecewise construction as well, i can add example of this later.

greg7mdp commented 3 years ago

Great, thanks @kleunen, let me look at that first!

kleunen commented 3 years ago

This is with a scoped allocation example: https://wandbox.org/permlink/08K4Pd3wVUcscwJG

Vector in map using a scoped allocator

greg7mdp commented 3 years ago

Thanks. Please give me a couple days to look into it.

kleunen commented 3 years ago

I cleaned up the example a little bit more: https://wandbox.org/permlink/21FjLvAenx9j76P9

kleunen commented 3 years ago

Hello @greg7mdp, do you think this is possible ? Or are there any show-stopping issues ?

greg7mdp commented 3 years ago

Hum, I'm not sure. I did try your example with a change in phmap but I got a very strange error. I need to look more into it.

image

kleunen commented 3 years ago

What was the error you got ?

kleunen commented 3 years ago

I actually worked around the issue now, by implementing a custom allocator, which keeps the mmap region open, even after resizing. This way, the pointers don't need to be offset pointers, because the old region also stayed mapped.

I do still think it would be useful if parallel-hashmap supports custom pointers. I think you should consistently use Alloctor::pointer and not convert to raw pointers.

greg7mdp commented 3 years ago

I think you should consistently use Alloctor::pointer and not convert to raw pointers.

How can I do that since I allocate the hashmap array itself using the allocator and need to access the entries into this array?

kleunen commented 3 years ago

You don't actually have to manage the array yourself, you can use std::vector to allocate the array of pairs for you. You can make the actual storage container a template argument:

using entry_t = std::pair<int, int>;
using entry_allocator_t = std::allocator< std::pair<int, int> >;

// Storing entries
template<class T, class A>
using bucket_storage_t = std::vector<T, A>;

int main()
{   
    bucket_storage_t<entry_t, entry_allocator_t> entries(10);
    for(int i = 0; i < 10; ++i)
        entries[i] = std::make_pair(i, i); 
}

In the cases of the parallel hash map, you also need a storage type and allocator for the vector of vectors:

#include <utility>
#include <vector>

// Allocate entries 
using entry_t = std::pair<int, int>;
using entry_allocator_t = std::allocator< std::pair<int, int> >;
using bucked_list_allocator_t = std::allocator< std::vector< std::pair<int, int> > >;

// Storing entries
template<class T, class A>
using bucket_storage_t = std::vector<T, A>;

// Storing list of buckets
template<class T, class A>
using bucket_list_storage_t = std::vector< bucket_storage_t<T, entry_allocator_t>, A>;

int main()
{   
    bucket_list_storage_t<entry_t, bucked_list_allocator_t> bucket_list(10);
    for(int i = 0; i < 10; ++i) 
        bucket_list.resize(100);
} 

But possibly you can also use a deque for this? No wait. The outer list is a static array in your case.