Closed zietzm closed 5 years ago
As an example, using Zachary's karate club network with multiplier=10000
, we find that it does not affect the swap outcomes, but significantly affects computation time.
Using slow hash table 43.5897 % unchanged 5.205 seconds
Using fast hash table 43.5897 % unchanged 0.134 seconds
Or using a protein-protein interaction graph due to Vidal and Ma'ayan with multiplier=10
Using slow hash table 1.1894 % not swapped 98.145 seconds
Using fast hash table 1.1894 % not swapped 0.025 seconds
In the previous comment, does "fast hash table" refer to the "Cantor pair hash table" and "slow hash table" refer to the std::unordered_set
hash table?
That is basically correct, though I have found that std::set
actually gives better performance than std::unordered_set
. I am not sure why, but std::unordered_set
had been giving insertion performance much closer to the worst case (O(n)) than to the best case (constant). Since std::set
uses a self-balancing binary search tree (O(log(n))), log(n) is proving to be faster than the worst-case hash table. I am looking into the possibility that a better, custom-specified hash function would improve performance on the hash table. Alternatively, it may be that frequent table resizing is slowing down the hash table (unordered_set
) implementation. This could be remedied by pre-allocating a larger amount of memory.
Here's a quick rundown of the differences in implementations between set
and unordered_set
:
https://www.geeksforgeeks.org/set-vs-unordered_set-c-stl/
cd5fea4e0958f32812f1919109f798fea6c1b82f closes #7
In trying to apply the XSwap method to other networks, I have discovered that this implementation fails for several publicly-available networks. A few improvements address most of the underlying issues.
First, I added a preprocessing file to process files containing graphs. The underlying C++ functionality deals with edges represented as
int*
s. Since permutable graphs may be in files as integers or strings, (eg:node_a,node_b\nnode_c,node_d\n
...), it is helpful to be able to process such graphs to a form that is usable by the fast XSwap implementation. Moreover, assigning new ids to the nodes allows us to ensure that nodes are indexed 1, 2, ...,num_edges
. This kind of indexing allows larger graphs to be memory-feasible using the fast Cantor pair hash table.Second, I added an additional hash table implementation using the C++ standard library
std::unordered_set
. While this is significantly slower than the Cantor pair hash table, it will work for networks with too many potential edges for the original hash table to fit in memory. (I have set the cutoff at 4 GB, though this was a totally arbitrary choice.) To ensure a uniform API, I added a final hash table wrapper, theHashTable
class, which can call both the originalEdgeHashTable
and the newBigHashTable
.