Expand swap capability and reorganize package

zietzm commented 5 years ago

In trying to apply the XSwap method to other networks, I have discovered that this implementation fails for several publicly-available networks. A few improvements address most of the underlying issues.

First, I added a preprocessing file to process files containing graphs. The underlying C++ functionality deals with edges represented as int*s. Since permutable graphs may be in files as integers or strings, (eg: node_a,node_b\nnode_c,node_d\n...), it is helpful to be able to process such graphs to a form that is usable by the fast XSwap implementation. Moreover, assigning new ids to the nodes allows us to ensure that nodes are indexed 1, 2, ..., num_edges. This kind of indexing allows larger graphs to be memory-feasible using the fast Cantor pair hash table.

Second, I added an additional hash table implementation using the C++ standard library std::unordered_set. While this is significantly slower than the Cantor pair hash table, it will work for networks with too many potential edges for the original hash table to fit in memory. (I have set the cutoff at 4 GB, though this was a totally arbitrary choice.) To ensure a uniform API, I added a final hash table wrapper, the HashTable class, which can call both the original EdgeHashTable and the new BigHashTable.

zietzm commented 5 years ago

As an example, using Zachary's karate club network with multiplier=10000, we find that it does not affect the swap outcomes, but significantly affects computation time.

Using slow hash table 43.5897 % unchanged 5.205 seconds

Using fast hash table 43.5897 % unchanged 0.134 seconds

Or using a protein-protein interaction graph due to Vidal and Ma'ayan with multiplier=10

Using slow hash table 1.1894 % not swapped 98.145 seconds

Using fast hash table 1.1894 % not swapped 0.025 seconds

dhimmel commented 5 years ago

In the previous comment, does "fast hash table" refer to the "Cantor pair hash table" and "slow hash table" refer to the std::unordered_set hash table?

zietzm commented 5 years ago

That is basically correct, though I have found that std::set actually gives better performance than std::unordered_set. I am not sure why, but std::unordered_set had been giving insertion performance much closer to the worst case (O(n)) than to the best case (constant). Since std::set uses a self-balancing binary search tree (O(log(n))), log(n) is proving to be faster than the worst-case hash table. I am looking into the possibility that a better, custom-specified hash function would improve performance on the hash table. Alternatively, it may be that frequent table resizing is slowing down the hash table (unordered_set) implementation. This could be remedied by pre-allocating a larger amount of memory.

Here's a quick rundown of the differences in implementations between set and unordered_set: https://www.geeksforgeeks.org/set-vs-unordered_set-c-stl/

zietzm commented 5 years ago

cd5fea4e0958f32812f1919109f798fea6c1b82f closes #7

hetio / xswap

Expand swap capability and reorganize package #6