This issue is meant to be an outline of our testing and evaluation strategy for the Hamming DHT implementation and a place to collect notes and observations.
## Initial testing thoughts
Here's the initial writeup from about a week ago that I posted to Slack:
So, the first order of business is to replicate the results of the Hamming DHT evaluation to show that we've implemented it correctly.
Their data is based on applying the RHH function to the UCI Adult Data Set (CSV dumps: both `adult.data` and `adult.test` are used).
The dataset contains "profiles" for about 50 thousand adults, taken from census data. Each profile contains 14 attributes, some of which are numeric (e.g. age, hours worked per week, etc), and some are "enum" types (e.g. occupation, race, etc). It's not clear from the Hamming DHT paper exactly how the attributes are mapped into a vector representation. I'm guessing that they were treated as 14-dimensional vectors, with the range of each dimension determined by the number of enum values for that attribute. The range of the numeric attributes is less clear, but could either be guessed or determined by taking the MIN and MAX of the actual values in the dataset.
At any rate, each profile vector was fed into the RHH function to produce a 128-bit key, and stored in Chord and Hamming-Chord DHTs. Tests were run against DHTs with 1,000 peers and 10,000 peers.
Each key obtained above is used as a search query to retrieve similar contents, and the results are evaluated using two criteria:

- the distribution of similar contents between "neighbors" in the routing circle
- the % of similar contents retrieved within a given number of "hops" from the target node
These two criteria are closely related; if the similar contents are highly clustered among neighboring nodes, it follows that a greater percentage of similar contents will be retrieved with a low number of hops.
## First pass
I think for our purposes, we don't need to bother with using the actual adult census data (and muck about with transforming it into vectors). We should be able to determine how effectively the contents are distributed using randomly generated keys.
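Concretely, "similar within a threshold" can be read as normalized Hamming similarity over the 128-bit keys. That reading is an assumption on our part (the paper may define it differently), and the sketch below is Python for illustration even though the project code is Clojure:

```python
# One plausible reading of "similar within a threshold": normalized Hamming
# similarity over 128-bit keys. This is an assumption -- the paper may
# define similarity differently.

KEY_BITS = 128

def hamming_similarity(a: int, b: int, bits: int = KEY_BITS) -> float:
    """1.0 for identical keys, 0.0 when every bit differs."""
    return 1.0 - bin(a ^ b).count("1") / bits

def similar(a: int, b: int, threshold: float) -> bool:
    return hamming_similarity(a, b) >= threshold
```

Under this reading, a 0.9 threshold on 128-bit keys means at most 12 differing bits.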
So, the basic idea is to generate a set of random key/value pairs and store them on the DHT. For each of the thresholds we want to test (in the paper they used 0.7, 0.8 and 0.9), we do the following:

1. For each key in the dataset, locally determine the set of similar keys within the threshold.
2. Query the DHT using the `get-similar` command with the search key and threshold, starting with a hop count of 0. Record the number of similar results returned.
3. Increase the hop count by 1 and record the number of new results returned.
4. Repeat until we've retrieved all the known similar keys.
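The steps above can be sketched as follows (Python for illustration; `dht_get_similar` is a hypothetical stand-in for the DHT's `get-similar` command, and the `similar` predicate assumes normalized Hamming similarity):

```python
# Sketch of the per-key test loop. `dht_get_similar(key, threshold, max_hops)`
# is a hypothetical stand-in for the DHT's get-similar command.

def similar(a: int, b: int, threshold: float, bits: int = 128) -> bool:
    # Assumed similarity measure: normalized Hamming similarity.
    return 1.0 - bin(a ^ b).count("1") / bits >= threshold

def ground_truth(key, dataset, threshold):
    # Step 1: locally determine the set of similar keys.
    return {k for k in dataset if k != key and similar(k, key, threshold)}

def run_query(key, dataset, threshold, dht_get_similar):
    expected = ground_truth(key, dataset, threshold)
    found, hop_results, hops = set(), [], 0
    while found != expected:
        # Steps 2-4: widen the search one hop at a time, counting only
        # results not already seen at a lower hop count.
        results = set(dht_get_similar(key, threshold, max_hops=hops))
        hop_results.append(len(results - found))
        found |= results
        hops += 1
    return {"total-matches": len(expected), "hop-results": hop_results}
```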
In the end, for each search key we should have data similar to:
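A minimal sketch of that map (the values here are invented for illustration):

```clojure
{:total-matches 12
 :hop-results   [5 3 2 1 1]}
```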
where `:total-matches` is the number of known similar keys for that threshold, and `:hop-results` is a vector of the number of results returned for each hop.
From that we should be able to plot the frequency distribution of similar keys and calculate the recall percentage for various hop counts, as in the paper.
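The recall calculation is just the cumulative sum of `:hop-results` divided by `:total-matches`. A quick sketch (Python for illustration; the analysis code itself is Clojure):

```python
# Turn a per-key result into cumulative recall per hop count, as used for
# the recall plots in the paper.
from itertools import accumulate

def recall_by_hops(total_matches: int, hop_results: list) -> list:
    """Fraction of known similar keys retrieved within h hops, for h = 0, 1, ..."""
    return [n / total_matches for n in accumulate(hop_results)]
```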
## pHash test
Assuming we get decent results using random keys, we can then do essentially the same test on the pHash image dataset and see how that stacks up. It's likely that the pHash keys will not be uniformly distributed across the ID space, so we'll have to determine what kind of impact that will have on e.g. load balancing (which is not addressed in the Hamming DHT paper apart from some hand-waving).
## Progress
Since writing the above we've done the following:

- added the `emu` Clojure namespace to create emulated networks for testing at the REPL
- implemented Clojure functions for generating IDs with a distribution of similarity ranges
- refined the "hop counting" strategy to tag retrieved results with the number of hops required, which is much faster than iteratively increasing the hop count
- run several preliminary tests at the REPL with as-yet inconclusive results
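One way to generate IDs with a controlled similarity to a base key is to flip a chosen number of distinct random bit positions. This is a guess at the approach (the actual Clojure generators may differ), sketched in Python:

```python
# Generate a key at an exact Hamming distance from a base key by flipping
# `n_flips` distinct random bit positions. Hypothetical sketch; the real
# Clojure ID generators may work differently.
import random

KEY_BITS = 128

def key_at_distance(base: int, n_flips: int, rng=random) -> int:
    """Return a key at Hamming distance exactly `n_flips` from `base`."""
    key = base
    for pos in rng.sample(range(KEY_BITS), n_flips):
        key ^= 1 << pos
    return key
```

Drawing `n_flips` from a chosen distribution then yields keys spread across similarity ranges.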
To elaborate on the last point, the tests I've run to date have not been able to demonstrate a clear advantage in either the number of hops or the distribution of keys. There are a few possible explanations for this:

- The algorithm that extends the `getSimilar` search to neighboring nodes is flawed, and doesn't contact the true "neighbors" of the responsible nodes.
- The generated datasets don't have enough similar keys for trends to be visible / significant. This seems likely, as the majority of generated keys have fewer than 3 similar keys in the datasets I've been testing with. Tweaking the distribution of generated keys and/or only testing keys with multiple similar keys might help.
- The analysis of the test results is flawed / misleading.
- No advantage exists for HammingChord vs Chord.
If it is the case that no advantage exists, it's also possible the Hamming DHT is only effective when the RHH hash function is used, contrary to our assumptions. That would be unfortunate... It's probably worth generating some random vectors and hashing them with RHH to see if the results are more in line with the paper.