mediachain / overlayweaver

Overlay Weaver: An Overlay Construction Toolkit

Evaluate effectiveness of Hamming DHT #4

Open yusefnapora opened 8 years ago

yusefnapora commented 8 years ago

This issue is meant to be an outline of our testing and evaluation strategy for the Hamming DHT implementation and a place to collect notes and observations.

Initial testing thoughts

Here's the initial writeup that I posted to Slack about a week ago:


So, the first order of business is to replicate the results of the Hamming DHT evaluation to show that we've implemented it correctly.

Their data is based on applying the RHH function to the UCI Adult Data Set (CSV dumps: both adult.data and adult.test are used).

The dataset contains "profiles" for about 50 thousand adults, taken from census data. Each profile contains 14 attributes, some of which are numeric (e.g. age, hours worked per week), and some of which are "enum" types (e.g. occupation, race). It's not clear from the Hamming DHT paper exactly how the attributes are mapped into a vector representation. My guess is that they were treated as 14-dimensional vectors, with the range of each dimension determined by the number of enum values for that attribute. The range of the numeric attributes is less clear, but could either be guessed or determined by taking the MIN and MAX of the actual values in the dataset.
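To make that guess concrete, here's a minimal sketch of such a mapping, written Clojure-style to match the data examples later in this issue. The attribute names and enum value lists below are illustrative, not taken from the paper or the dataset verbatim:

```clojure
;; Illustrative only: one guess at mapping an Adult-dataset profile to a
;; fixed-order numeric vector. Enum attributes become the index of their
;; value in a value list; numeric attributes pass through unchanged.
(def enum-values
  {:occupation ["Tech-support" "Craft-repair" "Sales" "Exec-managerial"]
   :race       ["White" "Black" "Asian-Pac-Islander" "Amer-Indian-Eskimo"]})

(def attribute-order [:age :occupation :race :hours-per-week])

(defn attr->num [attr v]
  (if-let [vs (enum-values attr)]
    (.indexOf vs v) ; enum value -> its index; the dimension's range is
                    ; the number of enum values
    v))             ; numeric attrs (age, hours worked, ...) pass through

(defn profile->vector [profile]
  ;; profile is a map like {:age 39 :occupation "Sales" :race "White"
  ;;                        :hours-per-week 40}
  (mapv #(attr->num % (profile %)) attribute-order))
```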

At any rate, each profile vector was fed into the RHH function to produce a 128-bit key, which was stored in both Chord and Hamming-Chord DHTs. Tests were run against DHTs with 1,000 peers and 10,000 peers.

Each key obtained above is used as a search query to retrieve similar contents, and the results are evaluated using two criteria:

- how well the similar contents are clustered among neighboring nodes, and
- what percentage of the similar contents is retrieved within a given number of hops.

These two criteria are closely related; if the similar contents are highly clustered among neighboring nodes, it follows that a greater percentage of similar contents will be retrieved with a low number of hops.

First pass

I think for our purposes, we don't need to bother with the actual adult census data (and muck about with transforming it into vectors). We should be able to determine how effectively the contents are distributed using randomly generated keys.

So, the basic idea is to generate a set of random key/value pairs and store them on the DHT. For each of the thresholds we want to test (in the paper they used 0.7, 0.8 and 0.9), we do the following for every stored key:

- compute the set of all other stored keys whose Hamming similarity to it meets the threshold (the known matches for that key), and
- issue a DHT search with the key as the query, recording how many of the known matches come back at each hop.
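As a rough sketch of the ground-truth half of that, assuming 128-bit keys held as BigIntegers (none of these names exist in the codebase):

```clojure
;; Hamming similarity between two 128-bit keys: the fraction of bit
;; positions in which they agree.
(defn hamming-similarity [^java.math.BigInteger a ^java.math.BigInteger b]
  (- 1.0 (/ (.bitCount (.xor a b)) 128.0)))

(def rng (java.util.Random.))

(defn random-key []
  ;; a uniformly random non-negative 128-bit integer
  (java.math.BigInteger. 128 rng))

(defn known-matches
  "All keys in coll (other than k itself) whose similarity to k meets
   the threshold; counting them gives the :total-matches figure below."
  [k coll threshold]
  (filter #(>= (hamming-similarity k %) threshold)
          (remove #{k} coll)))
```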

In the end, for each search key we should have data similar to:

{:search-key 0xF00BA9
 :thresholds 
 {0.7 {:total-matches 73 :hop-results [32 20 11 5 3 1 1]}
  0.8 {:total-matches 42 :hop-results [20 8 7 5 2]}}}

where :total-matches is the number of known similar keys for that threshold, and :hop-results is a vector giving the number of results returned at each hop.

From that we should be able to plot the frequency distribution of similar keys and calculate the recall percentage for various hop counts, as in the paper.
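For instance, a minimal sketch of the recall calculation from one of those threshold entries (recall-by-hop is a hypothetical name, assuming the map structure shown above):

```clojure
(defn recall-by-hop
  "Cumulative fraction of the known matches retrieved after each hop."
  [{:keys [total-matches hop-results]}]
  (mapv #(/ % (double total-matches))
        (reductions + hop-results)))

;; (recall-by-hop {:total-matches 73 :hop-results [32 20 11 5 3 1 1]})
;; => approximately [0.44 0.71 0.86 0.93 0.97 0.99 1.0]
```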

pHash test

Assuming we get decent results using random keys, we can then run essentially the same test on the pHash image dataset and see how it stacks up. The pHash keys will likely not be uniformly distributed across the ID space, so we'll have to determine what kind of impact that has on e.g. load balancing (which the Hamming DHT paper does not address apart from some hand-waving).
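One rough way to quantify the load-balancing impact, assuming we can ask the test harness which node is responsible for each key (node-for-key is a hypothetical lookup, not an existing function):

```clojure
(defn load-distribution
  "Map of node -> number of keys it stores."
  [node-for-key ks]
  (frequencies (map node-for-key ks)))

(defn imbalance-ratio
  "Max keys on any single node divided by the mean keys per node;
   1.0 would be a perfectly even spread."
  [distribution]
  (let [counts (vals distribution)]
    (/ (apply max counts)
       (/ (reduce + counts) (double (count counts))))))
```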

Progress

Since writing the above we've done the following:

To elaborate on the last point, the tests I've run to date have not been able to demonstrate a clear advantage in either the number of hops or the distribution of keys. There are a few possible explanations for this:

If no advantage does exist here, it's also possible that the Hamming DHT is only effective when the RHH hash function is used, contrary to our assumptions. That would be unfortunate... It's probably worth generating some random vectors and hashing them with RHH to see whether the results are more in line with the paper.

yusefnapora commented 8 years ago

Most pressing current needs

I feel like the most important things right now are: