Right now, we have an implemented algorithm for gossiping torrent popularity and want to:
evaluate its results;
get an idea of the statistics of Tribler usage "in the wild" based on these results.
In my opinion, this is not an easy task. We use this algorithm as a telescope to look at the world around us, but we have no idea whether this telescope works or is broken:
The algorithm may have conceptual problems, for example slow information propagation, because of which its results may not reflect the real state of the network;
Also, the implementation of the algorithm may contain bugs that distort the results.
As a possible solution, I propose to implement an alternative algorithm that uses a different strategy, and then compare the results of the two algorithms.
It is not about the new algorithm being better. The point is that by measuring popularity in two independent ways, we can see a more objective picture. As Johan said, the research community has worked on these ideas for decades, so it is unlikely that I can come up with a better algorithm that is robust, scalable, bias-free, and also protected from possible attacks. I just want to have a second telescope to check the results of the first one.
In the following explanation, I will call the current algorithm PC1 (Popularity Community 1), and the new proposed algorithm PC2 (Popularity Community 2).
First, a few words about the current algorithm, PC1. Looking into its implementation, I noticed the following:
PC1 relies heavily on the seeders count received from torrent trackers. In my understanding, we want Tribler to be independent of any external services and self-sufficient, so it may be wrong to use the information received from trackers as the primary source of popularity data.
The second source of information is BEP33 data. It is better, but it is the result of an external algorithm implemented by a libtorrent plugin. We probably don't know how reliable this data is, so we cannot use it as the main source either.
The third source of data is the results obtained from the DHT. If I understand correctly, it may report only a subset of seeders and not the full count.
And now, to my main observation: the current algorithm cannot combine data obtained from different sources. If two trackers are specified for the same torrent and they return different numbers of seeders (say, 1000 and 1500), we cannot sum these numbers up, as some seeders may be connected to both trackers, and we would count them twice. The same applies to the results from other sources.
So, when the current algorithm gets results from different data sources, it simply takes the maximum of the received numbers. And when this number is gossiped to another peer, it does not combine with the data that peer already has but replaces it entirely.
The alternative algorithm I want to propose works in a totally different way:
it counts the real number of a torrent's unique seeders to whom we (or some other node) were able to connect. This number is not the total swarm size, but it represents real peers and, in that sense, looks more reliable.
it can combine the unique seeders observed by different nodes, so by merging these counts we can estimate the total swarm size pretty well.
when no Tribler node, for whatever reason, manages to connect to some torrent and observe real seeders, this torrent will not appear in the popularity result list (which is probably a good thing?), even if the torrent's tracker reports some number of seeders.
using the gathered data, it is possible to count the total number of peers that seed at least one torrent and use this number as an estimate of the overall system liveness.
the algorithm has inherent protection from one type of attack: an attacker can't propagate a zero seeders count for some torrent, as the unique seeders observed by different nodes are merged together, and it is not possible to decrease the counter's value.
To implement this algorithm, we can use a HyperLogLog counter of seeders for each infohash (initially, I wanted to use bloom filters, but Vadim suggested using HyperLogLog counters, and I really liked that idea). The HyperLogLog counter has the following characteristics (illustrated by the sketch after this list):
it can count distinct values. If we put the same hashed seeder id into it several times, the counter's value will not increase;
while HyperLogLog does not keep the exact number of unique elements, it can calculate a probabilistic approximation of the total element count with high precision;
it is possible to merge two HyperLogLog counters, so we can combine the numbers of unique seeders obtained from different nodes, as well as compute the total number of unique seeders across all torrents;
it is pretty compact (much smaller than a bloom filter with a similar capacity and precision).
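To make these properties concrete, here is a minimal sketch of how two nodes' counters for the same infohash could be combined. It uses the datasketch package purely as an illustration; PC2 would not be tied to any particular HyperLogLog implementation, and the seeder ids below are made up.

```python
# A minimal illustration of the HyperLogLog properties listed above,
# using the `datasketch` package (pip install datasketch) as one possible implementation.
from datasketch import HyperLogLog

node_a = HyperLogLog(p=12)  # counter kept by node A for one infohash
node_b = HyperLogLog(p=12)  # counter kept by node B for the same infohash

# Node A saw seeders 1 and 2; adding seeder 1 twice does not inflate the estimate.
for seeder_id in (b"seeder-1", b"seeder-2", b"seeder-1"):
    node_a.update(seeder_id)

# Node B saw seeders 2 and 3 (seeder 2 overlaps with node A's observations).
for seeder_id in (b"seeder-2", b"seeder-3"):
    node_b.update(seeder_id)

# Merging combines the two views without double-counting the shared seeder.
node_a.merge(node_b)
print(round(node_a.count()))  # approximately 3 unique seeders
```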
Now to the (simplified) description of the PC2 algorithm:
Each Tribler node with the PC2 community stores in memory a dictionary of the form {infohash: seeders_counter}, where seeders_counter is the HyperLogLog counter of unique seeders for that infohash. This dictionary stores only infohashes for which the number of seeders is greater than zero, so no dead torrents here;
Also, there is a helper structure to find a random infohash in this dictionary efficiently.
When our node has connections to some torrent's seeders, it regularly puts the seeders' hashes into the HyperLogLog counter of that torrent. As only distinct values are counted, the counter value grows only when new seeders appear.
At regular intervals, the algorithm selects two random pairs (infohash, seeders_counter) from the dictionary and decides which counter value to propagate to neighbor peers. In most cases (say, 90%), it chooses the counter whose estimated value is bigger, but with some small probability (10%), it selects the other one. After that, the node sends the chosen pair (infohash, serialized_seeders_counter) to one random peer.
The peer, after receiving this data, merges the received counter with the previous counter it had for the specified infohash, or, in the case of a new infohash, creates a new entry with the received HyperLogLog counter value (a rough sketch of these steps follows this list).
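Here is a rough sketch of how these steps could fit together. All names (PC2Store, GOSSIP_BIAS, the method names) are placeholders invented for this illustration, not existing Tribler APIs, and serialization of the counter is left out.

```python
import random
from datasketch import HyperLogLog  # one possible HyperLogLog implementation

GOSSIP_BIAS = 0.9  # probability of gossiping the larger of the two sampled counters


class PC2Store:
    """In-memory state of one PC2 node: live infohashes and their seeder counters."""

    def __init__(self):
        self.counters = {}    # {infohash: HyperLogLog}
        self.infohashes = []  # helper list for efficient random selection

    def observe_seeder(self, infohash, seeder_id):
        """Record a seeder we actually connected to; duplicates do not inflate the count."""
        if infohash not in self.counters:
            self.counters[infohash] = HyperLogLog(p=12)
            self.infohashes.append(infohash)
        self.counters[infohash].update(seeder_id)

    def pick_counter_to_gossip(self):
        """Sample two random entries and usually prefer the one with the larger estimate."""
        if len(self.infohashes) < 2:
            return None
        a, b = random.sample(self.infohashes, 2)
        if self.counters[a].count() < self.counters[b].count():
            a, b = b, a  # now `a` holds the larger estimate
        chosen = a if random.random() < GOSSIP_BIAS else b
        return chosen, self.counters[chosen]  # to be serialized and sent to one random peer

    def merge_received(self, infohash, received_counter):
        """Merge a counter received from a peer with the one we already have, if any."""
        if infohash in self.counters:
            self.counters[infohash].merge(received_counter)  # assumes the same precision p
        else:
            self.counters[infohash] = received_counter
            self.infohashes.append(infohash)
```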
Following this algorithm, eventually each Tribler node will contain a total count of unique seeders for each live torrent. But, as a HyperLogLog counter cannot decrease, we need to clear obsolete information somehow. To do this, we can split the data into time buckets (for example, 1-hour intervals) and gather information for each time interval separately. In this case, the data structure will look like {time_interval: {infohash: seeders_counter}}, and the packet transferred to peers will contain the time_interval as well. For the user in the GUI, we can display a value interpolated between the previous and the current time intervals.
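The time-bucketed variant and the interpolation for display could look roughly like the sketch below; the 1-hour bucket size comes from the example above, while the function names and the linear interpolation are just one illustrative choice.

```python
import time
from datasketch import HyperLogLog

BUCKET_SECONDS = 3600  # 1-hour intervals, as in the example above

# {time_interval: {infohash: HyperLogLog}}
buckets = {}


def current_interval(now=None):
    """Index of the time bucket the given moment falls into."""
    return int((now if now is not None else time.time()) // BUCKET_SECONDS)


def observe_seeder(infohash, seeder_id, now=None):
    """Put an observed seeder into the counter of the current time bucket."""
    counters = buckets.setdefault(current_interval(now), {})
    if infohash not in counters:
        counters[infohash] = HyperLogLog(p=12)
    counters[infohash].update(seeder_id)


def displayed_seeders(infohash, now=None):
    """Value shown in the GUI: linear interpolation between the previous and current buckets."""
    now = now if now is not None else time.time()
    interval = current_interval(now)
    progress = (now % BUCKET_SECONDS) / BUCKET_SECONDS  # 0.0 .. 1.0 within the bucket
    prev = buckets.get(interval - 1, {}).get(infohash)
    curr = buckets.get(interval, {}).get(infohash)
    prev_count = prev.count() if prev else 0.0
    curr_count = curr.count() if curr else 0.0
    return (1 - progress) * prev_count + progress * curr_count
```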
When the node shuts down for an extended period (say, a week), we still want to display popularity information to the user after the node is relaunched. To do this, before shutting the system down, we can store the estimated seeders count in the TorrentState entity and use it as the previous value for interpolation when the node launches again.
Now about possible attacks. There are two types of attacks that look especially relevant here. The first is an attack in which some node wants to suppress the spread of information about some popular torrent. As I already said, if there is at least one uncompromised path between good nodes, the evil node will not be able to suppress the information spread, as the same information will be propagated through a different path.
The second attack, in which some node wants to inflate the seeders count for some torrent, is harder to prevent with the proposed algorithm (but it seems to be a rarer type of attack). It is possible to "pollute" the counter with a big number of random hashes to give the impression that there are a lot of seeders. As a counter-measure, it is possible to add a second HyperLogLog counter that counts the "disappointment" of nodes that attempt to connect to a supposedly popular torrent only to find a low number of seeders. Then we can apply some heuristic to decide what to do with suspicious torrents.
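As a very rough sketch of that counter-measure (the ratio and the threshold are arbitrary values picked only to show the shape of such a heuristic, and the function name is invented):

```python
from datasketch import HyperLogLog

SUSPICION_THRESHOLD = 0.5  # arbitrary value for illustration


def is_suspicious(seeders: HyperLogLog, disappointments: HyperLogLog) -> bool:
    """Flag a torrent when too many nodes failed to find the advertised seeders."""
    claimed = seeders.count()
    disappointed = disappointments.count()
    return claimed > 0 and disappointed / claimed > SUSPICION_THRESHOLD
```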
But as I said, right now my proposal is to have the second popularity algorithm, PC2, just to cross-check the data obtained from the current algorithm, PC1. It is not necessary to display PC2 results to end users, so at this moment nobody will have an incentive to attack it.
So what do you think: is implementing an alternative algorithm a good way to understand the results we have from the current algorithm?