kozlovsky / tribler

Privacy enhanced BitTorrent client with P2P content discovery
http://www.tribler.org
GNU Lesser General Public License v3.0

[Draft 2] An alternative algorithm for calculating torrent popularity #2

Open kozlovsky opened 4 years ago

kozlovsky commented 4 years ago

Right now, we have an implemented algorithm for gossiping torrent popularity and want to:

In my opinion, this is not an easy task. We use this algorithm as a telescope to look at the world around us, and we have no idea whether the telescope is working or broken:

As a possible solution, I propose to implement an alternative algorithm that uses a different strategy, and then compare the results of the two algorithms.

In the following explanation, I will call the current algorithm PC1 (Popularity Community 1) and the newly proposed algorithm PC2 (Popularity Community 2).

First, a few words about the current algorithm, PC1. Looking into its implementation, I noticed the following:

And now, to my main observation: the current algorithm cannot combine data obtained from different sources. If two trackers are specified for the same torrent and they return different numbers of seeders (like 1000 and 1500), we cannot sum these numbers up, as some seeders may be connected to both trackers, and we would count them twice. The same applies to results from other sources.

So the current algorithm, when it gets results from different data sources, just takes the max of the received numbers. And when these numbers are gossiped to another peer, they are not combined with the data that peer already has but replace it entirely.
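To make the double-counting problem concrete, here is a minimal sketch with made-up peer IDs. The tracker contents and function names are assumptions for illustration only, not Tribler code. It compares summing, taking the max (what PC1 does across sources), and counting the union of unique seeders.

```python
# Hypothetical data: two trackers for the same torrent that share 500 seeders.
tracker_a = {f"peer{i}" for i in range(1000)}        # tracker A reports 1000 seeders
tracker_b = {f"peer{i}" for i in range(500, 2000)}   # tracker B reports 1500 seeders


def combine_by_sum(a: set, b: set) -> int:
    # Counts the shared seeders twice.
    return len(a) + len(b)


def combine_by_max(a: set, b: set) -> int:
    # Never overcounts, but ignores seeders known only to the smaller source.
    return max(len(a), len(b))


def combine_by_union(a: set, b: set) -> int:
    # Exact number of unique seeders; needs the identities, not just the counts.
    return len(a | b)


print(combine_by_sum(tracker_a, tracker_b))    # 2500
print(combine_by_max(tracker_a, tracker_b))    # 1500
print(combine_by_union(tracker_a, tracker_b))  # 2000
```

With only the two numbers available, the union cannot be computed, which is why taking the max is the safe choice for PC1.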

The alternative algorithm I want to propose works in a totally different way:

To implement this algorithm, we can use the HyperLogLog counter of seeders for each infohash (Initially, I wanted to use bloom filters, but Vadim suggested using HyperLogLog counters, and I really liked that idea). The HyperLogLog counter has the following characteristics:

Now to the (simplified) description of the PC2 algorithm:

Following this algorithm, each Tribler node will eventually hold a total count of unique seeders for each live torrent. But since a HyperLogLog counter cannot decrease, we need to clear obsolete information somehow. To do this, we can split the data into time buckets (for example, 1-hour intervals) and gather information for each time interval separately. In this case, the data structure will look like {time_interval: {infohash: seeders_counter}}, and the packet transferred to peers will contain time_interval as well. For the user in the GUI, we can display an interpolated value between the previous and current time intervals.
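A rough sketch of how the time-bucketed counters could look. The HyperLogLog class below is a minimal textbook version written for illustration; the helper names (bucket_of, merge_remote, drop_old), the SHA-1 hashing, the register count, and the "keep two buckets" rule are all my assumptions, not existing Tribler code.

```python
import hashlib
import math
from typing import Dict


class HyperLogLog:
    """Minimal HyperLogLog counter (illustrative only)."""

    def __init__(self, p: int = 10):
        self.p = p                      # number of bits used as register index
        self.m = 1 << p                 # number of registers (1024 by default)
        self.registers = [0] * self.m

    def add(self, item: bytes) -> None:
        # Take 64 bits of a SHA-1 digest as the hash value.
        h = int.from_bytes(hashlib.sha1(item).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # Register-wise max: equivalent to counting the union of both sets.
        if self.p != other.p:
            raise ValueError("counters must use the same number of registers")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)         # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)       # small-range (linear counting) correction
        return raw


BUCKET_SECONDS = 3600  # 1-hour intervals, as in the example above

# {time_interval: {infohash: seeders_counter}}
Store = Dict[int, Dict[bytes, HyperLogLog]]


def bucket_of(timestamp: float) -> int:
    return int(timestamp // BUCKET_SECONDS)


def merge_remote(store: Store, time_interval: int, infohash: bytes,
                 remote: HyperLogLog) -> None:
    """Combine a counter received from a peer with the local one."""
    counters = store.setdefault(time_interval, {})
    if infohash not in counters:
        counters[infohash] = HyperLogLog(remote.p)
    counters[infohash].merge(remote)


def drop_old(store: Store, current_interval: int, keep: int = 2) -> None:
    """Forget buckets older than the last `keep` intervals."""
    for interval in [i for i in store if i <= current_interval - keep]:
        del store[interval]
```

Because merge is a register-wise max, merging the same remote counter twice, or receiving it through two different paths, does not change the estimate. That is the property that lets gossiped data be combined instead of replaced.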

When a node is shut down for an extended period (say, a week), we want to display popularity information to the user after the node is relaunched. To do this, before shutting down, we can store the estimated seeders count in the TorrentState entity and use it as the previous value for interpolation when the node launches again.
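One possible reading of "interpolated value" is a linear blend weighted by how far we are into the current interval. This formula is my assumption, and the function name is hypothetical. At the start of an interval the current counter is still nearly empty, so the previous value dominates; after a restart, the previous value would be the estimate stored in TorrentState.

```python
def interpolated_seeders(previous: float, current: float, now: float,
                         bucket_seconds: int = 3600) -> float:
    """Blend the previous and current interval estimates.

    previous: estimate for the last completed interval (or the value
              restored from storage after a restart)
    current:  estimate accumulated so far in the running interval
    now:      current time in seconds
    """
    fraction = (now % bucket_seconds) / bucket_seconds  # 0.0 at interval start
    return (1 - fraction) * previous + fraction * current


# Half-way through an interval, the two estimates are weighted equally.
print(interpolated_seeders(previous=100, current=200, now=1800))  # 150.0
```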

Now about possible attacks. Two types of attack look especially relevant here. The first is when some node wants to suppress the spread of information about a popular torrent. As I already said, if there is at least one uncompromised path between good nodes, the evil node will not be able to suppress the information, as the same information will be transmitted through a different path.

The second attack, when some node wants to inflate the seeders count for some torrent, is harder to prevent with the proposed algorithm (but it seems to be a rarer type of attack). It is possible to "pollute" a counter with a large number of random hashes to give the impression that there are a lot of seeders. As a countermeasure, we can add a second HyperLogLog counter that counts the "disappointment" of nodes that attempt to connect to a supposedly popular torrent only to see a low number of seeders. Then we can have some heuristic about what to do with suspicious torrents.

But as I said, right now my proposal is to have the second popularity algorithm, PC2, just to evaluate the data obtained from the current algorithm, PC1. It is not necessary to display PC2 results to end users, so at this moment nobody will have an incentive to attack it.

So what do you think, is implementing an alternative algorithm a good way to understand the results we have from the current algorithm?