the8472 / mldht

Bittorrent Mainline DHT implementation in java
Mozilla Public License 2.0

Multihoming question #26

Closed: MaticSincek closed this issue 4 years ago

MaticSincek commented 4 years ago

Hi.

For research purposes I am trying to simulate a Sybil attack on the network. For that I would need multiple nodes on the same machine with different node IDs. I am working with a local address within my home network.

Is this possible to do and does multihoming have something to do with it?

I have tried to run the nodes simultaneously from IntelliJ and it SEEMS to work, but should it?

My guess was that if I just set up multiple nodes without multihoming, the ports will get messed up and the nodes wouldn't know which message is for which, so I would have to port-forward them.

I am a rookie in network coding. Any insight is much appreciated.

Thank you in advance, M

the8472 commented 4 years ago

Multihoming enables a single DHT instance to operate under multiple public identities (RPCServer instances) while internally maintaining shared state for all of them. Thus it knows that all its public identities are its own, and it will behave differently towards them than towards genuine remote, untrusted peers. Multihoming is also restricted to publicly routable unicast addresses. I think the former is more of an issue for research purposes; the latter can easily be solved by either removing the checks or internally assigning IP addresses that are normally considered public.

Operating multiple nodes that behave as if the other ones are untrusted remote peers indeed requires creating multiple DHT instances.
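
Roughly along these lines (untested sketch; the ports and storage paths are placeholders, and the configuration interface is the one described in use-as-library.md):

    import java.nio.file.Path;
    import java.nio.file.Paths;

    import lbms.plugins.mldht.DHTConfiguration;
    import lbms.plugins.mldht.kad.DHT;
    import lbms.plugins.mldht.kad.DHT.DHTtype;

    // several independent DHT instances, each with its own identity,
    // port and storage directory, treating each other as untrusted peers
    public class MultipleInstances {
        public static void main(String[] args) throws Exception {
            for (int i = 0; i < 5; i++) {
                final int port = 49001 + i;                      // placeholder port range
                final Path storage = Paths.get("dht-node-" + i); // separate ID persistence per instance
                DHT dht = new DHT(DHTtype.IPV4_DHT);
                dht.start(new DHTConfiguration() {
                    @Override public boolean isPersistingID() { return true; }
                    @Override public Path getStoragePath() { return storage; }
                    @Override public int getListeningPort() { return port; }
                    @Override public boolean noRouterBootstrap() { return false; }
                    @Override public boolean allowMultiHoming() { return false; }
                });
                dht.bootstrap();
            }
        }
    }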

the ports will get messed up and the nodes wouldn't know which message is for which, so I would have to port-forward them.

Ports themselves are not an issue. DHT contacts are tracked as <ip, port, node ID> tuples, so multiple nodes operating on the same IP under different ports can be distinguished. But to defend against sybil attacks (heh), mldht has some restrictions on how many nodes per IP address it will allow into its routing table and into the temporary state of iterative lookups. It is also coded with the assumption that it operates on the open internet. While the bittorrent DHT can in principle be operated in non-public networks, that would lead to a split horizon, and to avoid dealing with that it filters out any non-public IPs. I have considered making this configurable, but that isn't the case yet.

So for research purposes you will need to assign hundreds or thousands of public IP addresses to an interface and then bind each DHT instance to one of those addresses via its configuration. This shouldn't be too difficult on Linux. You can even do it in an isolated network namespace to avoid messing up the initial namespace's routing or contacting nodes on the real internet.

I don't know how scalable my implementation is when you're using separate nodes. I have tested it in single-node, multi-homed mode with about 256 IP addresses and it was able to handle the traffic of the in-the-wild bittorrent DHT (millions of nodes) with CPU load equivalent to 1-2 cores.

Any insight is much appreciated.

If you haven't already, take a look at use-as-library.md, especially the debugging section; I have diagnostic/debug output for almost everything.

Depending on what you wish to simulate, it might also be possible to take shortcuts. E.g. creating an idealized network could consist of simply creating a million node IDs (represented by the Key class), then instantiating routing tables and populating them without going through the network. You might have to rip out some code or put your classes in the same package to get package-private access, so that some classes can be instantiated without their dependencies. Then again, some of the defense algorithms are only implemented at a higher level and are not performed if you access classes through their low-level interfaces.
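
E.g. purely in-memory, something like this (untested sketch; it assumes Key.createRandomKey() and Key.distance() do what their names suggest):

    import java.util.ArrayList;
    import java.util.List;

    import lbms.plugins.mldht.kad.Key;

    public class IdealizedKeyspace {
        public static void main(String[] args) {
            // an idealized population of node IDs, no sockets involved
            List<Key> ids = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++)
                ids.add(Key.createRandomKey());

            // resolve "who is closest to this target" purely by XOR distance,
            // which is what an iterative lookup converges towards on the wire
            Key target = Key.createRandomKey();
            Key closest = ids.stream()
                .min((a, b) -> target.distance(a).compareTo(target.distance(b)))
                .orElseThrow();
            System.out.println("closest to " + target + ": " + closest);
        }
    }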

MaticSincek commented 4 years ago

Thank you for such a quick and comprehensive response.

I understand there is a limit on nodes per IP, and if I understood you right, what I'm trying to do is quite a bit simpler. I am trying to recreate the experiment in the article On Understanding the Existence of a Deep Torrent (section IV).

What it does is collect nodes in some 8-bit prefix area (for example: C....... ........ ........ ........ ........) into a "known list", and then we create sybil nodes (close to the main node and to the nodes in the known list) that ping the nodes in the known list, so that the sybils get inserted into those nodes' routing tables. That way we should receive the same messages as the nodes in the known list (because the announcements might go through us). What we really want is to obtain the infohash announcements those nodes see.

I know this is outside the scope of your library, but I could use your opinion on it, since you know a lot, if not everything, about the network. If we are looking at 5 sybil nodes inserted into, let's say, 300 nodes in the prefix zone, we should get some announcements before the routing table entries are replaced? Is this right?

I hope this makes sense. Again, thank you for the explanation, M

the8472 commented 4 years ago

I am trying to recreate the experiment in the article On Understanding the Existence of a Deep Torrent (section IV).

If all you want is to obtain some infohashes, then that experiment is crude and unnecessarily hostile to the DHT, for various reasons. First of all there's BEP51, which makes it much simpler to sample infohashes from remote DHT nodes. Secondly, even without that, obtaining samples does not require performing a sybil attack on any narrow keyspace region. You simply need a bunch of node IDs (and IP addresses) scattered at random points in the keyspace, and you passively observe traffic. Due to the iterative way kademlia works you will also receive messages for destinations that are far from their target, so given enough IP addresses and time you'll get a pretty big sample without actively doing anything non-compliant. In fact, those two approaches are how the TorrentDumper component operates.

If we are looking at 5 sybil nodes that will be inserted into let's say 300 nodes(example) in the prefix zone we should get some announcements before the routing table entries are replaced? Is this right?

The kademlia algorithm maintains routing tables with a preference for the oldest alive nodes, which means simply pinging a node does not automatically get you into its routing table, at least when the algorithm is implemented properly. Granted, in practice many implementations are poor and that approach (or similar ones) does work, but why use hostile behavior when asking nicely works too?
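
To illustrate (simplified sketch, not mldht's actual code; the types are invented for the example):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // a full k-bucket keeps its oldest entries as long as they respond;
    // a newcomer (e.g. a pinging sybil) only gets a slot once an
    // existing entry has gone silent
    class KBucketSketch {
        static final int K = 8;

        record Contact(String address, boolean stillResponding) {}

        final Deque<Contact> entries = new ArrayDeque<>(); // oldest first

        void onContactSeen(Contact newcomer) {
            if (entries.size() < K) {
                entries.addLast(newcomer);   // free slot: accept
                return;
            }
            if (!entries.peekFirst().stillResponding()) {
                entries.pollFirst();         // evict only unresponsive entries
                entries.addLast(newcomer);
            }
            // otherwise the newcomer is dropped; pinging does not
            // displace an alive entry
        }
    }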

MaticSincek commented 4 years ago

Seems fair. I was relying on the article too much. I will try the two methods you mentioned. Thanks for your feedback.

MaticSincek commented 4 years ago

Sorry, I just have one more follow-up question that I hope you can answer.

In the KeyspaceSampler (which should be the class implementing BEP 51), what exactly is the role of the Prefix range and the NodeLookup seed? My guess is that I can't specifically target a node with a specific ID and must target a range of IDs. I also can't just set a prefix whose depth is the same length as the key.

The way I did it is:

    Key key = new Key(toProcess.getID().toString(false));
    Prefix prefix = new Prefix(key, 1);
    key = prefix.first();

and then I started first the NodeLookup (seed) and then the KeyspaceSampler:

    NodeLookup nodeLookup = new NodeLookup(key, rpc, dht.getNode(), false);
    nodeLookup.start();

    while(!nodeLookup.isFinished()){
        try {
            Thread.sleep(100);
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }

    KeyspaceSampler sampler = new KeyspaceSampler(rpc, dht.getNode(), prefix, nodeLookup, null);
    sampler.start();

I know the first code segment is wrong, but because of this restriction

    if (!seed.getTargetKey().equals(range.first()))

I didn't know how to do it any other way, but it gets me sample infohashes, though not from the nodes I want. What I tried to do in the first place was target the NodeLookup with an ID and make a prefix with the same key and a depth of 160 bits, which triggered an exception. A depth of 159 does not throw the exception, but 160 does.

I would sort of like to target specific node IDs; would this be possible? Also, if the seed and the range must be the same, why do we have to supply both? I probably don't understand something here at all.

the8472 commented 4 years ago

while(!nodeLookup.isFinished()){

You can wait for completion via task.addListener(...)
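
E.g. (untested sketch; assumes the listener interface has a single completion callback, so a lambda works):

    // needs java.util.concurrent.CountDownLatch
    CountDownLatch done = new CountDownLatch(1);
    nodeLookup.addListener(finishedTask -> done.countDown());
    nodeLookup.start();
    done.await(); // blocks until the lookup task finishes, no polling
                  // (await() throws InterruptedException at the call site)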

what exactly is the role of the Prefix range and the NodeLookup seed? My guess is that I can't specifically target a node with a specific ID and must target a range of IDs. I also can't just set a prefix whose depth is the same length as the key.

The keyspace sampler is an iterative lookup task that aims to sample infohashes from some sub-range of the DHT keyspace (or possibly the entire keyspace) as efficiently as possible, starting and expanding from an initial node set found through the NodeLookup. The prefix is the portion of the keyspace that gets enumerated.

/docs/keyspace-traversal.md describes how it works.
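
Putting that together with the check you quoted, the seed lookup must target range.first(). Untested sketch (someKey is a placeholder; the constructor shapes are taken from your snippet, with a listener instead of polling):

    // an 8-bit keyspace region around someKey
    Prefix prefix = new Prefix(someKey, 8);
    // the seed lookup has to target the first key of the range, so that
    // seed.getTargetKey().equals(range.first()) holds
    NodeLookup seed = new NodeLookup(prefix.first(), rpc, dht.getNode(), false);
    seed.addListener(finishedTask -> {
        KeyspaceSampler sampler = new KeyspaceSampler(rpc, dht.getNode(), prefix, seed, null);
        sampler.start();
    });
    seed.start();

Presumably the seed and the range are separate parameters because the initial node set and the region to enumerate are distinct concerns; the equality check merely enforces that the seed was aimed at the start of the range.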

If you only want to perform a single request-response pair with a single known node, you have to issue an RPCCall with a request message of type SampleRequest via an RPCServer. Unlike tasks, these have less pacing and back-pressure management, so you'll have to be careful not to spam them. It is a low-level API.
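
Very roughly (untested sketch; the exact constructor and listener signatures here are assumptions, check the actual message and RPCCall classes):

    // assumed shapes; see lbms.plugins.mldht.kad.messages for the real ones
    SampleRequest req = new SampleRequest(targetKey);  // BEP51 sample_infohashes
    req.setDestination(new InetSocketAddress(remoteIp, remotePort));
    RPCCall call = new RPCCall(req);
    call.addListener(new RPCCallListener() {
        @Override
        public void onResponse(RPCCall c, MessageBase rsp) {
            // extract the infohash samples from rsp here
        }
        @Override
        public void onTimeout(RPCCall c) {}
        @Override
        public void onStall(RPCCall c) {}
    });
    rpc.doCall(call); // rpc is the RPCServer; mind the spam caveat above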

Or you can implement your own Task subclass that provides the lookup behavior you want. I guess PingRefreshTask would be the simplest starting point, since it just pings a bunch of nodes in a prepopulated k-bucket.

MaticSincek commented 4 years ago

Thank you.