A question about k-bucket split

zhiqiangxu commented 4 years ago

Why does each node have to maintain full awareness of its neighborhood while having only vague awareness of remote keyspace portions ?

Or say, why can't its neighborhood also be vague ?

the8472 commented 4 years ago

What's the context of the question?

Anyway, it's necessary as a fundamental requirement of kademlia. Lookups targeting any specific ID only work if the accuracy of the knowledge increases the closer you get to the target, which means the nodes closest to the target must have the most accurate knowledge, which means about their neighbors.

zhiqiangxu commented 4 years ago

Yup that's very intuitive, but is there a rigorous proof that one will not be able to find a target node if k-closest neighbours assumption doesn't hold?

I'm trying to dig deeper into the rationale of kademlia~

I read some other papers which developed some other concepts like close region and far region, which also requires the k-closest neighbours to be in the close region.

the8472 commented 4 years ago

but is there a rigorous proof that one will not be able to find a target node if k-closest neighbours assumption doesn't hold?

The Kademlia paper only concerns itself with proving that its goals are achieved with high probability when its assumptions are upheld. There is little point in proving alternative schemes not working if those alternative schemes are not in the scope of the paper.

And asking for rigorous proof would require a rigorous definition of "vague neighborhood". Given sufficiently inaccurate routing tables one can certainly construct adversarial examples where two different nodes starting a lookup towards a specific target key will end up with disjoint final sets. In the wild this actually happens due to a combination of

naive lookup algorithms
non-robust implementation of routing table maintenance
malicious nodes exploiting above two points to harvest traffic

So if you're asking whether alternative approaches could theoretically exist, sure. The possibility is not excluded. But they would still require some sort restrictions on what gets into the routing to be able to converge on same the final set with high probability.

zhiqiangxu commented 4 years ago

Em, it's all about math actually, this algorithm may seem extremely obvious for mathematicians, but not that obvious for me yet, lacking some background in some math theory.

I tend to model the process this way so far:

When a new nodeID A joins a network by attaching to an existing node B, if it's in the close region of B, then add A to the close region and split if necessary, in either cases, A will not be dropped （A is reachable from the network since it's not dropped, no magic here）; otherwise if A is in the far region of B, which means the corresponding k-bucket is full, A can be safely dropped from the routing table of A, it's guaranteed that some node within that k-bucket will be able to reach A somehow. That somehow is the magic I'm considering ~

the8472 commented 4 years ago

if it's in the close region of B, then add A to the close region and split if necessary,

If you always split if it's close-by then you would essentially maintain full knowledge of your neighborhood (assuming maintenance happens). That is essentially what kademlia does. Split if it's in the keyspace region covered by your current home bucket, otherwise just fill the existing buckets. Modulo the unbalanced tree case.

zhiqiangxu commented 4 years ago

I noticed this answer: https://stackoverflow.com/a/30655403/3382012

You mentioned that "The correct algorithm is fairly involved, a linear scan over the whole table is easier to implement", and it seems to me that the involved algorithm doesn't even guarantee less time complexity than linear scan, because it may spend too much time in searching of non-existing nodes..

the8472 commented 4 years ago

it seems to me that the involved algorithm doesn't even guarantee less time complexity than linear scan

Its best case time complexity is logarithmic. Anyway, I believe your initial question has been answered and it's more a general kademlia question anyway, not specific to the source code of this repository.

If you want to ask theoretical questions then SO might be a better choice.

the8472 / mldht

A question about k-bucket split #29