libp2p / go-libp2p-kad-dht

A Kademlia DHT implementation on go-libp2p
https://github.com/libp2p/specs/tree/master/kad-dht
MIT License

PutValue appears to not put to K peers #139

Closed whyrusleeping closed 3 years ago

whyrusleeping commented 6 years ago

In order for queries to function properly, PutValue should ensure it puts out to K (20) peers. It currently seems that this is not happening, and that a much smaller number of records (less than 16, our GetValue threshold) is actually being put.

This was brought up in https://github.com/ipfs/go-ipfs/issues/3860

We should investigate and fix this.
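
To make the numbers concrete, here is a minimal sketch (not the actual routing.go code; the constant names are illustrative) of the invariant being described: a put needs to land on at least as many peers as a later GetValue expects to collect records from.

package main

import "fmt"

const (
	KValue    = 20 // replication target for PutValue
	GetQuorum = 16 // records GetValue waits for before returning
)

// checkPutCoverage reports whether a put that reached `successes` peers
// can satisfy a later GetValue that expects GetQuorum records.
func checkPutCoverage(successes int) error {
	if successes < GetQuorum {
		return fmt.Errorf("put reached only %d peers, below the GetValue quorum of %d", successes, GetQuorum)
	}
	return nil
}

func main() {
	fmt.Println(checkPutCoverage(3))  // roughly what this issue reports
	fmt.Println(checkPutCoverage(20)) // the intended behaviour
}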

inetic commented 6 years ago

I've been testing go-ipfs as it currently is in master (sorry, I'm still quite new to Go/gx so I don't know which commit that translates to here, but the ipfs version is QmY1y2M1aCcVh...).

I've added a few fmt.Printf(...) lines to routing.go/PutValue and routing.go/GetValues. Basically, I added this line:

fmt.Printf("routing.go/GetValues query reply from %v %q hasval:%v err:%v\n", p, key, rec.GetValue() != nil, err)

right after the call to dht.getValueOrPeers in routing.go/GetValues, and I added these lines:

if err != nil {
    fmt.Printf("routing.go/PutValue failed putting value to peer %v %q\n", p, err)
} else {
    fmt.Printf("routing.go/PutValue success putting value to peer %v\n", p)
}

right after the call to dht.putValueToPeer(ctx, p, key, rec) in routing.go/PutValue.

Then I started a program that does the publishing in a loop (with a few seconds between publishes) and got this output. That is, the publisher never actually succeeded in publishing the value to any node (am I reading that correctly?).

Once the first publish in the loop had completed, I started another program which resolved that IPNS name (in a loop as well). The output is here. That is, for the first 4 attempts the only peer from which the resolver was able to get the value was the publisher itself (look for the hasval:true string). Later, one more node (ID atonFX) came to know that value as well, although the publisher's PutValue function never seems to have noticed that it succeeded with that peer.

whyrusleeping commented 6 years ago

@inetic that's disconcerting... If your code is right, then I'd say your observation is correct. There are a weirdly large number of 'default failure' errors there that really shouldn't be popping up like that.

whyrusleeping commented 6 years ago

In your case, it looks like you are unable to dial to any of the DHT nodes you are supposed to be connecting to.

whyrusleeping commented 6 years ago

@inetic question: how many peers show up in ipfs swarm peers?

inetic commented 6 years ago

There are a weirdly large number of 'default failure' errors there that really shouldn't be popping up like that.

It seems the default failure errors happen for nodes that previously failed with a different error.

inetic commented 6 years ago

In your case, it looks like you are unable to dial to any of the DHT nodes you are supposed to be connecting to.

I'll try to modify the ipfs executable directly tomorrow to make sure it's not my misuse of the Go API, as I'm doing direct calls to some of the ipfs functions (1, 2). But in the meantime, do you see better connectivity in the PutValue function?

whyrusleeping commented 6 years ago

Currently, my only internet connection is a tethered LTE connection on a 1GB sim card. I'm trying to conserve data.

Maybe @Kubuxu @vyzo or @magik6k could take a look?

vyzo commented 6 years ago

@inetic were you testing go-libp2p-kad-dht on its own?

inetic commented 6 years ago

@vyzo I did just now; nothing out of the ordinary seems to be there. With these modifications to QmY1y2M1aCcV.../go-libp2p-kad-dht/routing.go

+++ /tmp/go-libp2p-kad-dht.orig/routing.go  2018-04-27 13:56:08.998398626 +0200
@@ -83,11 +83,6 @@

            err := dht.putValueToPeer(ctx, p, key, rec)
            if err != nil {
-               fmt.Printf("PutValue failure %q %v %q\n", p, err, key)
-           } else {
-               fmt.Printf("PutValue success %q %q\n", p, key)
-           }
-           if err != nil {
                log.Debugf("failed putting value to peer: %s", err)
            }
        }(p)
@@ -211,7 +206,6 @@
        })

        rec, peers, err := dht.getValueOrPeers(ctx, p, key)
-       fmt.Printf("getValueOrPeers peercount:%d hasval:%v %v\n", len(peers), rec.GetValue() != nil, err)
        switch err {
        case routing.ErrNotFound:
            // in this case, they responded with nothing,

I'm seeing this output.

inetic commented 6 years ago

So I've done the same test with only go-ipfs sources. Basically what I've done is this:

  1. Install a fresh Go 1.9.5
  2. Compile and install the newest go-ipfs from GitHub
  3. Add the fmt.Printf debug printing to routing.go/PutValue
  4. Run the daemon, wait for it to get ready
  5. Run ipfs name publish Qm...

Here is the script in detail (and here the printf.patch file which the script uses).

Unfortunately, my node still doesn't connect to any other peer when doing PutValue. The log can be found here.

EDIT: Improved the script

inetic commented 6 years ago

@whyrusleeping sorry, missed your question yesterday

question: how many peers show up in ipfs swarm peers?

In the above tests, I started publishing very soon after the daemon printed "Daemon is ready".

Now I've modified the script again so that it prints the swarm peers. I waited about a minute to get ~100 peers in the swarm and then started publishing. This time PutValue did succeed with 7 peers (out of 80). Here is the log.

inetic commented 6 years ago

One more test: I did the same as above, but this time I did it on a DigitalOcean instance and I waited for ~200 peers. Unfortunately, it did not succeed in connecting to any peer. Here is the log.

whyrusleeping commented 6 years ago

So something is up here. Either your node isn't able to dial properly for some reason (could be bad addresses in the network, or something) or there are a huge number of undialable nodes in the network.

vyzo commented 6 years ago

I tried a test IPNS publish from my node -- it took 2m49s and I am seeing lots of failed putting value to peer errors.

Note I am running master (no printf patches) on my laptop node.

inetic commented 6 years ago

Either your node isn't able to dial properly for some reason (could be bad addresses in the network, or something)

That's what I was thinking as well, and why I did the test on the DigitalOcean instance, which has a public IP on the internet. All the other tests I did were on my home PC, which is behind a router.

whyrusleeping commented 6 years ago

Escalating this...

vyzo commented 6 years ago

I wrote a debug crawler that crawls the DHT and tries to connect to each discovered peer.

The results after an overnight run:

In short, 61% of the DHT is undialable!

Edit: Up to 2268 peers now, 1391 undialable.
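
(For anyone who wants to reproduce this kind of measurement, a rough sketch of a dialability check is below. It is not the crawler used here, it assumes the current go-libp2p API, and it only dials peers the host already has addresses for rather than doing a full DHT random walk.)

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/peer"
)

func main() {
	ctx := context.Background()

	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer h.Close()

	// A real crawler would populate the peerstore via DHT queries for
	// random keys; here we simply try every peer we have addresses for.
	var ok, failed int
	for _, p := range h.Peerstore().PeersWithAddrs() {
		dctx, cancel := context.WithTimeout(ctx, 10*time.Second)
		err := h.Connect(dctx, peer.AddrInfo{ID: p, Addrs: h.Peerstore().Addrs(p)})
		cancel()
		if err != nil {
			failed++
		} else {
			ok++
		}
	}
	fmt.Printf("dialable: %d, undialable: %d\n", ok, failed)
}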

MichaelMure commented 6 years ago

FWIW, I get a similar result from a home network:

799 "ERROR"
503 "OK"

61% undialable.

whyrusleeping commented 6 years ago

Thanks @MichaelMure!

We're working on adding more data collection to this crawler. Current approaches we're thinking about are:

florianlenz commented 6 years ago

When I put a value to the DHT, it's only put to 3 out of 20 peers (sometimes more, sometimes less). I think the error I get, panic: routing: not found, is related to this issue. The project I am working on is blocked by this. Is there something I can help with? Wouldn't fetching peers and putting the key to them until we are sure it was put to 20 peers fix the problem?

whyrusleeping commented 6 years ago

@florianlenz yeah, making sure we put it to 20 peers would help, but we need to make sure those are roughly the same 20 peers that someone doing a lookup of the value will contact. I think changing the code slightly to find the '20 closest peers we can connect to' might help dramatically. I'm doing some experiments right now with DHT resolution; let me know if you try anything and figure out something interesting.

whyrusleeping commented 6 years ago

Random thing I tried:

reducing the dial timeout in go-libp2p-conn from 60s to 5s. Seems to have made things quite a bit faster overall, without reducing the number of records I get back. (I'm testing DHT 'Gets' for IPNS records.)
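
(A similar effect can be approximated without patching go-libp2p-conn by putting a deadline on each dial from the caller's side; a sketch, assuming the current go-libp2p API:)

package dhtutil

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
)

// connectWithShortTimeout bounds a single dial to 5s via a context
// deadline, approximating the shorter dial timeout described above
// without changing any package-level timeout.
func connectWithShortTimeout(ctx context.Context, h host.Host, pi peer.AddrInfo) error {
	dctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	return h.Connect(dctx, pi)
}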

whyrusleeping commented 6 years ago

Another interesting thing:

I changed the GetClosestPeers logic to only return peers we 'successfully connected' to (by calling host.Connect on each peer before adding it to our 'good set'). For some reason, I'm still getting puts failing because of 'connection reset'. This is really fishy...

cc @Stebalien

whyrusleeping commented 6 years ago

I also get one put value failed: protocol not supported, which implies that we're attempting to use a dhtclient peer in our queries. This is not a good thing; we should look into it further.

whyrusleeping commented 6 years ago

So here's a fairly typical run:

2018/06/04 00:15:14 starting putvalue
connected to:  <peer.ID bE9eo6>
Putting value to peer:  <peer.ID bE9eo6>
put value failed:  <peer.ID bE9eo6> EOF
connected to:  <peer.ID XRxDBW>
Putting value to peer:  <peer.ID XRxDBW>
put value failed:  <peer.ID XRxDBW> connection reset
connected to:  <peer.ID Xtjs28>
connected to:  <peer.ID R746Ek>
Putting value to peer:  <peer.ID Xtjs28>
Putting value to peer:  <peer.ID R746Ek>
connected to:  <peer.ID X8wpfh>
Putting value to peer:  <peer.ID X8wpfh>
connected to:  <peer.ID f17W6t>
Putting value to peer:  <peer.ID f17W6t>
connected to:  <peer.ID Q2SKsq>
Putting value to peer:  <peer.ID Q2SKsq>
connected to:  <peer.ID btpiK5>
Putting value to peer:  <peer.ID btpiK5>
put value failed:  <peer.ID btpiK5> connection reset
put value failed:  <peer.ID X8wpfh> connection reset
put value failed:  <peer.ID f17W6t> connection reset
connected to:  <peer.ID fJ4cFr>
Putting value to peer:  <peer.ID fJ4cFr>
connected to:  <peer.ID X9Nhr5>
Putting value to peer:  <peer.ID X9Nhr5>
connected to:  <peer.ID dUEWuW>
connected to:  <peer.ID WUXoqr>
Putting value to peer:  <peer.ID dUEWuW>
connected to:  <peer.ID VqHMQg>
Putting value to peer:  <peer.ID VqHMQg>
Putting value to peer:  <peer.ID WUXoqr>
connected to:  <peer.ID WpPMz4>
Putting value to peer:  <peer.ID WpPMz4>
connected to:  <peer.ID RexRiz>
Putting value to peer:  <peer.ID RexRiz>
connected to:  <peer.ID U8CLyQ>
Putting value to peer:  <peer.ID U8CLyQ>
connected to:  <peer.ID SoLV4B>
Putting value to peer:  <peer.ID SoLV4B>
connected to:  <peer.ID anKQH7>
Putting value to peer:  <peer.ID anKQH7>
connected to:  <peer.ID RfMd5A>
connected to:  <peer.ID Sv3MDA>
20 closest peers were connected to!
Putting value to peer:  <peer.ID Sv3MDA>
Putting value to peer:  <peer.ID RfMd5A>
put value failed:  <peer.ID dUEWuW> connection reset
put value failed:  <peer.ID X9Nhr5> connection reset
put value failed:  <peer.ID WpPMz4> connection reset
put value failed:  <peer.ID RfMd5A> connection reset
put value failed:  <peer.ID fJ4cFr> connection reset
put value failed:  <peer.ID U8CLyQ> connection reset
put value failed:  <peer.ID SoLV4B> connection reset
put value failed:  <peer.ID WUXoqr> connection reset
put value failed:  <peer.ID VqHMQg> connection reset
put value failed:  <peer.ID Q2SKsq> connection reset

20 put attempts, 20 successful calls to host.Connect, 15 failures. fissssshhhhhhyyyyyyyyyyy

whyrusleeping commented 6 years ago

🐟

florianlenz commented 6 years ago

I tried to find a set of 20 peers for a given key that are online and speak the DHT protocol, but it took around 30 minutes for me to get that list. So I guess it's not the solution we are looking for.

whyrusleeping commented 6 years ago

These particular connection reset errors are the yamux.ErrConnectionReset kind. 🐟 🍥 🎣

whyrusleeping commented 6 years ago

@florianlenz I don't think it should take that long; we talk to more than 20 peers in our crawl towards finding the 20 closest. So we should be finding 20 acceptable peers along the way. They might just not be the closest we've ever heard about.

whyrusleeping commented 6 years ago

yamux returns ErrConnectionReset when the stream gets reset. Very misleading error message.

whyrusleeping commented 6 years ago

Two things.

First off: it turns out that if a peer doesn't have the public key to validate an IPNS record it receives, it just tosses the record and resets the stream. That stream reset was producing the 'connection reset' errors I was seeing, which led me to this bug. Basically, using an IPNS key other than your peer's key will result in most peers not accepting the record you are putting (aside from the peers that overlap in the keyspace between the closest peers to your record and the closest peers to your public key). This is why both @florianlenz and I were seeing puts only going to like three peers successfully. It's not that we weren't sending the records to them; it's that they were just dropping them. This is something we need to address ASAP.

The other thing: this patch makes things work reasonably well:

diff --git a/lookup.go b/lookup.go
index dd47516..d6f04a5 100644
--- a/lookup.go
+++ b/lookup.go
@@ -10,6 +10,7 @@ import (
    pb "github.com/libp2p/go-libp2p-kad-dht/pb"
    kb "github.com/libp2p/go-libp2p-kbucket"
    peer "github.com/libp2p/go-libp2p-peer"
+   pstore "github.com/libp2p/go-libp2p-peerstore"
    notif "github.com/libp2p/go-libp2p-routing/notifications"
 )

@@ -100,13 +101,19 @@ func (dht *IpfsDHT) GetClosestPeers(ctx context.Context, key string) (<-chan pee

        if res != nil && res.finalSet != nil {
            sorted := kb.SortClosestPeers(res.finalSet.Peers(), kb.ConvertKey(key))
-           if len(sorted) > KValue {
-               sorted = sorted[:KValue]
-           }

+           var found int
            for _, p := range sorted {
-               out <- p
+               if err := dht.host.Connect(ctx, pstore.PeerInfo{ID: p}); err == nil {
+                   found++
+                   out <- p
+               }
+               if found >= KValue {
+                   break
+               }
            }
        }
    }()

Basically "Get me the K closest peers that I can actually connect to". Theres definitely room for some parallelism here, and also we should do some analysis on this, but in general, using this, my ipns resolves get all 16 expected values back, and they get them back pretty fast (especially on repeated resolves).

whyrusleeping commented 6 years ago

(brb, getting some dinner)

whyrusleeping commented 6 years ago

So it appears that the lack of access to public keys is a pretty significant issue here... If we had already switched over to the embeddable ed25519 keys, we would be okay, but with RSA it's a problem.

One option would be to augment the DHT codebase in a pretty ugly way so that PutValue for IPNS records first puts the public key to the peer and then puts the IPNS record. This is the easiest change to make from a protocol design standpoint, but actually writing that code will be... rather unpleasant.

Another option would be to add the public key to the IPNS record itself, so that we guarantee the peer has it. This would require no changes to the DHT codebase, but would require some changes to the IPNS validator to handle it. This change is easier from the implementation perspective, but requires pretty much everyone to update their nodes for it to take effect.

I'm leaning towards the first choice, but it will require some code mangling. I'll have a chat with @Stebalien when he wakes up; we need to get this fixed ASAP.
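
(For illustration, the first option might look roughly like this at the publish call site. This is a hypothetical helper, not the actual go-ipfs change; it assumes the /pk/ keyspace used for public keys in the DHT and the routing.ValueStore interface.)

package dhtutil

import (
	"context"

	"github.com/libp2p/go-libp2p/core/crypto"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/routing"
)

// putWithPubKey publishes the signing public key into the DHT under its
// /pk/ key before publishing the IPNS record that it validates, so that
// receiving peers are able to verify the record instead of dropping it.
// vr is any routing.ValueStore (e.g. the DHT); ipnsKey and ipnsRecord are
// the already-built record key and bytes.
func putWithPubKey(ctx context.Context, vr routing.ValueStore, pub crypto.PubKey, ipnsKey string, ipnsRecord []byte) error {
	id, err := peer.IDFromPublicKey(pub)
	if err != nil {
		return err
	}
	pkBytes, err := crypto.MarshalPublicKey(pub)
	if err != nil {
		return err
	}
	// First make sure the public key itself is resolvable...
	if err := vr.PutValue(ctx, routing.KeyForPublicKey(id), pkBytes); err != nil {
		return err
	}
	// ...then publish the IPNS record that it validates.
	return vr.PutValue(ctx, ipnsKey, ipnsRecord)
}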

florianlenz commented 6 years ago

@whyrusleeping really nice work :) Would it make sense to drop the connection after connecting? A host could end up having a lot of connections when doing a lot of lookups.

whyrusleeping commented 6 years ago

I think it's probably best to leave that up to the connection manager. It will be able to make decisions about connections informed by a broader picture than just the DHT.
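
(For example, the DHT could tag peers it has just put to, so the connection manager preferentially keeps those connections when trimming; the tag name and weight below are made up:)

package dhtutil

import (
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
)

// tagAfterPut marks a peer as useful to the DHT so the connection manager
// prefers to keep it when it trims down to its low-water mark, instead of
// the DHT closing connections itself.
func tagAfterPut(h host.Host, p peer.ID) {
	h.ConnManager().TagPeer(p, "dht-put", 5)
}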

whyrusleeping commented 6 years ago

@florianlenz question, which version of the DHT code are you testing with?

florianlenz commented 6 years ago

I tested with version 4.0.4 (QmSBxn1eLMdViZRDGW9rRHRYwtqq5bqUgipqTMPuTim616)

whyrusleeping commented 6 years ago

Ah, interesting. So you weren't running with @Stebalien's latest changes (which resulted in gx version 4.1.0). There are some changes there that make this process work a little differently; thanks for the data point :)

whyrusleeping commented 6 years ago

This patch should help IPNS publishing in general: https://github.com/ipfs/go-ipfs/pull/5079 It will at least stop certain peers from throwing away the record you are trying to put (once peers update to that version of the code :/ )

The other, bigger problem I see is that (as discussed previously) 'GetClosestPeers' currently returns the 20 closest peers it has heard of. I have a patch locally that gathers all the 'close' peers, sorts them, then tries connecting to them in order of closeness. Once it succeeds in connecting to 20 of them, it returns those 20. This results in a very well-working IPNS (but my patch is pretty ugly). Most runs go through 40-70 peers before they find 20 connectable ones (implying between 50% and 30% connectivity, which roughly matches what our earlier experiments showed).

I'm going to try out a patch where we try to relay every connection and see how that affects those ratios.

whyrusleeping commented 6 years ago

Automatically trying to relay every connection definitely lowers the 'miss rate' on connecting to the closest peers. The last few tests I ran show it about 30% lower (we can connect to 30% more peers than before). But for some currently unknown reason, overall performance is pretty degraded compared to not trying relays. I also end up connected to a lot more peers at the end of a run than before, >300 as opposed to ~100 before... very odd.

Stebalien commented 6 years ago

@whyrusleeping what's the status of your closest peers patch?

whyrusleeping commented 6 years ago

@Stebalien should be much better: https://github.com/libp2p/go-libp2p-kad-dht/pull/159 (merged)

I haven't had great internet since then, but I do now, and can start debugging things a bit more. In any case, we should start pushing these changes out and cut a new ipfs release. Let's get it all deployed!

Stebalien commented 6 years ago

Got it. I asked because I tried publishing an IPNS entry on two different temporary nodes and:

  1. On the first node, multiple publishes all succeeded but my other node only ever saw the first value (even after, e.g., restarting).
  2. On the second node, I was never able to complete a single publish.

whyrusleeping commented 6 years ago

@Stebalien was this on a testnet of nodes all using the proper value, or was this on the main network?

Getting back an old value seems a bit odd though; we should investigate that.

Stebalien commented 6 years ago

Main network.

whyrusleeping commented 6 years ago

Yeah, you're unlikely to succeed much at all trying to publish to the main network right now


Stebalien commented 6 years ago

Oh. Yeah, I forgot about that.

stebalien slinks off into a corner to quietly continue breaking everything

markg85 commented 4 years ago

Hi,

A couple of hours ago I made issue https://github.com/ipfs/go-ipfs/issues/6800 (name resolve takes exactly 1 minute and loses the cache after 30 seconds). That issue is closed, which is fine. But reading through this whole thread, I'm led to believe this was (or is?) the actual core issue behind the observed behavior.

Now this issue does reference https://github.com/filecoin-project/go-filecoin/issues/437 (filecoin? What does that have to do with this very issue? Anyhow.), where it's said to no longer be a problem (that was on Feb. 27, 2019) and where it was subsequently closed. This issue, however, remained open.

And then I read much of https://github.com/ipfs/go-ipfs/issues/3860, where the behavior eventually turns out to be a feature in its current form (if I'm reading it correctly).

So now I'm totally lost as to whether the behavior I see in ipfs name resolve (which takes exactly 1 minute for new publishes) is by design and working properly (thus meaning the issue in this thread is resolved too), or whether there still is an issue, as I can't imagine 1 minute is by design and "proper".

Please note, I'm trying to access a newly published website, so I'm using a browser to access that IPNS link. Command-line trickery to speed it up won't work in the browser. I'm guessing I have to modify IPFS Companion if I want to try the same command-line trickery in the browser?

I'd happily try out a go-ipfs 0.5.0 alpha + an IPFS Companion to go along with it to see if the actual issue is resolved (i.e. a fast-working IPNS), but I have no clue what I'm even supposed to install now. So just a friendly question: is this issue resolved in the current master that is going to be 0.5.0? Can I help (by testing stuff out, and how)?

aschmahmann commented 4 years ago

@markg85 your question about the current state of IPNS is better asked on discuss.ipfs.io than here. If you start an issue there, I'll happily go through why things are working the way they currently are, some of the ongoing work to improve performance, and some areas where having some help would be great (e.g. testing).

markg85 commented 4 years ago

@aschmahmann that's a weird and concerning request. I'm only wondering how it's supposed to work because of the aforementioned "perceived bugs" and the "hints" I'm catching elsewhere that this is actually a feature. This very bug report should be the place, as it's the root cause. If I move to discuss, it will also add yet another place where "ipns performance" is discussed, when it's already been discussed in so many different places (multiple threads on discuss, multiple GitHub issues, and even branches). Are you sure you want that?

If you still say yes to this, then I'll happily open a new topic there; I'm just trying to prevent even more thinning out of information on this subject.