Open vrde opened 4 years ago
maybe similar/dupe of https://github.com/ipfs/go-ipfs/issues/7211 ?
How long is the download stuck? Is it possible to dial `node_0` from a new node? If you restart `node_1`, does it work? If you explicitly re-connect `node_1` to `node_0`, does it work?

I'm wondering if `node_1` has stale addresses (ports, in this case) for `node_0` and keeps trying the wrong ones.
> How long is the download stuck?
I don't remember exactly how long, but I think it was stuck for a long time, maybe 10 or 20 minutes. I can give it another shot, of course.
> Is it possible to dial `node_0` from a new node?
Which command should I use for that?
> If you restart `node_1`, does it work?
I remember I was able to get it to work again by stopping the command (Ctrl+C) and running it again.
> If you explicitly re-connect `node_1` to `node_0`, does it work?
Which command should I use for that?
> Is it possible to dial `node_0` from a new node? Which command should I use for that?
ipfs swarm connect /p2p/$PEER_ID_OF_OTHER_NODE
I was trying to determine if your node was dialable. However, if stopping and re-starting the command fixes the problem, it probably is dialable and this isn't the problem.
> I remember I was able to get it to work again by stopping the command (Ctrl+C) and running it again.
That points to an issue in Bitswap itself. Could you run the following?

On `node_1`:

> ipfs bitswap wantlist
> ipfs swarm peers --streams | grep -A5 $node_0

And on `node_0`:

> ipfs bitswap wantlist -p $node_1
> ipfs swarm peers --streams | grep -A5 $node_1
That will tell me:

- what `node_1` is asking for, and
- what `node_0` thinks `node_1` wants.

I am not the original poster, but I'm also hit by this. My situation is a little different: for me, `node_1` is on a flaky Wi-Fi connection, but the outcome is the same...
`node_0`: QmYLhBZsWL2MQuNXFfn6Wr5snTTq2ys121nQ9FQoZnMLYa
`node_1`: Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
On `node_1`:
ipfs swarm peers --streams|grep -A5 QmYLhBZsWL2MQuNXFfn6Wr5snTTq2ys121nQ9FQoZnMLYa
/ip6/2a01:9420:9:301:e11e:2ca4:ea5c:5065/tcp/4001/p2p/QmYLhBZsWL2MQuNXFfn6Wr5snTTq2ys121nQ9FQoZnMLYa
/ipfs/bitswap/1.2.0
/ipfs/lan/kad/1.0.0
ipfs bitswap wantlist (+00:00:00) 2020-05-26 22:58:39
QmYhK5uZSrrxrVcpdZyaNTwNBUeXszy4DgWTDENypghMmm
QmcWQMg3MqRx1j5QXrq4JfyZ53NGMGt9M79meH7xUMdrke
QmdMojSFbivBaKVeGhdxvjAzhdWDAiDFq6fNKMH2XFQgWA
Qmf97jbWtmn2ZRAU6aXovYQZiMdCP5A9YFm9SH1igZM94E
On `node_0`:
ipfs swarm peers --streams|grep -A5 Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML (+00:00:00) 2020-05-26 17:56:43
/ip6/2002:2e24:2741:250::2/tcp/4001/p2p/Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
/ipfs/bitswap/1.2.0
ipfs bitswap wantlist -p Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
<NOTHING>
Sometimes I see this on `node_0`:
ipfs swarm peers --streams|grep -A5 Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML (+00:00:00) 2020-05-26 18:04:34
/ip6/2002:2e24:2741:250::2/tcp/4001/p2p/Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
/ipfs/bitswap/1.2.0
/ipfs/bitswap/1.2.0
/ipfs/bitswap/1.2.0
/ipfs/bitswap/1.2.0
/ipfs/bitswap/1.2.0
...for me the situation is so bad that I have been unable to get anything from `bitswap wantlist -p` on `node_0`, even after restarting `ipfs get` and even restarting `ipfs daemon` on `node_1` :-(

Also:
ipfs version --all (+00:00:00) 2020-05-26 23:11:00
go-ipfs version: 0.5.1-8431e2e87
Repo version: 9
System version: amd64/linux
Golang version: go1.14.2
Is there a way to force a "wantlist resend" on `node_1` (to `node_0`)?
> Is there a way to force a "wantlist resend" on `node_1` (to `node_0`)?
We send a full wantlist every 30s. However, we don't send every wantlist to every peer, only to peers that we think have the content. But when the download stalls, we should broadcast to all connected peers, so something fishy is going on here.
> /ipfs/bitswap/1.2.0 /ipfs/bitswap/1.2.0 /ipfs/bitswap/1.2.0 /ipfs/bitswap/1.2.0 /ipfs/bitswap/1.2.0
That can happen when sending/receiving blocks.
Could you try the new RC (`ipfs-update install v0.6.0-rc1`)? We had a potential run-away timeout that could maybe cause this issue. I think this is a new bug, but it would be good to confirm that.
@dirkmc thoughts?
Same behaviour with 0.6.0-rc1 (compiled from git): I can see some streams on both nodes, but the wantlist does not seem to be getting transferred. :-(
Also, this may be somehow connected to a "provide" feature/bug. Currently on `node_1`'s wantlist: QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU
node_0:
ipfs cat QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU | md5sum
bbfe0cc17f6a01caf8434c858e0caa34 -
...so the block is there. But:

On both `node_0` and `node_1`:
ipfs dht findprovs QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU
<NOTHING>
after almost two minutes.

So it's almost like `node_0` is not properly announcing its data, and combined with "we don't send every wantlist to every peer, only to peers that we think have the content", this may lead to this download-being-stuck situation.
After a forceful:
ipfs dht provide QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU
on `node_0`, I now correctly see:
ipfs dht findprovs QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU
QmYLhBZsWL2MQuNXFfn6Wr5snTTq2ys121nQ9FQoZnMLYa
on both `node_0` and `node_1`. But even though there's still a connection between the nodes, QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU still stays in `node_1`'s wantlist and does not appear in `node_0`'s wantlist listing for `node_1` :-(

...even after running the findprovs. I'd expect that manually finding the provider on `node_1` (which returns `node_0`, and considering we have a connection to that node) would immediately resolve the situation.
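As a stopgap, the manual `ipfs dht provide` step above could be scripted to re-announce every pinned root. A minimal sketch in Go, shelling out to the CLI: this is an illustration, not part of go-ipfs, and it assumes `ipfs` is on PATH with a running daemon; `parseCIDs` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// parseCIDs splits the newline-separated output of `ipfs pin ls -q`
// into individual CID strings.
func parseCIDs(out string) []string {
	return strings.Fields(out)
}

func main() {
	// List all recursively pinned roots on this node.
	out, err := exec.Command("ipfs", "pin", "ls", "--type=recursive", "-q").Output()
	if err != nil {
		fmt.Println("could not list pins:", err)
		return
	}
	for _, cid := range parseCIDs(string(out)) {
		// Force a fresh provider record for each pinned root,
		// mirroring the manual `ipfs dht provide <cid>` step.
		if err := exec.Command("ipfs", "dht", "provide", cid).Run(); err != nil {
			fmt.Println("provide failed for", cid, ":", err)
		}
	}
}
```

This only works around missing provider records; it does not address the wantlist not being transferred.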
> so it's almost like node_0 is not properly announcing its data and combined with "we don't send every wantlist to every peer, only to peers that we think have the content" this may lead to this download-being-stuck situation.
Sorry, we don't send the wantlist to all peers until we time out. Then we broadcast to all peers. That's why this is really strange.
Is there anything I can do to further investigate the issue?
@rpodgorny & @vrde thanks for taking the time to follow up on this issue, it really helps us a lot to receive these bug reports.
I've been trying to reproduce locally with some unit tests inside the go-bitswap project, but I haven't been able to so far. It may help us understand where the problem is if you output some debug logging while performing the `ipfs get`, on both the requesting node and the responding node.

To output debug-level logging in the daemon window for a particular subsystem, run the `ipfs log level` command from another window, e.g.:

ipfs log level bs:sess debug
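For convenience, the per-subsystem command can be run for several Bitswap-related loggers at once. A hypothetical helper, not part of go-ipfs: it assumes a running daemon, `ipfs` on PATH, and that these logger names (`bs:sess`, `bs:sprmgr`, `engine`, `bitswap`) exist in your build.

```go
package main

import (
	"fmt"
	"os/exec"
)

// Bitswap-related logging subsystems to raise to debug level.
// bitswap_network is omitted here because it is very verbose.
var debugSubsystems = []string{"bs:sess", "bs:sprmgr", "engine", "bitswap"}

// logLevelArgs builds the CLI arguments for one subsystem,
// equivalent to running `ipfs log level <subsystem> debug`.
func logLevelArgs(subsystem string) []string {
	return []string{"log", "level", subsystem, "debug"}
}

func main() {
	for _, s := range debugSubsystems {
		if err := exec.Command("ipfs", logLevelArgs(s)...).Run(); err != nil {
			fmt.Println("failed to enable debug for", s, ":", err)
		}
	}
}
```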
The subsystems that may output useful information:

- `bs:sess` (Bitswap sessions): The session keeps track of all the outgoing requests for a particular file. The debug output should tell us when the session broadcasts a request (meaning it can't find a peer that has the blocks it wants) and when the session receives blocks that it had asked for, e.g.
  2020-05-28T13:00:53.812-0400 DEBUG bs:sess session/session.go:218 Bitswap <- block {"local": "13bZk1tR4", "from": "13a5CQiao", "cid": "QmRFfdpSr7hvPYUNzPcKgvtQok95XGGnUKopTBy7LPa5FN", "session": 1}
- `bs:sprmgr` (Bitswap session peer manager): Tells us when a peer is added to / removed from the session.
- `engine` (the Bitswap "engine"): The component of Bitswap that receives incoming requests for blocks. It should tell us if someone asks us for blocks, and output whether we have those blocks.
- `bitswap` (general Bitswap log output): This is quite verbose, so if you turn it on you may need to grep through it for the things you are looking for. The most useful lines here are probably the ones from messagequeue.go. They tell us the want messages that get sent from the requesting node, e.g.
  2020-05-28T13:00:53.811-0400 DEBUG bitswap messagequeue/messagequeue.go:638 sent message {"type": "WANT_BLOCK", "cid": "QmNTPrxjHgastHFGJa3eJ8q6dqVaM4tJ3bFuVGhCz1E114", "local": "13bZk1tR4", "to": "13a5CQiao"}
- `bitswap_network` (Bitswap networking operations): Very verbose.

@rpodgorny wrote:
> maybe similar/dupe of #7211 ?
I don't think so. #7211 is more about the loss of DHT functionality after a disconnect, where it isn't properly restored/detected.
I'm currently checking it on master, and it seems to be gone. :)
That's why I haven't updated the ticket anymore for some time.
Version information:

Description:

`ipfs get` gets stuck if the other peer reconnects to the network. In this example, `node_0` is my laptop (behind NAT) and `node_1` is my VPS with a public IP.

- `node_0`: add a file and get the CID.
- `node_1`: run `ipfs get <CID>`.
- `node_0`: internet goes down.
- `node_0`: internet goes up again.
- `node_1`: the download is now stuck.

Note that `node_0` is the only node that can serve the CID.