
`ipfs get` doesn't automatically restart the download of a CID #7301

Open · vrde opened this issue 4 years ago

vrde commented 4 years ago

Version information:

go-ipfs version: 0.5.1
Repo version: 9
System version: amd64/linux
Golang version: go1.13.10

Description:

`ipfs get` gets stuck if the other peer reconnects to the network.

In this example, node_0 is my laptop (behind NAT) and node_1 is my VPS with a public IP.

Note that node_0 is the only node that can serve the CID.
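
Roughly, the setup looks like this (the file name and the reconnect step below are an illustrative sketch of my setup, not the exact commands):

# on node_0 (laptop behind NAT): add content that only this node has
ipfs add some-file        # note the printed CID

# on node_1 (VPS): fetch it
ipfs get $CID

# while the transfer is in progress, node_0 drops off the network and
# reconnects (e.g. a wifi toggle); the `ipfs get` on node_1 never resumes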

rpodgorny commented 4 years ago

maybe similar/dupe of https://github.com/ipfs/go-ipfs/issues/7211 ?

Stebalien commented 4 years ago

How long is the download stuck? Is it possible to dial node_0 from a new node? If you restart node_1, does it work? If you explicitly re-connect node_1 to node_0 does it work?

I'm wondering if node_1 has stale addresses (ports, in this case) for node_0 and keeps trying the wrong ones.
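
One rough way to check this (the peer ID is a placeholder) is to compare the addresses node_1 has cached for node_0 against what node_0 is currently announcing:

# on node_1: which addresses do we currently know for node_0?
ipfs swarm addrs | grep -A5 $PEER_ID_OF_NODE_0

# on node_0: which addresses is it actually announcing right now?
ipfs id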

vrde commented 4 years ago

> How long is the download stuck?

I don't remember exactly how long, but it was a long time, maybe 10 to 20 minutes. I can give it another shot, of course.

> Is it possible to dial node_0 from a new node?

Which command should I use for that?

> If you restart node_1, does it work?

I remember I was able to get it to work again by stopping the command (ctrl+c) and running it again.

> If you explicitly re-connect node_1 to node_0 does it work?

Which command should I use for that?

Stebalien commented 4 years ago

> Is it possible to dial node_0 from a new node? Which command should I use for that?

ipfs swarm connect /p2p/$PEER_ID_OF_OTHER_NODE

I was trying to determine if your node was dialable. However, if stopping and restarting the command fixes the problem, it probably is dialable and this isn't the problem.
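
Concretely, something like the following (the peer ID is a placeholder):

# on node_0: note the "ID" field in the output
ipfs id

# from a fresh node (or from node_1): try dialing node_0 directly
ipfs swarm connect /p2p/$PEER_ID_OF_NODE_0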

> I remember I was able to get it to work again by stopping the command (ctrl+c) and running it again.

That points to an issue in bitswap itself. Could you run:

On node_1:

> ipfs bitswap wantlist
> ipfs swarm peers --streams | grep -A5 $node_0

And the following on node_0:

> ipfs bitswap wantlist -p $node_1
> ipfs swarm peers --streams | grep -A5 $node_1

That will tell me:

  1. What node_1 is asking for.
  2. What node_0 thinks node_1 wants.
  3. Whether the peers are connected and what streams (requests) they have open.

rpodgorny commented 4 years ago

I am not the original poster, but I'm also hit by this. A slightly different situation: for me, node_1 is on a flaky wifi connection, but the outcome is the same...

node_0: QmYLhBZsWL2MQuNXFfn6Wr5snTTq2ys121nQ9FQoZnMLYa
node_1: Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML

on node_1:

ipfs swarm peers --streams|grep -A5 QmYLhBZsWL2MQuNXFfn6Wr5snTTq2ys121nQ9FQoZnMLYa                                                              
/ip6/2a01:9420:9:301:e11e:2ca4:ea5c:5065/tcp/4001/p2p/QmYLhBZsWL2MQuNXFfn6Wr5snTTq2ys121nQ9FQoZnMLYa
  /ipfs/bitswap/1.2.0
  /ipfs/lan/kad/1.0.0
ipfs bitswap wantlist
QmYhK5uZSrrxrVcpdZyaNTwNBUeXszy4DgWTDENypghMmm
QmcWQMg3MqRx1j5QXrq4JfyZ53NGMGt9M79meH7xUMdrke
QmdMojSFbivBaKVeGhdxvjAzhdWDAiDFq6fNKMH2XFQgWA
Qmf97jbWtmn2ZRAU6aXovYQZiMdCP5A9YFm9SH1igZM94E

on node_0:

ipfs swarm peers --streams|grep -A5 Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
/ip6/2002:2e24:2741:250::2/tcp/4001/p2p/Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
  /ipfs/bitswap/1.2.0
ipfs bitswap wantlist -p Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
<NOTHING>

sometimes i see this on node_0:

ipfs swarm peers --streams|grep -A5 Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
/ip6/2002:2e24:2741:250::2/tcp/4001/p2p/Qmcc6ZBqk9bMs85ZRL1h1gcowW4mFXsjuYBoVuxaYw1HML
  /ipfs/bitswap/1.2.0
  /ipfs/bitswap/1.2.0
  /ipfs/bitswap/1.2.0
  /ipfs/bitswap/1.2.0
  /ipfs/bitswap/1.2.0

...for me the situation is so bad that I have been unable to get anything from `bitswap wantlist -p` on node_0, even after restarting `ipfs get` and even restarting `ipfs daemon` on node_1 :-(

rpodgorny commented 4 years ago

also:

ipfs version --all
go-ipfs version: 0.5.1-8431e2e87
Repo version: 9
System version: amd64/linux
Golang version: go1.14.2

rpodgorny commented 4 years ago

is there a way to force a "wantlist resend" on node_1 (to node_0)?

Stebalien commented 4 years ago

> is there a way to force a "wantlist resend" on node_1 (to node_0)?

We send a full wantlist every 30s. However, we don't send every wantlist to every peer, only to peers that we think have the content. But when the download stalls, we should broadcast to all connected peers, so something fishy is going on here.
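
One rough way to watch for that rebroadcast from the outside (the peer ID is a placeholder, and `watch` is just a convenience):

# on node_0: poll what node_0 believes node_1 wants
watch -n 5 "ipfs bitswap wantlist -p $NODE_1_PEER_ID"

If the CID stays in node_1's own `ipfs bitswap wantlist` but never shows up there, the rebroadcast is not reaching node_0.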

> /ipfs/bitswap/1.2.0 /ipfs/bitswap/1.2.0 /ipfs/bitswap/1.2.0 /ipfs/bitswap/1.2.0 /ipfs/bitswap/1.2.0

That can happen when sending/receiving blocks.


Could you try the new RC (`ipfs-update install v0.6.0-rc1`)? We had a potential runaway timeout that could maybe cause this issue. I think this is a new bug, but it would be good to confirm that.
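
For reference, the upgrade step looks roughly like this (assuming `ipfs-update` is already installed; stop the daemon first, then restart it and retry the transfer):

ipfs-update install v0.6.0-rc1
ipfs version --all    # confirm the new version before retrying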

@dirkmc thoughts?

rpodgorny commented 4 years ago

Same behaviour with 0.6.0-rc1 (compiled from git): I can see some streams on both nodes, but the wantlist does not seem to be getting transferred. :-(

rpodgorny commented 4 years ago

Also, this may be somehow connected to a "provide" feature/bug. Currently on node_1's wantlist: QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU

node_0:

ipfs cat QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU | md5sum
bbfe0cc17f6a01caf8434c858e0caa34  -

...so the block is there. but:

on both node_0 and node_1:

ipfs dht findprovs QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU
<NOTHING>

after almost two minutes.

So it's almost like node_0 is not properly announcing its data, and combined with "we don't send every wantlist to every peer, only to peers that we think have the content", this may lead to this download-being-stuck situation.

rpodgorny commented 4 years ago

After a forceful:

ipfs dht provide QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU

on node_0, I now correctly see:

ipfs dht findprovs QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU
QmYLhBZsWL2MQuNXFfn6Wr5snTTq2ys121nQ9FQoZnMLYa

on both node_0 and node_1. But even though there's still a connection between the nodes, QmVQ9tTPJeG2uWgjky4UWt4uubqRJeUpKGCG3qkYdM5BUU still stays in node_1's wantlist and does not appear in node_0's wantlist listing for node_1. :-(

...even after running the findprovs. I'd expect that manually finding the provider on node_1 (which returns node_0, and considering we have a connection to that node) would immediately solve the situation.

Stebalien commented 4 years ago

> So it's almost like node_0 is not properly announcing its data, and combined with "we don't send every wantlist to every peer, only to peers that we think have the content", this may lead to this download-being-stuck situation.

Sorry, we don't send the wantlist to all peers until we timeout. Then, we broadcast to all peers. That's why this is really strange.

rpodgorny commented 4 years ago

Is there anything I can do to further investigate the issue?

dirkmc commented 4 years ago

@rpodgorny & @vrde thanks for taking the time to follow up on this issue, it really helps us a lot to receive these bug reports.

I've been trying to reproduce this locally with some unit tests inside the go-bitswap project, but I haven't been able to so far. It may help us understand where the problem is if you output some debug logging while performing the `ipfs get`, on both the requesting node and the responding node.

To output debug-level logging in the daemon window for a particular subsystem, run the `ipfs log level` command from another window, e.g. `ipfs log level bs:sess debug`.

The subsystems that may output useful information:
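
If you're unsure of the exact logger names on your build, `ipfs log ls` prints them; then the interesting ones can be raised to debug individually, for example (the `bitswap` name is an assumption, check the list first):

ipfs log ls                    # list the logging subsystems this build knows about
ipfs log level bs:sess debug   # bitswap session logic, as above
ipfs log level bitswap debug   # top-level bitswap logger, if present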

RubenKelevra commented 4 years ago

@rpodgorny wrote:

> maybe similar/dupe of #7211 ?

I don't think so. #7211 is more about the loss of DHT functionality after a disconnect, where it isn't properly restored/detected.

I'm currently checking it on master, and it seems to be gone. :)

That's why I haven't updated that ticket in some time.