ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

possible bitswap stall issue #5183

Closed: whyrusleeping closed this issue 5 years ago

whyrusleeping commented 6 years ago

In IRC, @fiatjaf reported that one of his ipfs nodes running 0.4.15 on a VPS was stalling trying to list out a particular directory. I confirmed that all the data was accessible, and even fetched it all to my local node. He then connected that VPS peer to my node, and it still couldn't fetch the data. The peer's wantlist showed a single hash in it, one that my node definitely had. If the nodes were actually connected successfully, then this implies a possible bug in bitswap.

Further questions I have here are around whether or not the fetch was using sessions. Getting a stack dump of any node in this position would be nice too.
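
For reference, a full goroutine stack dump can be pulled from a running go-ipfs daemon through the pprof handlers exposed on its API port. A minimal sketch in Go, assuming the daemon's API is listening on the default 127.0.0.1:5001:

    // stackdump.go: fetch a full goroutine dump from a local go-ipfs daemon.
    // It assumes the API is listening on the default 127.0.0.1:5001; adjust as needed.
    package main

    import (
        "io"
        "log"
        "net/http"
        "os"
    )

    func main() {
        // debug=2 asks the pprof handler for a full, human-readable stack dump.
        resp, err := http.Get("http://127.0.0.1:5001/debug/pprof/goroutine?debug=2")
        if err != nil {
            log.Fatalf("fetching goroutine dump: %v", err)
        }
        defer resp.Body.Close()

        out, err := os.Create("ipfs.stacks")
        if err != nil {
            log.Fatalf("creating output file: %v", err)
        }
        defer out.Close()

        if _, err := io.Copy(out, resp.Body); err != nil {
            log.Fatalf("writing dump: %v", err)
        }
        log.Println("wrote goroutine dump to ipfs.stacks")
    }

The resulting file is the same kind of dump as the ipfs.stacks attachments that appear later in this thread.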

fiatjaf commented 6 years ago

Actually, the wantlist output I pasted was from after I had restarted the node and the problem was solved. I don't know how it looked before, or how many hashes were in the wantlist.

skliarie commented 6 years ago

I have the same story with a number of hashes. When using ipfs-go (with Chrome's ipfs-companion), the hash does not download. If I disable ipfs-companion (thus enabling ipfs-js), the download works fast. How can I debug the issue or provide you with the necessary logs?

Stebalien commented 6 years ago

@skliarie

Also, try to reproduce with the latest release candidate running on both machines: https://dist.ipfs.io/go-ipfs/v0.4.16-rc1

skliarie commented 6 years ago
  1. (I ran ipfs get in a separate window):

    ipfs@ipfs1:~$ ipfs bitswap wantlist QmbwEsezethaQhtrUosVCQFccn7Ze6KSENo2hnXv3aXfKP

  2. Output of "ipfs swarm peers --streams": ipfs_swarm_peers_streams.gz

  3. No idea how to get the ID of the peer with the given hash. You can find it yourself; the hash is publicly available (e.g. ipfs-js can fetch it quickly).

  4. Find attached (using ipfs 0.4.15): ipfs.sysinfo.gz, ipfs.stacks.gz, ipfs.heap.gz, ipfs.gz, ipfs.cpuprof.gz

  5. Will try ipfs 0.4.16-rc1 shortly.

skliarie commented 6 years ago

Tested on 0.4.16-rc1 (amd64); same problem. Attached debug data: ipfs.sysinfo.gz, ipfs.stacks.gz, ipfs.heap.gz, ipfs.cpuprof.gz

Stebalien commented 6 years ago

> No idea how to get ID of the peer with the given hash. You can find it yourself, the hash is publicly available (e.g. ipfs-js can fetch it quickly).

Ah... So, js-ipfs doesn't use the DHT to find or announce content (last time I checked). I'm guessing:

  1. Nobody has announced that they have the hash in question to the DHT. A quick ipfs dht findprovs QmbwEsezethaQhtrUosVCQFccn7Ze6KSENo2hnXv3aXfKP yields no results.
  2. Your js-ipfs node is getting automatically connected to the node with the content in question.

Looking at the debug info, I don't see any obvious deadlocks/issues. Without knowing which node has the hash, it's a bit difficult to tell where the issue is.
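
(Aside: the same provider check can be scripted against the daemon's HTTP API instead of the CLI. The sketch below assumes the default API address of 127.0.0.1:5001 and the shape of the streamed /api/v0/dht/findprovs query events as seen in JSON output; in particular it treats event type 4 as "provider found", which is an assumption, not a documented contract.)

    // findprovs.go: ask the local daemon which peers the DHT lists as providers
    // for a given hash, via the HTTP API. The struct fields are assumptions based
    // on observed JSON output, not a documented schema.
    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "log"
        "net/http"
        "os"
    )

    // event mirrors a subset of the streamed routing query events.
    type event struct {
        ID        string
        Type      int
        Responses []struct {
            ID    string
            Addrs []string
        }
    }

    func main() {
        if len(os.Args) < 2 {
            log.Fatal("usage: findprovs <hash>")
        }
        url := "http://127.0.0.1:5001/api/v0/dht/findprovs?arg=" + os.Args[1]
        resp, err := http.Post(url, "", nil)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        dec := json.NewDecoder(resp.Body) // the API streams newline-delimited JSON events
        for {
            var ev event
            if err := dec.Decode(&ev); err == io.EOF {
                break
            } else if err != nil {
                log.Fatal(err)
            }
            if ev.Type == 4 { // assumed: type 4 marks a "provider found" event
                for _, r := range ev.Responses {
                    fmt.Println("provider:", r.ID)
                }
            }
        }
    }

If this prints nothing, the practical workaround implied above is to do what the js-ipfs node is effectively doing: connect the fetching node directly to the node that holds the content (ipfs swarm connect <multiaddr>) so bitswap does not depend on DHT provider records.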

skliarie commented 6 years ago

Something strange is going on. Here is another interesting hash: QmbeSiN8d7wxfonK5ahikVnwYmhJw14gfW4uNjrm8UEjW3

ipfs-js finds it pretty quickly, but ipfs-go has problems with it. Somehow, my public node QmTtggHgG1tjAHrHfBDBLPmUvn5BwNRpZY4qMJRXnQ7bQj (0.4.16-rc1) managed to download it in the past (e.g. "ipfs get" works), but getting it from another node (also 0.4.16-rc1) does not:

$ ipfs dht findprovs QmbeSiN8d7wxfonK5ahikVnwYmhJw14gfW4uNjrm8UEjW3
Error: routing service is not a DHT

What is going on? How could it be that ipfs-go and ipfs-js have different (incompatible?) routing services?

Stebalien commented 6 years ago

That's a new bug, fixed in #5200. It could very well have caused the issue. To work around it, you can disable IPNS over pubsub.

However, that also wouldn't (as far as I know) be responsible for the original bug.

Stebalien commented 6 years ago

One potential cause is a peer restart. That is, if one of the peers restarts but the other sees the new connection before seeing the old connection close, it won't re-send the wantlist.

We can fix this by either:

  1. Keeping state per stream (sending the entire wantlist every time we open a new stream).
  2. Using some form of bitswap "session" ID.
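
To make option 1 concrete, here is a rough sketch of the idea; the types are hypothetical stand-ins, not the actual go-bitswap internals. Every new connection to a peer is treated as a fresh start and triggers a re-send of the complete wantlist, so a restarted peer cannot be left with an empty picture of what we want.

    // A sketch of option 1: on every (re)connect, push the entire current wantlist.
    // All names here are hypothetical stand-ins for the real go-bitswap internals.
    package main

    import (
        "fmt"
        "sync"
    )

    type CID string    // stand-in for a content identifier
    type PeerID string // stand-in for a libp2p peer ID

    // sender abstracts "deliver these wants to that peer" (hypothetical interface).
    type sender interface {
        SendWants(p PeerID, wants []CID)
    }

    type wantManager struct {
        mu    sync.Mutex
        wants map[CID]struct{}
        out   sender
    }

    func newWantManager(out sender) *wantManager {
        return &wantManager{wants: make(map[CID]struct{}), out: out}
    }

    // Want records a block we are looking for.
    func (wm *wantManager) Want(c CID) {
        wm.mu.Lock()
        defer wm.mu.Unlock()
        wm.wants[c] = struct{}{}
    }

    // PeerConnected runs for every new connection, including a reconnect after the
    // remote restarts; it always re-sends the full wantlist instead of assuming the
    // peer still remembers the previous one.
    func (wm *wantManager) PeerConnected(p PeerID) {
        wm.mu.Lock()
        snapshot := make([]CID, 0, len(wm.wants))
        for c := range wm.wants {
            snapshot = append(snapshot, c)
        }
        wm.mu.Unlock()
        wm.out.SendWants(p, snapshot)
    }

    // printSender is a trivial sender used only for demonstration.
    type printSender struct{}

    func (printSender) SendWants(p PeerID, wants []CID) {
        fmt.Printf("re-sending %d wants to %s\n", len(wants), p)
    }

    func main() {
        wm := newWantManager(printSender{})
        wm.Want("QmbwEsezethaQhtrUosVCQFccn7Ze6KSENo2hnXv3aXfKP")
        wm.PeerConnected("QmExamplePeer") // a reconnect pushes the full wantlist again
    }

Option 2 would instead attach some form of session or epoch identifier to the wantlist so the receiving peer can detect that its stored copy is stale.
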
ninkisa commented 5 years ago

Hello, is there a fix for this? We have a similar problem running ipfs in a private network.

At one point ipfs just stops downloading. We are using ipfs-go, and ipfs is running in a Docker container:

    [machine02]$ docker exec ipfs_container ipfs version
    ipfs version 0.4.18

Result from "ipfs bitswap wantlist" and "ipfs swarm peers --streams": ipfs_bitswap.txt

Debug logs: ipfs_stacks.zip

Is it possible that the cause is the use of the QUIC protocol?

Thanks in advance for the support

Stebalien commented 5 years ago

QUIC shouldn't affect this. We're going to cut a new release ASAP, probably by the end of the week, with a completely refactored bitswap, so let's see what that does for this.

Stebalien commented 5 years ago

New information: @mattober has run into this issue. He has two nodes: A gateway and a "host" (storing the data).

The gateway shows two connections to the host, one IPv4 and one IPv6. The IPv4 connection has an open DHT stream, and the IPv6 connection has an open DHT stream (!?) and an open bitswap stream.

The host shows one connection to the gateway (IPv4). This connection has an open DHT stream and an open relay stream (!?) and no bitswap stream.
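
This kind of check can also be automated against the daemon's HTTP API. The sketch below assumes the default API address, the field names seen in ipfs swarm peers --streams JSON output, and that bitswap streams are negotiated under /ipfs/bitswap* protocol IDs; treat all three as assumptions.

    // bitswapstreams.go: list connections reported by the local daemon and flag the
    // ones with no open bitswap stream, mirroring the manual check described above.
    // API address, field names, and protocol IDs are assumptions.
    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "net/http"
        "strings"
    )

    type swarmPeers struct {
        Peers []struct {
            Peer    string
            Addr    string
            Streams []struct {
                Protocol string
            }
        }
    }

    func main() {
        resp, err := http.Post("http://127.0.0.1:5001/api/v0/swarm/peers?streams=true", "", nil)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        var out swarmPeers
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            log.Fatal(err)
        }

        for _, p := range out.Peers {
            hasBitswap := false
            for _, s := range p.Streams {
                // Bitswap streams are expected under /ipfs/bitswap* protocol IDs.
                if strings.HasPrefix(s.Protocol, "/ipfs/bitswap") {
                    hasBitswap = true
                    break
                }
            }
            if !hasBitswap {
                fmt.Printf("no bitswap stream on connection to %s (%s)\n", p.Peer, p.Addr)
            }
        }
    }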

Stebalien commented 5 years ago

Related: https://github.com/ipfs/go-bitswap/issues/99, https://github.com/ipfs/go-bitswap/issues/99#issuecomment-476355204.

hannahhoward commented 5 years ago

I feel there is no specific, actionable information here that applies to the current version of bitswap, given that it has been almost completely rewritten since 0.4.15, and the only remaining potential problem referenced is peers not re-sending wantlists on reconnect. My belief is that we've addressed this as best we can with the periodic wantlist rebroadcast; beyond that, there's no further improvement to be made short of error correction in the protocol. So I am inclined to close this issue. For anyone following it, understand that we are still pursuing avenues to address potential stalls on an ongoing basis as we identify issues in the current code.
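
For context, the periodic wantlist rebroadcast mentioned here boils down to a loop along these lines. This is a simplified sketch with made-up names, not the real go-bitswap code; the interval is shortened so the example finishes quickly.

    // A simplified sketch of a periodic wantlist rebroadcast loop; the names are
    // made up and the interval is shortened for demonstration.
    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // rebroadcast stands in for "send the full current wantlist to every peer".
    func rebroadcast() {
        fmt.Println("rebroadcasting full wantlist to all connected peers")
    }

    // periodicRebroadcast re-sends the wantlist on a timer, so a wantlist entry lost
    // to a reconnect race or a dropped message is eventually re-advertised anyway.
    func periodicRebroadcast(ctx context.Context, interval time.Duration) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                rebroadcast()
            case <-ctx.Done():
                return
            }
        }
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
        defer cancel()
        periodicRebroadcast(ctx, time.Second)
    }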