Pinning large CID is "stuck" at same node

kallisti5 commented 3 years ago

Version information:

go-ipfs version: 0.8.0
Repo version: 11
System version: arm64/linux
Golang version: go1.15.8

I have 22GiB+ on MFS data on a node at Vultr. Latest MFS CID is pointed to via /ipns/hpkg.haiku-os.org

Description:

I pin /ipns/hpkg.haiku-os.org on my home IPFS node. (either of the below, same result)
- ipfs pin add --progress /ipns/hpkg.haiku-os.org
- ipfs pin add --progress /ipfs/QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37
It downloads 20GiB of data
pin stuck on: Fetched/Processed 95036 nodes for like 2 days.
Restart ipfs. 20GiB deleted (--enable-gc, ugh)
I pin /ipns/hpkg.haiku-os.org again
24 hours later, 22 GiB of data in repo.
pin once again stuck on: "Fetched/Processed 95036 nodes" for ~12 hours now.

sysstat from home node(while stuck pinning)

# sar 5 1
Linux 5.9.0-0.bpo.2-arm64 (ipfsgw.discord.local)    05/02/21    _aarch64_   (4 CPU)

13:06:38        CPU     %user     %nice   %system   %iowait    %steal     %idle
13:06:43        all      0.10     12.60      4.63      0.15      0.00     82.51
Average:        all      0.10     12.60      4.63      0.15      0.00     82.51

root@ipfsgw:~# sar -d 5 1
Linux 5.9.0-0.bpo.2-arm64 (ipfsgw.discord.local)    05/02/21    _aarch64_   (4 CPU)

13:05:56          DEV       tps     rkB/s     wkB/s   areq-sz    aqu-sz     await     svctm     %util
13:06:01     dev179-0      0.20      0.80      0.00      4.00      0.00      1.00      4.00      0.08
13:06:01       dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
13:06:01      dev8-16      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Debug from home node (while stuck pinning)

curl localhost:5001/debug/pprof/goroutine\?debug=2 > ipfs.stacks
curl localhost:5001/debug/pprof/profile > ipfs.cpuprof
curl localhost:5001/debug/pprof/heap > ipfs.heap
curl localhost:5001/debug/vars > ipfs.vars
ipfs diag sys > ipfs.sysinfo

homenode-debug.tar.gz

aschmahmann commented 3 years ago

@kallisti5 could you grab a little more debugging info:

Vultr node
- the debug info
- ipfs config show
- run ipfs dag stat --offline QmSSxDK8br4Z5UHQBXZWuXew8uP2uJ98oYT1HQAw7QrwKT to make sure it completes ok
- If not then it means the data isn't on the node you think it's on
- reboot the node and see if the problem resolves on the client end
- To avoid any debugging confusion around data discovery from the client try doing ipfs swarm connect /full/address/of/vultr/node on the client node after the reboot
home node
- ipfs config show
- ipfs bitswap stat should show some of the blocks you are waiting on, if those blocks are part of the QmSS... graph it could be easier to determine what's going on

This might be a duplicate of #7972, which appears to be fixed in master of go-bitswap (the dependency update isn't yet included in go-ipfs master).

kallisti5 commented 3 years ago

Home node (wants)

ipfs@ipfsgw:~$ ipfs bitswap stat
bitswap status
        provides buffer: 0 / 256
        blocks received: 117310
        blocks sent: 12037
        data received: 23934805330
        data sent: 2561571207
        dup blocks received: 4393
        dup data received: 691026392
        wantlist [1 keys]
                Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
        partners [763]

Vultr node (has):

ipfs bitswap stat
bitswap status
    provides buffer: 0 / 256
    blocks received: 0
    blocks sent: 99009
    data received: 0
    data sent: 22941823079
    dup blocks received: 0
    dup data received: 0
    wantlist [0 keys]
    partners [785]

Vultr Node (has) (putting the wantlist from my home node into dag stat for the vultr node:)

# ipfs dag stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
Size: 0, NumBlocks: 1

Vultr Node (has):

root@ipfs:~# ipfs dag stat --offline QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37
Size: 5111615603, NumBlocks: 22289
(still running. I'll update here if it fails)

kallisti5 commented 3 years ago

ipfsconfig-home.txt ipfsconfig-vultr.txt

aschmahmann commented 3 years ago

I pinged your Vultr node and it responds as not having Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD. If running ipfs block stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD on the Vultr node that would be interesting.

I noticed you updated the CID in your comment that you were trying to pin, since you have garbage collection enabled then it's possible that Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD has since been cleaned up on the Vultr node and so if the home node is still looking for it it's not going to find it.

kallisti5 commented 3 years ago

I pinged your Vultr node and it responds as not having Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD. If running ipfs block stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD on the Vultr node that would be interesting.

Ooh.. I think this might be the root cause. I posted ipfs dag stat --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD above from the Vultr node showing it did indeed have that block (though oddly empty). (Size: 0, NumBlocks: 1 )

Check this out (vultr node)

root@ipfs:~# ipfs dag get --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD
{"data":null,"links":[]}

Here's the completed diag stat for the whole thing on the vultr node:

ipfs dag stat --offline QmPZvhACHCf1uXr5nVwvpSJSYjQviShZkoZ4ctx6LQ6q37
Size: 22827570312, NumBlocks: 95529

So... locally it thinks it has that block, but remotely it reports not having it?

aschmahmann commented 3 years ago

root@ipfs:~# ipfs dag get --offline Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD {"data":null,"links":[]}

This seems a bit weird to me since I'd expect an empty DagPb node to have the hash QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n DagPb spec. I wonder if there is some datastore corruption (which would cause your node to send back data but my node to not recognize it as having the CID I was expecting).

Could you run ipfs dag export Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD > empty.car and then ipfs add the car file and post the CID here so I can take a look? Note: What I'm asking you to do is basically export the DAG underneath that CID (which is just a single node in this case) to a file so we can investigate it even if it has some corruption that is preventing it from being sent over the network.

It may take a while but it'd be helpful to run ipfs repo verify to check on repo corruption issues.

kallisti5 commented 3 years ago

# ipfs dag export Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD > empty.car
 0s  92 B / ? [-------------------------------------------------------------------------------------------------------------------=-------------------------------------------------------------------------------------------] 3.22 KiB/s 0s 
# ipfs add empty.car
added QmdWr79vNxtjzVdQBvK4BVH4K94jD9g54t2BzRAuMzUjVc empty.car

Running a ipfs repo verify now

aschmahmann commented 3 years ago

❯ ipfs dag import .\QmdWr79vNxtjzVdQBvK4BVH4K94jD9g54t2BzRAuMzUjVc
Pinned root     Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD  FAILED: blockstore: block not found
Error: unable to pin all roots: 1 out of 1 failed

Ya, something weird is going on. I'm guessing a corrupted block. The CAR file contains a single block which is the empty block with the CID QmdfTbBqBPQ7VNxZEYEj14VmRuZBkqFbiwReogJgS1zR1n.

kallisti5 commented 3 years ago

Yup.

# ipfs repo verify
block QmTRQhKWeLJUDvLqxDZmBkLwJSSy7B3Er9ytAhQcDiG1Lf was corrupt (block in storage has different hash than requested)
block QmZ7rUYoMm4cpnxCHS9KBJeuUPmcyZsg4Qgi3EAjUq3iTM was corrupt (block in storage has different hash than requested)
block Qme5ubrNa932DxVxMtTkkjpiaKMLhVPfoFszyGfhS5BctD was corrupt (block in storage has different hash than requested)

How do you even fix that when it's MFS? (is there a way to identify which files and re-upload them? I don't really understand how they got corrupted.. the node is brand new and I just added the files without any real issues.

kallisti5 commented 3 years ago

To wrap this one up. I ended up just deleting the repo and re-adding the data under flatfile instead of badger. Things are smooth now.

Since this is the "seed" node for MFS data, repo stability is important. It sounds like badger is better for user nodes, while flat is better for "critical data sources"

However... this case might be useful for some enhancement? I had CID's that nothing saw as invalid (and the MFS filesystem was fully available) until I did a full repo validation. This seems like a potentially common issue.

godcong commented 3 years ago

I have had random stuck events. But retry after the interruption passed. I am using go-http-api-client I was stuck for a whole day while pinning a cid.

Is this the same problem? @kallisti5

kallisti5 commented 3 years ago

@godcong

I have had random stuck events. But retry after the interruption passed.

There's a mix of issues here:

Repo corruption isn't obvious, and IPFS doesn't "notice" until you repo verify (which is slow)
Pinning large data sets, if IPFS can't find a 'valid' copy of data it just hangs 'waiting' for it
Pinning progress isn't really clear (https://github.com/ipfs/go-ipfs/issues/8112)
Enabled garbage collection and a low GCThreshold means if you 'interrupt' a pin of data over your GCThreshold, the node immediately will garbage collect your progress.
- https://github.com/ipfs/go-ipfs/issues/3121 seems to be the "fix" for this one?
- An alternative "nice feature" would be to wait XX hours before garbage collecting any new data chunks. (giving users a chance at "resuming")

These are the "biggest" issues we encountered trying to use IPFS at scale. They represent some really frustrating user experiences.

ipfs / kubo