ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Garbage collector deletes data stored in the MFS (which was pinned) #7008

Open RubenKelevra opened 4 years ago

RubenKelevra commented 4 years ago

Version information:

go-ipfs version: 0.4.23-6ce9a355f
Repo version: 7
System version: amd64/linux
Golang version: go1.14

Description:

I'm using IPFS in a script which updates the local MFS as needed. New files are added with ipfs files cp /ipfs/<cid> /path/to/file after ipfs-cluster-ctl has added them to the cluster.

So the files are pinned locally (by the cluster service) and also stored in the MFS.

Files which should be deleted are removed from the MFS, and I use ipfs-cluster-ctl to set an expiry timeout of 14 days on the pin. A sketch of one such update cycle follows below.
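A minimal sketch of one update cycle, with placeholder paths and CIDs (the exact invocations are my reconstruction from the description above, not taken from the script):

$ ipfs-cluster-ctl add /local/new-file.img              # pin the new file via the cluster; prints its CID
$ ipfs files cp /ipfs/<cid> /path/to/new-file.img       # reference the same CID in the MFS
$ ipfs files rm /path/to/old-file.img                   # drop a stale file from the MFS
$ ipfs-cluster-ctl pin add --expire-in 336h <old-cid>   # keep its pin for 14 more days (336h)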

Since I started adding a lot of files to the repo, I decided to let the garbage collector deal with old content and clean up the repo.
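The report doesn't say how the GC was run; it would be one of the two standard ways:

$ ipfs repo gc              # one-off garbage collection
$ ipfs daemon --enable-gc   # daemon with periodic GC enabled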

After the garbage collector completed its work, I can no longer get the hashes or the content of some files stored in the MFS. This is unexpected and should not happen (as far as I understand).

ipfs files ls /path/to/file/ | grep "filename" shows that the directory still contains the file when the daemon is freshly started. After a files stat --hash on the file, the directory cannot be listed anymore until the daemon is restarted.

$ ipfs files stat --hash --timeout 120s /path/to/a/file.img
Error: Post "http://127.0.0.1:5001/api/v0/files/stat?...&hash=true&stream-channels=true&timeout=120s": context deadline exceeded

ipfs-cluster-ctl shows me the CID and that it's allocated on the local node (and pinned).

ipfs dht findprovs <CID> (the CID taken from ipfs-cluster-ctl) returns no results, which explains why I cannot access the file anymore.

ipfs pin ls --timeout=120s /ipfs/<CID> results in a timeout.

$ ipfs repo verify returns a successful integrity check of the repo.
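Taken together, the checks look roughly like this (<CID> is the value reported by ipfs-cluster-ctl; the comments paraphrase the outcomes described above):

$ ipfs dht findprovs --timeout=120s <CID>   # returns no providers, not even the local node
$ ipfs pin ls --timeout=120s /ipfs/<CID>    # times out
$ ipfs repo verify                          # reports a successful integrity check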

IPFS and IPFS-Cluster store their blocks and databases on a ZFS filesystem, which reports no integrity errors.

RubenKelevra commented 4 years ago

After a fresh start of the ipfs daemon, I cannot remove the one file I have identified so far from the MFS.

$ ipfs files rm /path/to/file.bin does not return

I'm trying to recover from the situation by just adding all files to the IPFS repo again (with pin=0). Hopefully only the blocks are missing and the metadata is not corrupt.
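Assuming "pin=0" above means the --pin=false option of ipfs add, the recovery step would look roughly like this, with a placeholder backup path:

$ ipfs add -r --pin=false /backup/data    # restore missing blocks without creating new pins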

RubenKelevra commented 4 years ago

So the issue is 'just' missing blocks, which also leads to unfulfillable requests like files stat --hash on a file with missing blocks, and a non-working files rm.

After adding all files again without pinning, I could remove the problematic file, and I found 3 other files whose blocks were also missing. I added them back from a backup and could continue.

So the GC seems to be unsafe to use while anything is happening in the MFS. Especially worrying for me was that the file was in the MFS and pinned too. Since the files were all pinned, I don't see how this happened in the first place. Maybe ipfs-cluster-service unpins and immediately re-pins when I add a timeout to a pin with ipfs-cluster-ctl pin add --expire-in, and the file got removed during the short window in which it was unpinned.
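If that hypothesis is right, the failure mode would be equivalent to this sequence (a hedged reconstruction, not something taken from the cluster service's code):

$ ipfs pin rm <cid>    # cluster service drops the old pin in order to replace it
$ ipfs repo gc         # GC happens to run in this window and sweeps the blocks
$ ipfs pin add <cid>   # the re-pin finds nothing locally and has to fetch from the network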

This still doesn't explain why a file which is in the MFS can lose its blocks when the GC runs.

ribasushi commented 4 years ago

This sounds like a missing lock somewhere. The team is in overdrive right now trying to get https://github.com/ipfs/go-ipfs/issues/6776 out the door, so the response might be delayed by a week or two. Sorry about that!

RubenKelevra commented 4 years ago

@ribasushi I don't expect this one to be prioritized, since it's just a race condition anyway; maybe it only happens in my setup and similar ones.

But I think it should be reviewed once the first RC is out, just to make sure it's not a widespread issue. :)

I commented several times to document my recovery efforts and capture as much information about this event as possible, not to push the issue.

Some thoughts on this topic:

There was no error, warning, or info message while this happened, nor afterwards while the access was not possible.

I'm wondering how files stat --hash can be impacted by missing data, since a simple files ls can list the content of the folder. I think a stat with --hash is trying to read too much data; it should just access the directory listing and return the hash.

I'm not sure how files rm can fail if the element is missing. I think this could be optimized too, so that it doesn't require access to the data behind a CID when the user requests its removal. GC would remove the CID and any remaining blocks anyway, since they are no longer referenced. Or am I missing something? 🤔
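To illustrate the asymmetry: the parent directory block already contains the child's CID, so in principle the second command should be as cheap as the first, yet only the first one succeeds (paths are placeholders):

$ ipfs files ls /path/to/file/                     # works: reads only the parent directory block
$ ipfs files stat --hash /path/to/file/file.img    # hangs: apparently tries to touch the child's missing blocks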

RubenKelevra commented 4 years ago

I can confirm this bug for this version as well:

go-ipfs version: 0.5.0-dev-6c45f9ed9
Repo version: 9
System version: amd64/linux
Golang version: go1.13.8

I basically have to stop my scripts and add the data back to the repo with pin=0 after each run of the GC, to make sure everything is still available to IPFS :/
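A sketch of that workaround, with /srv/mirror standing in for a hypothetical local copy of all the data:

$ ipfs repo gc                           # run the GC...
$ ipfs add -r --pin=false /srv/mirror    # ...then re-add everything so wrongly swept blocks come back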

schomatis commented 2 years ago

Probably related to #6113.