hashgraph / guardian

The Guardian is an innovative open-source platform that streamlines the creation, management, and verification of digital environmental assets. It leverages a customizable Policy Workflow Engine and Web3 technology to ensure transparent and fraud-proof operations, making it a key tool for transforming sustainability practices and carbon markets.

Guardian pulls large amounts of foreign IPFS data onto local disk #3206

Open AlexIvanHoward opened 6 months ago

AlexIvanHoward commented 6 months ago

Problem description

Our Guardian instance, which is configured to use web3.storage as its IPFS pinning service provider, pulls large amounts of IPFS data onto local disk. The majority of this data is not our own. For example, we currently have a mere 1.5 MB of data on our web3.storage account, but our Guardian instance has already pulled 22 GB of IPFS data from the web into its ./runtime-data/ipfs/data/ directory since we completely cleared that directory a week ago. This is not the result of a once-off pull of data; it seems to be a continuous process, because the amount of IPFS data in the directory keeps growing even when we are not doing anything whatsoever on our Guardian instance. I have been monitoring this for a while on an instance that currently has no users other than myself. I check the size of the ./runtime-data/ipfs/data/ directory as the last thing at the end of my workday, and then again as the very first thing the next morning. The directory is consistently at least ~1 GB larger in the morning than when I stopped working the previous evening.
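For reference, a minimal way to track this growth from the shell (assuming the default ./runtime-data layout mentioned above) is something like:

    # Total size of the block data the Kubo container has pulled onto local disk
    du -sh ./runtime-data/ipfs/data/

    # Overall disk usage, to see how close the machine is to running out of space
    df -h /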

It has also happened numerous times now that a huge amount of IPFS data was pulled onto disk in a very short period of time, ultimately causing our cloud instance to crash because the Guardian tried to pull more IPFS data onto its disk than the machine had space for. A week ago, for example, our machine - which currently has 200 GB of disk space allocated to it - crashed overnight. When I inspected the situation in the morning, I found that the crash was caused by the Guardian pulling 147 GB of data from IPFS, leaving no space on the local disk. Our own web3.storage IPFS account did not even have 1.4 MB of data in it at the time.

We noticed this behaviour for the first time during the first half of 2023. Unfortunately I cannot provide more precise information on, for example, which release this started with.

We don't know if this also happens on mainnet; our Guardian instances are all still running on testnet.

I am not sure if this is a bug or if it is related to tickets such as #2629 and #3046. My expectation, however, is that a Guardian instance, when the IPFS storage provider is a cloud-based provider (such as web3.storage) and therefore NOT a local IPFS node, should not be pulling any IPFS data onto local disk which is not in the IPFS account of the Guardian instance itself.

Given the problem description above, I have two questions:

  1. Why does the Guardian do this?
  2. Is there any way to prevent the Guardian from doing this?
anvabr commented 5 months ago

@AlexIvanHoward Thank you for the bug report. Guardian should not be doing this, and most definitely not when it interacts with IPFS via web3.storage. Can you please clarify, if at all possible:

AlexIvanHoward commented 5 months ago
  1. I've seen this most recently on Guardian version 2.21.1, so my most recent experience of this is on the new w3up API. I have, however, also seen it on Guardian versions using the "legacy" web3storage API.

  2. Attached is a copy of my .env.develop.guardian.system file. dotenv.develop.guardian.system.txt

  3. Below are some screenshots etc. for more context and information:

3.a. There is currently 79 GB of data in our Guardian instance's ./runtime-data/ipfs/data/blocks directory, but according to our web3.storage account, we only currently have 2.9 MB of data stored across all our spaces.


3.b. A list of all the directories currently in ./runtime-data/ipfs/data/blocks is shown in the attached screenshot. There are currently 1027 directories.

3.c. The data is distributed fairly evenly across the directories in ./runtime-data/ipfs/data/blocks. Most directories contain between 70 MB and 90 MB of data. The attached screenshot shows the sizes of the first 13 directories.

3.d. The first few files in, for example, the ./runtime-data/ipfs/data/blocks/EP directory are shown in the attached screenshot. There are currently 341 such files in our EP directory, and each is on average 256 KB in size (as can be expected, because these are all IPFS blocks).

3.e. An example of one of the files in the EP directory. Note that I have changed the file's extension from .data to .txt to bypass GitHub's block on files with non-standard extensions. Please change it back to .data after downloading on your side. The content, however, is raw binary data. I have not tried it yet, but according to the IPFS documentation, one should be able to view the contents of an IPFS block .data file like this one using the 'ipfs cat' command or something similar.

CIQA3K7FGVFJR4ACBW7CVGK5D34P7WCKVMLFC6EK3ZNDR7KPIVE6EPA.txt
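For anyone who wants to inspect a block like this, a rough sketch using the standard Kubo CLI follows; note that it takes the block's CID (the placeholder <cid> below), not the .data filename itself:

    # Replace <cid> with the CID of the block or file you want to inspect
    ipfs block stat <cid>              # print the block's size and key
    ipfs block get <cid> > block.bin   # dump the raw block bytes to a file
    ipfs cat <cid>                     # reassemble and print the content, if the CID refers to a file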

In summary, and based on my fairly amateur evaluation, there seems to be nothing strange going on in the IPFS node's blocks/ directory, except that it is pulling a large number of blocks that are not part of our own Guardian data.

@mattsmithies and @MatYarger, can you perhaps also provide this information from your side?

anvabr commented 5 months ago

Thank you @AlexIvanHoward, we are working on this. It must be some sort of default configuration issue of the IPFS node since, per your configuration file, Guardian 'reads' files via the public IPFS gateway, not the local IPFS node:

IPFS_PUBLIC_GATEWAY='https://ipfs.io/ipfs/${cid}'

It would be great to hear from @mattsmithies whether this setup matches his. Perhaps this somehow confuses the node, although on the other hand everything Guardian itself writes into IPFS it first writes into the local DB - this is where it reads it from when/if needed. So it'll only reach out to the public gateway for 'external' artifacts.
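For context, with that setting a Guardian 'read' of an external artifact presumably amounts to an HTTP GET against the public gateway, with ${cid} substituted - roughly equivalent to:

    # ${cid} in IPFS_PUBLIC_GATEWAY is replaced with the content identifier being read
    curl -s "https://ipfs.io/ipfs/<cid>" -o artifact.bin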

MatYarger commented 5 months ago

Hey @AlexIvanHoward yeah, we hit the same issue on our side. We're running our Guardian instance on a 128GB VM, and it'll max out. Our IPFS/block directory will shoot up to 79GB and crash our VM, forcing us into a lengthy recovery process. It seems like our VMs are aligned or something since we're getting hit with the same exact amount of data increase and getting the same crashing result.

anvabr commented 5 months ago

Thank you for the clarifications @MatYarger. Could I please ask you to detail your Guardian config IPFS options (the file is called .env.<develop>.guardian.system or something similar - you modified it when you were installing Guardian), in particular the value of IPFS_PUBLIC_GATEWAY? Just want to make sure we understand the setup in case it is relevant.

MatYarger commented 5 months ago

Yeah we can get that for you in a bit. @dyrellC has all the config options on his side, so he should be able to pull that for you.

Neurone commented 5 months ago

Hi, if you use an external pinning service like web3.storage and the public ipfs.io gateway, you can solve the issue by entirely disabling the Kubo container; it's not used in that scenario.

If you want to run a local node, that should be fine too, though I suggest creating your own node separate from Guardian (the local Kubo image in the repo is for development purposes only).

In general, the IPFS node should not download data if not directly requested by some client unless there is a bug in the Kubo image.

This seems to be a bug, so we can file an issue in the Kubo repo, but in the meantime the quick solution is to enable garbage collection. The default config allows for a max of 10 GB of stored data, triggering GC at 90%, with checks every hour, but you can tweak those parameters by changing the config file.
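For example, a sketch of how those parameters can be adjusted with the standard Kubo CLI (run inside the Kubo container, e.g. via docker exec; the values below are only illustrative):

    # Raise the repo size limit and adjust the GC thresholds (defaults: 10GB, 90, 1h)
    ipfs config Datastore.StorageMax 20GB
    ipfs config --json Datastore.StorageGCWatermark 90
    ipfs config Datastore.GCPeriod 1h

    # Automatic GC only runs when the daemon is started with it enabled
    ipfs daemon --enable-gc

    # Alternatively, trigger a one-off garbage collection manually
    ipfs repo gc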

Also, consider that Guardian 2.22 uses Kubo v0.22.0, but the next release will use Kubo 0.26 (and 0.27 is already out). It would be helpful if you could check the latest 0.27 and verify if this behavior happens again.

AlexIvanHoward commented 5 months ago

Thanks a lot, @Neurone.

anvabr commented 5 months ago

The Slack discussion with the Kubo maintainers (Stebalien and Jorropo) did not come to much; main points:

mattsmithies commented 4 months ago

I can confirm in this thread that we've actively suffered through this issue. Is it possible to skip the building of Kubo in Docker Compose if local isn't set?

AlexIvanHoward commented 4 months ago

@mattsmithies I took @Neurone's advice and modified the Guardian's docker-compose.yml file to skip the building of the IPFS node and also to ignore any declarations of dependency on that node (I literally just commented out everything related to the IPFS node). It's working for now :) but it's obviously only a temporary solution.

(Screenshots of the modified docker-compose.yml are attached.)
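If you prefer not to edit docker-compose.yml, a possible alternative (assuming the Kubo service is named ipfs-node in your compose file - check yours - and that no remaining service still declares depends_on for it) is:

    # Bring the stack up without starting the IPFS node container
    docker compose up -d --scale ipfs-node=0

    # Or stop the container if the stack is already running
    docker compose stop ipfs-node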

anvabr commented 4 months ago

I think I misunderstood the problem - I assumed you wanted to run a local node. In that case the Docker container can be safely shut down. I use Docker Desktop, where this is done via the UI:

(Screenshot of the Docker Desktop UI attached.)

anvabr commented 4 months ago

We are continuing to investigate the problem further, since the reported behaviour is clearly wrong. We have not been able to reproduce the problem locally so far. If you are currently observing the problem, please contact us so we can examine your system.