Kubo OOMs and locks up our servers when pinning a large number of CIDs

Mayeu commented 1 year ago

Checklist

[X] This is a bug report, not a question. Ask questions on discuss.ipfs.io.
[X] I have searched on the issue tracker for my bug.
[X] I am running the latest kubo version or have an issue updating.

Installation method

built from source

Version

Kubo version: 0.21.0
Repo version: 14
System version: amd64/linux
Golang version: go1.20.5

Config

{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
    "Announce": [],
    "AppendAnnounce": [],
    "Gateway": "/ip4/127.0.0.1/tcp/8080",
    "NoAnnounce": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic",
      "/ip4/0.0.0.0/udp/4001/quic-v1",
      "/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
      "/ip6/::/udp/4001/quic",
      "/ip6/::/udp/4001/quic-v1",
      "/ip6/::/udp/4001/quic-v1/webtransport"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb"
  ],
  "DNS": {
    "Resolvers": {}
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/3",
            "sync": false,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "20TB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false
    }
  },
  "Experimental": {
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "OptimisticProvide": false,
    "OptimisticProvideJobsPoolSize": 0,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "DeserializedResponses": null,
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range",
        "User-Agent"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": ""
  },
  "Identity": {
    "PeerID": "xxxxxx"
  },
  "Internal": {
    "Bitswap": {
      "EngineBlockstoreWorkerCount": 128,
      "EngineTaskWorkerCount": 8,
      "MaxOutstandingBytesPerPeer": null,
      "ProviderSearchDelay": null,
      "TaskWorkerCount": 8
    }
  },
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Migration": {
    "DownloadSources": [],
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": [
      {
        "Addrs": [
          "/dnsaddr/node-1.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcFf2FH3CEgTNHeMRGhN7HNHU1EXAxoEk6EFuSyXCsvRE"
      },
      {
        "Addrs": [
          "/dnsaddr/node-2.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcFmLd5ySfk2WZuJ1mfSWLDjdmHZq7rSAua4GoeSQfs1z"
      },
      {
        "Addrs": [
          "/dnsaddr/node-3.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfFmzSDVbwexQ9Au2pt5YEXHK5xajwgaU6PpkbLWerMa"
      },
      {
        "Addrs": [
          "/dnsaddr/node-4.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfJeB3Js1FG7T8YaZATEiaHqNKVdQfybYYkbT1knUswx"
      },
      {
        "Addrs": [
          "/dnsaddr/node-5.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfVvzK4tMdFmpJjEKDUoqRgP4W9FnmJoziYX5GXJJ8eZ"
      },
      {
        "Addrs": [
          "/dnsaddr/node-6.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfZD3VKrUxyP9BbyUnZDpbqDnT7cQ4WjPP8TRLXaoE7G"
      },
      {
        "Addrs": [
          "/dnsaddr/node-7.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfZP2LuW4jxviTeG8fi28qjnZScACb8PEgHAc17ZEri3"
      },
      {
        "Addrs": [
          "/dnsaddr/node-8.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfgsJsMtx6qJb74akCw1M24X1zFwgGo11h1cuhwQjtJP"
      },
      {
        "Addrs": [
          "/dnsaddr/node-9.ingress.cloudflare-ipfs.com"
        ],
        "ID": "Qmcfr2FC7pFzJbTSDfYaSy1J8Uuy8ccGLeLyqJCKJvTHMi"
      },
      {
        "Addrs": [
          "/dnsaddr/node-10.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfR3V5YAtHBzxVACWCzXTt26SyEkxdwhGJ6875A8BuWx"
      },
      {
        "Addrs": [
          "/dnsaddr/node-11.ingress.cloudflare-ipfs.com"
        ],
        "ID": "Qmcfuo1TM9uUiJp6dTbm915Rf1aTqm3a3dnmCdDQLHgvL5"
      },
      {
        "Addrs": [
          "/dnsaddr/node-12.ingress.cloudflare-ipfs.com"
        ],
        "ID": "QmcfV2sg9zaq7UUHVCGuSvT2M2rnLBAPsiE79vVyK3Cuev"
      },
      {
        "Addrs": [
          "/dnsaddr/ipfs.ssi.eecc.de"
        ],
        "ID": "12D3KooWGaHbxpDWn4JVYud899Wcpa4iHPa3AMYydfxQDb3MhDME"
      },
      {
        "Addrs": [
          "/ip4/139.178.68.217/tcp/6744"
        ],
        "ID": "12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw"
      },
      {
        "Addrs": [
          "/ip4/147.75.49.71/tcp/6745"
        ],
        "ID": "12D3KooWGBWx9gyUFTVQcKMTenQMSyE2ad9m7c9fpjS4NMjoDien"
      },
      {
        "Addrs": [
          "/ip4/147.75.86.255/tcp/6745"
        ],
        "ID": "12D3KooWFrnuj5o3tx4fGD2ZVJRyDqTdzGnU3XYXmBbWbc8Hs8Nd"
      },
      {
        "Addrs": [
          "/ip4/3.134.223.177/tcp/6745"
        ],
        "ID": "12D3KooWN8vAoGd6eurUSidcpLYguQiGZwt4eVgDvbgaS7kiGTup"
      },
      {
        "Addrs": [
          "/ip4/35.74.45.12/udp/6746/quic"
        ],
        "ID": "12D3KooWLV128pddyvoG6NBvoZw7sSrgpMTPtjnpu3mSmENqhtL7"
      },
      {
        "Addrs": [
          "/dns4/elastic.dag.house/tcp/443/wss/p2p/QmQzqxhK82kAmKvARFZSkUVS6fo9sySaiogAnx5EnZ6ZmC"
        ],
        "ID": "QmQzqxhK82kAmKvARFZSkUVS6fo9sySaiogAnx5EnZ6ZmC"
      }
    ]
  },
  "Pinning": {
    "RemoteServices": {}
  },
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {
    "Interval": "0s",
    "Strategy": "roots"
  },
  "Routing": {
    "AcceleratedDHTClient": true,
    "Methods": null,
    "Routers": null,
    "Type": "autoclient"
  },
  "Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 2048,
      "LowWater": 128
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "RelayClient": {
      "Enabled": false
    },
    "RelayService": {
      "Enabled": false
    },
    "ResourceMgr": {
      "Limits": {},
      "MaxMemory": "16GB"
    },
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Description

Hello,

Over the past month, we have been slowly pinning millions of CIDs with a 2-server ipfs-cluster. We are currently at around 9M pinned CIDs out of a total of 13.5M. Kubo has regularly been killed by the system for consuming all the memory, and from time to time it even completely locks up our servers, which then require a hard reboot.

We were waiting for the 0.21.0 release to open this ticket since we thought that the release would reduce RAM consumption, but in the past 24h both our servers have locked up again.

Both servers have the following spec:

- AMD Ryzen 5 Pro 3600 - 6c/12t - 3.6 GHz/4.2 GHz
- 128 GB ECC 2666 MHz
- 2×512 GB SSD NVMe, one with the system (NixOS), the other currently unused
- one ZFS ZRAID-0 pool with 4×6 TB HDD SATA

We have tried many different configurations, including disabling bandwidth metrics, not running as a DHT server, and enabling or disabling the Accelerated DHT client, but whatever configuration we tried, Kubo always ends up consuming all available memory.

We are currently running Kubo with GOGC=50 and GOMEMLIMIT=80GiB.

Here are two ipfs diag profiles taken today:

In case it's relevant: for ipfs-cluster we followed the setup guide in the documentation, and we keep around 50k pins in the queue.

ipfs-cluster configuration

```
{
  "cluster": {
    "peername": "ipfs-cluster-1",
    "secret": "…",
    "leave_on_shutdown": false,
    "listen_multiaddress": [ "/ip4/0.0.0.0/tcp/9096", "/ip4/0.0.0.0/udp/9096/quic" ],
    "enable_relay_hop": true,
    "connection_manager": { "high_water": 400, "low_water": 100, "grace_period": "2m0s" },
    "dial_peer_timeout": "3s",
    "state_sync_interval": "6h",
    "pin_recover_interval": "6h",
    "replication_factor_min": -1,
    "replication_factor_max": -1,
    "monitor_ping_interval": "15s",
    "peer_watch_interval": "5s",
    "mdns_interval": "10s",
    "pin_only_on_trusted_peers": false,
    "disable_repinning": true,
    "peer_addresses": []
  },
  "consensus": {
    "crdt": {
      "cluster_name": "ipfs-cluster",
      "trusted_peers": [ "*" ],
      "batching": { "max_batch_size": 500, "max_batch_age": "15s", "max_queue_size": 50000 },
      "repair_interval": "1h0m0s",
      "rebroadcast_interval": "10s"
    }
  },
  "api": {
    "ipfsproxy": {
      "listen_multiaddress": "/ip4/127.0.0.1/tcp/9095",
      "node_multiaddress": "/ip4/127.0.0.1/tcp/5001",
      "log_file": "",
      "read_timeout": "0s",
      "read_header_timeout": "5s",
      "write_timeout": "0s",
      "idle_timeout": "1m0s",
      "max_header_bytes": 4096
    },
    "pinsvcapi": {
      "http_listen_multiaddress": "/ip4/127.0.0.1/tcp/9097",
      "read_timeout": "0s",
      "read_header_timeout": "5s",
      "write_timeout": "0s",
      "idle_timeout": "2m0s",
      "max_header_bytes": 4096,
      "basic_auth_credentials": null,
      "http_log_file": "",
      "headers": {},
      "cors_allowed_origins": [ "*" ],
      "cors_allowed_methods": [ "GET" ],
      "cors_allowed_headers": [],
      "cors_exposed_headers": [ "Content-Type", "X-Stream-Output", "X-Chunked-Output", "X-Content-Length" ],
      "cors_allow_credentials": true,
      "cors_max_age": "0s"
    },
    "restapi": {
      "http_listen_multiaddress": "/ip4/127.0.0.1/tcp/9094",
      "read_timeout": "0s",
      "read_header_timeout": "5s",
      "write_timeout": "0s",
      "idle_timeout": "2m0s",
      "max_header_bytes": 4096,
      "basic_auth_credentials": null,
      "http_log_file": "",
      "headers": {},
      "cors_allowed_origins": [ "*" ],
      "cors_allowed_methods": [ "GET" ],
      "cors_allowed_headers": [],
      "cors_exposed_headers": [ "Content-Type", "X-Stream-Output", "X-Chunked-Output", "X-Content-Length" ],
      "cors_allow_credentials": true,
      "cors_max_age": "0s"
    }
  },
  "ipfs_connector": {
    "ipfshttp": {
      "node_multiaddress": "/ip4/127.0.0.1/tcp/5001",
      "connect_swarms_delay": "30s",
      "ipfs_request_timeout": "10m",
      "pin_timeout": "20s",
      "unpin_timeout": "3h0m0s",
      "repogc_timeout": "24h0m0s",
      "informer_trigger_interval": 0
    }
  },
  "pin_tracker": {
    "stateless": { "concurrent_pins": 20, "priority_pin_max_age": "24h0m0s", "priority_pin_max_retries": 5 },
    "concurrent_pins": 20
  },
  "monitor": {
    "pubsubmon": { "check_interval": "15s" }
  },
  "allocator": {
    "balanced": { "allocate_by": [ "tag:group", "freespace" ] }
  },
  "informer": {
    "disk": { "metric_ttl": "30s", "metric_type": "freespace" },
    "pinqueue": { "metric_ttl": "30s", "weight_bucket_size": 100000 },
    "tags": { "metric_ttl": "30s", "tags": { "group": "default" } }
  },
  "observations": {
    "metrics": { "enable_stats": true, "prometheus_endpoint": "/ip4/127.0.0.1/tcp/8888", "reporting_interval": "5s" },
    "tracing": { "enable_tracing": false, "jaeger_agent_endpoint": "/ip4/0.0.0.0/udp/6831", "sampling_prob": 0.3, "service_name": "cluster-daemon" }
  },
  "datastore": {
    "pebble": {
      "pebble_options": {
        "cache_size_bytes": 1073741824,
        "bytes_per_sync": 1048576,
        "disable_wal": false,
        "flush_delay_delete_range": 0,
        "flush_delay_range_key": 0,
        "flush_split_bytes": 4194304,
        "format_major_version": 1,
        "l0_compaction_file_threshold": 750,
        "l0_compaction_threshold": 4,
        "l0_stop_writes_threshold": 12,
        "l_base_max_bytes": 134217728,
        "max_open_files": 1000,
        "mem_table_size": 67108864,
        "mem_table_stop_writes_threshold": 20,
        "read_only": false,
        "wal_bytes_per_sync": 0,
        "levels": [
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 4194304 },
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 8388608 },
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 16777216 },
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 33554432 },
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 67108864 },
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 134217728 },
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 268435456 }
        ]
      }
    }
  },
  "metrics": { "enable_stats": true, "prometheus_endpoint": "/ip4/127.0.0.1/tcp/8888", "reporting_interval": "5s" }
}
```

Jorropo commented 1 year ago

Cc @marten-seemann could you please take a look at those profiles?

[Screenshot: Screenshot_2023-07-05-18-12-48-000_org.mozilla.firefox.jpg]

marten-seemann commented 1 year ago

Mayeu commented 1 year ago

Thank you for the feedback, we have relaunched both nodes without anything using QUIC for now.

Jorropo commented 1 year ago

@Mayeu how did you do that exactly? Swarm.Transports.Network.QUIC? Did it work?

Mayeu commented 1 year ago

@Jorropo I deactivated both Swarm.Transports.Network.QUIC & Swarm.Transports.Network.WebTransport (since the doc says it uses QUIC).

> Did it work?

Hard to tell for now. RAM growth seems pretty similar to the previous configuration, where Kubo was killed after 9-11h of uptime.

The left part is our previous run, up to this morning when the server stopped answering. The right part is since we deactivated QUIC.

Here is a profile taken right now.

We have definitely seen a drop in pins/s since this morning; maybe some of our peers were only using QUIC.

Mayeu commented 1 year ago

A note on the profile in my last comment: there is apparently still memory allocated to QUIC. This may be related to #9895.

Jorropo commented 1 year ago

> I deactivated both Swarm.Transports.Network.QUIC & Swarm.Transports.Network.WebTransport (since the doc says it uses QUIC).

Yes thx, that is good. I wanted to be sure you didn't just remove the QUIC multiaddresses.
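
For readers following along, the change confirmed above would correspond to a Kubo config fragment roughly like the one below. This is a sketch that assumes both keys accept a plain boolean, as described in the Kubo configuration docs for Swarm.Transports.Network; it is not copied from Mayeu's actual config.

```json
{
  "Swarm": {
    "Transports": {
      "Network": {
        "QUIC": false,
        "WebTransport": false
      }
    }
  }
}
```

By contrast, only removing the /quic, /quic-v1 and /webtransport entries from Addresses.Swarm changes which addresses the node listens on, and would presumably leave the QUIC transport itself enabled.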

SmaugPool commented 9 months ago

I can confirm. I'm syncing 64GB RAM nodes with a few million pins and I have to restart Kubo every 12 hours to keep the OOM killer from killing it:

And I use only 4 concurrent pins on ipfs-cluster and conservative Kubo settings:

  "Internal": {
    "Bitswap": {
      "EngineBlockstoreWorkerCount": 16,
      "EngineTaskWorkerCount": 8,
      "MaxOutstandingBytesPerPeer": 1048576,
      "ProviderSearchDelay": null,
      "TaskWorkerCount": 8
    }
  },
  "Swarm": {
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 128,
      "LowWater": 64,
      "Type": "basic"
    },
    "ResourceMgr": {
      "Enabled": true,
      "MaxMemory": "8 GB"
    }
  }
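
The ipfs-cluster side of this setup is not shown. As a sketch only, assuming the same pin_tracker layout as the ipfs-cluster configuration posted earlier in this thread, limiting the cluster to 4 concurrent pins would look roughly like this:

```json
{
  "pin_tracker": {
    "stateless": {
      "concurrent_pins": 4
    }
  }
}
```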

I tried disabling QUIC, but then I lose 99% of connections, so everything is way slower and it is hard to tell whether there is still a memory leak. Memory did seem to keep slowly increasing, though.

Memory stopped increasing as soon as the pin queue became empty:

lidel commented 7 months ago

Triage notes:

- @SmaugPool are you still running with QUIC disabled?
- Are you able to retry with the latest Kubo and produce the same two ipfs diag profiles as before?
  - Around 2h before the server starts to lock up
  - While the server is starting to stop responding
