Open Mayeu opened 1 year ago
Cc @marten-seemann could you please take a look at those profiles ?
Maybe related: https://github.com/quic-go/quic-go/issues/3883.
Thank you for the feedback, we have relaunched both nodes without anything using QUIC for now.
@Mayeu how did you do that exactly? Swarm.Transports.Network.QUIC? Did it work?
@Jorropo I deactivated both Swarm.Transports.Network.QUIC & Swarm.Transports.Network.WebTransport (since the doc says it uses QUIC).
Did it work?
Hard to tell for now. RAM growth seems pretty similar to what we saw with the previous configuration, where kubo was killed after 9-11h of uptime.
The left part is our previous run, up to this morning when the server stopped answering. The right part is since we deactivated QUIC.
Here is a profile taken right now.
We definitely have seen a drop in pin/s since this morning, maybe some of our peers were only using QUIC.
A note on the profile in my last comment: there is apparently still memory allocated to QUIC, which may relate to #9895.
I deactivated both Swarm.Transports.Network.QUIC & Swarm.Transports.Network.WebTransport (since the doc says it uses QUIC).
Yes thx, that is good, I wanted to be sure you didn't just remove the quic multiaddresses.
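The distinction made above (switching the transports off in the config, as opposed to only dropping the QUIC listen multiaddresses) can be sketched roughly as follows. This is a minimal illustration and not Kubo's own tooling: the helper name `disable_quic_transports` and the sample config are hypothetical, and only the two `Swarm.Transports.Network` keys come from this thread.

```python
import copy
import json


def disable_quic_transports(config: dict) -> dict:
    """Return a copy of a Kubo-style config dict with QUIC-based transports off.

    Hypothetical helper for illustration. It only touches the two flags
    named in this thread and leaves the listen addresses alone.
    """
    updated = copy.deepcopy(config)  # do not mutate the caller's dict
    network = (
        updated.setdefault("Swarm", {})
        .setdefault("Transports", {})
        .setdefault("Network", {})
    )
    network["QUIC"] = False
    network["WebTransport"] = False
    return updated


if __name__ == "__main__":
    # Made-up sample config; the multiaddresses are illustrative only.
    sample = {
        "Addresses": {
            "Swarm": ["/ip4/0.0.0.0/tcp/4001", "/ip4/0.0.0.0/udp/4001/quic"]
        },
        "Swarm": {"Transports": {"Network": {}}},
    }
    print(json.dumps(disable_quic_transports(sample), indent=2))
```

Note that the sketch deliberately leaves `Addresses.Swarm` untouched: removing a listen address and disabling a transport are two separate changes, which is the point of the question above.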
I can confirm. I'm syncing 64GB RAM nodes with a few million pins and I have to restart kubo every 12 hours to keep it from being OOM-killed:
And I use only 4 concurrent pins on ipfs-cluster and conservative Kubo settings:
"Internal": {
  "Bitswap": {
    "EngineBlockstoreWorkerCount": 16,
    "EngineTaskWorkerCount": 8,
    "MaxOutstandingBytesPerPeer": 1048576,
    "ProviderSearchDelay": null,
    "TaskWorkerCount": 8
  }
},
"Swarm": {
  "ConnMgr": {
    "GracePeriod": "20s",
    "HighWater": 128,
    "LowWater": 64,
    "Type": "basic"
  },
  "ResourceMgr": {
    "Enabled": true,
    "MaxMemory": "8 GB"
  }
}
I tried disabling QUIC, but then I lose 99% of connections, so everything is way slower and it is hard to tell whether there is still a memory leak. Memory did seem to keep increasing slowly, though.
Memory stopped increasing as soon as the pin queue became empty:
Triage notes:
ipfs diag profile as before?
Checklist
Installation method
built from source
Version
Config
Description
Hello,
In the past month, we have been slowly pinning millions of CIDs with a 2-server ipfs cluster. Currently we are at around 9M pinned CIDs out of a total of 13.5M. Kubo has regularly been killed by the system for consuming all the memory, and from time to time it even completely locks up our servers and requires a hard reboot.
We were waiting for the 0.21.0 release to open this ticket since we thought that the release would reduce RAM consumption, but in the past 24h both our servers have locked up again.
Both servers have the following spec:
We have tried a lot of different configurations, including disabling bandwidth metrics, not running as a DHT server, and activating or deactivating the Accelerated DHT client, but whatever configuration we tried, kubo always ends up consuming all available memory.
We are currently running Kubo with GOGC=50 and GOMEMLIMIT=80GiB.
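To make these two settings concrete, here is a small sketch of the arithmetic behind them. It is a simplification under stated assumptions: the helper names are made up, and Go's real heap target also accounts for GC roots such as goroutine stacks and globals, which are ignored here. Keep in mind that GOMEMLIMIT is a soft limit: the runtime collects more aggressively as it approaches the limit, but it cannot keep the process under it if live memory keeps growing.

```python
# Unit suffixes accepted by GOMEMLIMIT, mapped to their size in bytes.
_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def parse_memlimit(value: str) -> int:
    """Convert a GOMEMLIMIT-style string such as '80GiB' into bytes."""
    # Try the longer suffixes first so "GiB" is not mistaken for plain "B".
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * _UNITS[suffix]
    return int(value)  # a bare number is already in bytes


def heap_goal(live_heap_bytes: int, gogc: int) -> int:
    """Approximate heap size at which the next GC cycle would start.

    Simplified model: live heap plus GOGC percent of the live heap.
    """
    return live_heap_bytes + live_heap_bytes * gogc // 100


if __name__ == "__main__":
    limit = parse_memlimit("80GiB")
    # Hypothetical live heap of 40 GiB, chosen only for illustration.
    goal = heap_goal(40 * 2**30, 50)
    print(f"soft limit: {limit} bytes, next GC at roughly {goal} bytes")
```

With GOGC=50 the collector runs about twice as often as with the default of 100, trading CPU for a smaller heap. Neither setting helps if the memory is still reachable, which is what a leak looks like to the collector.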
Here are two ipfs diag profile outputs taken today:
In case it's relevant, for the ipfs-cluster we followed the setup guide in the documentation, and we are keeping around 50k pins in the queue.
ipfs-cluster configuration
```
{
  "cluster": {
    "peername": "ipfs-cluster-1",
    "secret": "…",
    "leave_on_shutdown": false,
    "listen_multiaddress": [
      "/ip4/0.0.0.0/tcp/9096",
      "/ip4/0.0.0.0/udp/9096/quic"
    ],
    "enable_relay_hop": true,
    "connection_manager": {
      "high_water": 400,
      "low_water": 100,
      "grace_period": "2m0s"
    },
    "dial_peer_timeout": "3s",
    "state_sync_interval": "6h",
    "pin_recover_interval": "6h",
    "replication_factor_min": -1,
    "replication_factor_max": -1,
    "monitor_ping_interval": "15s",
    "peer_watch_interval": "5s",
    "mdns_interval": "10s",
    "pin_only_on_trusted_peers": false,
    "disable_repinning": true,
    "peer_addresses": []
  },
  "consensus": {
    "crdt": {
      "cluster_name": "ipfs-cluster",
      "trusted_peers": [ "*" ],
      "batching": {
        "max_batch_size": 500,
        "max_batch_age": "15s",
        "max_queue_size": 50000
      },
      "repair_interval": "1h0m0s",
      "rebroadcast_interval": "10s"
    }
  },
  "api": {
    "ipfsproxy": {
      "listen_multiaddress": "/ip4/127.0.0.1/tcp/9095",
      "node_multiaddress": "/ip4/127.0.0.1/tcp/5001",
      "log_file": "",
      "read_timeout": "0s",
      "read_header_timeout": "5s",
      "write_timeout": "0s",
      "idle_timeout": "1m0s",
      "max_header_bytes": 4096
    },
    "pinsvcapi": {
      "http_listen_multiaddress": "/ip4/127.0.0.1/tcp/9097",
      "read_timeout": "0s",
      "read_header_timeout": "5s",
      "write_timeout": "0s",
      "idle_timeout": "2m0s",
      "max_header_bytes": 4096,
      "basic_auth_credentials": null,
      "http_log_file": "",
      "headers": {},
      "cors_allowed_origins": [ "*" ],
      "cors_allowed_methods": [ "GET" ],
      "cors_allowed_headers": [],
      "cors_exposed_headers": [ "Content-Type", "X-Stream-Output", "X-Chunked-Output", "X-Content-Length" ],
      "cors_allow_credentials": true,
      "cors_max_age": "0s"
    },
    "restapi": {
      "http_listen_multiaddress": "/ip4/127.0.0.1/tcp/9094",
      "read_timeout": "0s",
      "read_header_timeout": "5s",
      "write_timeout": "0s",
      "idle_timeout": "2m0s",
      "max_header_bytes": 4096,
      "basic_auth_credentials": null,
      "http_log_file": "",
      "headers": {},
      "cors_allowed_origins": [ "*" ],
      "cors_allowed_methods": [ "GET" ],
      "cors_allowed_headers": [],
      "cors_exposed_headers": [ "Content-Type", "X-Stream-Output", "X-Chunked-Output", "X-Content-Length" ],
      "cors_allow_credentials": true,
      "cors_max_age": "0s"
    }
  },
  "ipfs_connector": {
    "ipfshttp": {
      "node_multiaddress": "/ip4/127.0.0.1/tcp/5001",
      "connect_swarms_delay": "30s",
      "ipfs_request_timeout": "10m",
      "pin_timeout": "20s",
      "unpin_timeout": "3h0m0s",
      "repogc_timeout": "24h0m0s",
      "informer_trigger_interval": 0
    }
  },
  "pin_tracker": {
    "stateless": {
      "concurrent_pins": 20,
      "priority_pin_max_age": "24h0m0s",
      "priority_pin_max_retries": 5
    },
    "concurrent_pins": 20
  },
  "monitor": {
    "pubsubmon": {
      "check_interval": "15s"
    }
  },
  "allocator": {
    "balanced": {
      "allocate_by": [ "tag:group", "freespace" ]
    }
  },
  "informer": {
    "disk": {
      "metric_ttl": "30s",
      "metric_type": "freespace"
    },
    "pinqueue": {
      "metric_ttl": "30s",
      "weight_bucket_size": 100000
    },
    "tags": {
      "metric_ttl": "30s",
      "tags": {
        "group": "default"
      }
    }
  },
  "observations": {
    "metrics": {
      "enable_stats": true,
      "prometheus_endpoint": "/ip4/127.0.0.1/tcp/8888",
      "reporting_interval": "5s"
    },
    "tracing": {
      "enable_tracing": false,
      "jaeger_agent_endpoint": "/ip4/0.0.0.0/udp/6831",
      "sampling_prob": 0.3,
      "service_name": "cluster-daemon"
    }
  },
  "datastore": {
    "pebble": {
      "pebble_options": {
        "cache_size_bytes": 1073741824,
        "bytes_per_sync": 1048576,
        "disable_wal": false,
        "flush_delay_delete_range": 0,
        "flush_delay_range_key": 0,
        "flush_split_bytes": 4194304,
        "format_major_version": 1,
        "l0_compaction_file_threshold": 750,
        "l0_compaction_threshold": 4,
        "l0_stop_writes_threshold": 12,
        "l_base_max_bytes": 134217728,
        "max_open_files": 1000,
        "mem_table_size": 67108864,
        "mem_table_stop_writes_threshold": 20,
        "read_only": false,
        "wal_bytes_per_sync": 0,
        "levels": [
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 4194304 },
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 8388608 },
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 16777216 },
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 33554432 },
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 67108864 },
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 134217728 },
          { "block_restart_interval": 16, "block_size": 4096, "block_size_threshold": 90, "compression": 2, "filter_type": 0, "filter_policy": 10, "index_block_size": 4096, "target_file_size": 268435456 }
        ]
      }
    }
  },
  "metrics": {
    "enable_stats": true,
    "prometheus_endpoint": "/ip4/127.0.0.1/tcp/8888",
    "reporting_interval": "5s"
  }
}
```