Closed teslashibe closed 2 months ago
@restevens402 @mudler @Luka-Loncar add this to no-status. Found a bug here. The default action should be to increase the CPU and RAM on the AWS node before software optimizations.
@5u6r054 to assist on increasing machine specs on Monday 👍
Bug: Resource Limit Exceeded When Opening Outbound Stream
Problem Description
When attempting to open an outbound stream to a remote worker, we're encountering a resource limit error. This is preventing successful communication with worker nodes and impacting the network's ability to distribute tasks.
This error was observed on our `gold-1` node in AWS and was resolved with a node restart.

Error message:
logs
Current Resource Management
Our Oracle Node currently uses libp2p's default resource management configuration with auto-scaling:
Implementation here (line 111): https://github.com/masa-finance/masa-oracle/blob/main/pkg/oracle_node.go

This is using libp2p's `limit.go` and `limit_defaults.go` from the `github.com/libp2p/go-libp2p/p2p/host/resource-manager` package.

`limit_defaults.go`: This line is using the `DefaultLimits` variable defined in `limit_defaults.go`. It's a `ScalingLimitConfig` that provides default values for various resource limits.

`limit_defaults.go`: This calls the `AutoScale()` method of `ScalingLimitConfig`, which in turn calls the `Scale()` method with automatically determined memory and file descriptor values. The `Scale()` method applies the scaling logic to create a `ConcreteLimitConfig`.

`limit.go`: This uses the `NewFixedLimiter` function defined in `limit.go` to create a new `fixedLimiter` with the concrete limits.

`limit.go`: This creates a new `ResourceManager` using the `fixedLimiter` we just created.

The current configuration is using the default scaling limits provided by libp2p, which are then auto-scaled based on the system's available resources.
- `DefaultLimits` provides a base configuration for various resource limits (connections, streams, memory, etc.) and how they should scale with available resources.
- `AutoScale()` determines the available system resources (memory and file descriptors) and scales the limits accordingly.
- The resulting `ConcreteLimitConfig` is used to create a `fixedLimiter`, which enforces these limits.
- The `ResourceManager` is created with this limiter, which will then enforce these limits throughout the libp2p node's lifecycle.

We might want to:
- Adjust `DefaultLimits` before calling `AutoScale()`.
- Call `Scale()` with explicit memory and file descriptor values instead of `AutoScale()`.
- Build a `ConcreteLimitConfig` with values tailored to our application's needs.

For example, if we want to increase the number of inbound connections allowed, we could do something like:
This would allow for more inbound connections in both the base case and as the system scales up.
This setup allows for some automatic scaling of resources based on the system's capabilities, but it doesn't include any custom limits or fine-tuning for our specific needs.
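To make the "base case" versus "scales up" distinction concrete, here is a minimal, stdlib-only sketch of how a scaling limit turns into a concrete number. It is a simplified model, not libp2p's actual code: the `scalingLimit` type, the `concrete` method, and all numbers are hypothetical, and the real `Scale()` logic has additional rules (memory thresholds, file descriptor fractions) that are left out here. The assumption is that each limit has a fixed base plus an increase added per GiB of memory the node may use.

```go
package main

import "fmt"

// scalingLimit is a hypothetical, simplified model of one scalable limit
// (for example inbound connections): a fixed base plus an increase that is
// added for every GiB of memory the node is allowed to use.
type scalingLimit struct {
	Base           int // applied even on the smallest machine
	IncreasePerGiB int // added once per whole GiB of available memory
}

// concrete resolves the scaling limit to a single number for a given amount
// of memory. Real libp2p scaling applies extra rules that this sketch omits.
func (l scalingLimit) concrete(memoryBytes int64) int {
	gib := int(memoryBytes >> 30) // whole GiB only
	return l.Base + l.IncreasePerGiB*gib
}

func main() {
	const gib = int64(1) << 30

	// Illustrative numbers only; these are NOT libp2p's defaults.
	before := scalingLimit{Base: 64, IncreasePerGiB: 64}
	after := scalingLimit{Base: 128, IncreasePerGiB: 128}

	for _, mem := range []int64{1 * gib, 4 * gib, 16 * gib} {
		fmt.Printf("%2d GiB: inbound conns %4d -> %4d\n",
			mem/gib, before.concrete(mem), after.concrete(mem))
	}
}
```

Under this model, raising only the base helps small machines but leaves the growth rate unchanged, while raising only the increase does nothing for a node with little memory. That is why an adjustment like the one described above would touch both values. It also suggests why bumping RAM on the AWS node raises the effective limits even with no code change.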
Impact
Suggested Solutions
Custom Resource Limits: Implement custom resource limits tailored to our network's needs. For example:
Dynamic Limiter: Replace the fixed limiter with a scaling limiter for more adaptive resource allocation:
Resource Usage Monitoring: Implement a monitoring system to track resource usage and adjust limits dynamically.
Increase Cloud Infrastructure Limits: Review and potentially increase the limits on our cloud infrastructure, particularly for network-related resources.
Connection Management: Implement more aggressive connection management, including closing idle connections and limiting the maximum number of concurrent connections.
Backoff and Retry Mechanism: Implement a backoff and retry mechanism when encountering resource limit errors to prevent immediate resource exhaustion.
Next Steps