bug(protocol): resource limit exceeded when opening outbound stream

Bug: Resource Limit Exceeded When Opening Outbound Stream

Problem Description

When attempting to open an outbound stream to a remote worker, we're encountering a resource limit error. This is preventing successful communication with worker nodes and impacting the network's ability to distribute tasks.

This error was observed on our gold-1 node in AWS and was resolved with a node restart.

Error message:

error sending work to worker: : error opening stream: failed to open stream: stream-57946: transient: cannot reserve outbound stream: resource limit exceeded

Logs:

time="2024-08-24T19:24:14Z" level=info msg="WorkerType is related to Twitter"
time="2024-08-24T19:24:14Z" level=info msg="Checking connections to eligible workers"
time="2024-08-24T19:24:14Z" level=info msg="Worker selection took 0 milliseconds"
time="2024-08-24T19:24:14Z" level=info msg="Starting round-robin worker selection"
time="2024-08-24T19:24:14Z" level=info msg="Attempting remote worker 16Uiu2HAm9GNsYsuXkLGM8r7bzoh2nFxkocvuUi9oFeaMUqdVPTB4 (attempt 1/10)"
time="2024-08-24T19:24:14Z" level=error msg="error sending work to worker: : error opening stream: failed to open stream: stream-57946: transient: cannot reserve outbound stream: resource limit exceeded"
time="2024-08-24T19:24:14Z" level=info msg="Remote worker 16Uiu2HAm9GNsYsuXkLGM8r7bzoh2nFxkocvuUi9oFeaMUqdVPTB4 failed, moving to next worker"
time="2024-08-24T19:24:32Z" level=info msg="Node left: /ip4/47.157.92.220/udp/1028/quic-v1/p2p/16Uiu2HAm9GNsYsuXkLGM8r7bzoh2nFxkocvuUi9oFeaMUqdVPTB4"
time="2024-08-24T19:24:32Z" level=info msg="[+] Staked node joined: /ip4/47.157.92.220/udp/4001/quic-v1/p2p/16Uiu2HAm9GNsYsuXkLGM8r7bzoh2nFxkocvuUi9oFeaMUqdVPTB4"

Current Resource Management

Our Oracle Node currently uses libp2p's default resource management configuration with auto-scaling:

Implementation here (line 111): https://github.com/masa-finance/masa-oracle/blob/main/pkg/oracle_node.go

scalingLimits := rcmgr.DefaultLimits
concreteLimits := scalingLimits.AutoScale()
limiter := rcmgr.NewFixedLimiter(concreteLimits)
resourceManager, err := rcmgr.NewResourceManager(limiter)

This uses limit.go, limit_defaults.go, and rcmgr.go from libp2p's github.com/libp2p/go-libp2p/p2p/host/resource-manager package.

From limit_defaults.go:

scalingLimits := rcmgr.DefaultLimits

This line is using the DefaultLimits variable defined in limit_defaults.go. It's a ScalingLimitConfig that provides default values for various resource limits.

Still from limit_defaults.go:

concreteLimits := scalingLimits.AutoScale()

This calls the AutoScale() method of ScalingLimitConfig, which in turn calls the Scale() method with automatically determined memory and file descriptor values. The Scale() method applies the scaling logic to create a ConcreteLimitConfig.

From limit.go:

limiter := rcmgr.NewFixedLimiter(concreteLimits)

This uses the NewFixedLimiter function defined in limit.go to create a new fixedLimiter with the concrete limits.

From rcmgr.go:

resourceManager, err := rcmgr.NewResourceManager(limiter)

This creates a new ResourceManager using the fixedLimiter we just created.

The current configuration is using the default scaling limits provided by libp2p, which are then auto-scaled based on the system's available resources.

DefaultLimits provides a base configuration for various resource limits (connections, streams, memory, etc.) and how they should scale with available resources.
AutoScale() determines the available system resources (memory and file descriptors) and scales the limits accordingly.
The resulting ConcreteLimitConfig is used to create a fixedLimiter, which enforces these limits.
The ResourceManager is created with this limiter, which will then enforce these limits throughout the libp2p node's lifecycle.

We might want to:

Customize the DefaultLimits before calling AutoScale().
Implement our own scaling logic instead of using AutoScale().
Supply a different Limiter implementation instead of the fixed one (NewResourceManager accepts any value that satisfies the Limiter interface).
Directly configure a ConcreteLimitConfig with values tailored to our application's needs.

For example, if we want to increase the number of inbound connections allowed, we could do something like:

scalingLimits := rcmgr.DefaultLimits
scalingLimits.SystemBaseLimit.ConnsInbound = 128  // Double the default
scalingLimits.SystemLimitIncrease.ConnsInbound = 128  // Double the default increase
concreteLimits := scalingLimits.AutoScale()

This would allow for more inbound connections in both the base case and as the system scales up.
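
To make the effect of that change concrete, here is a simplified arithmetic model of how a base limit and its increase combine. It assumes each LimitIncrease value is added once per GiB of memory passed to the scaling step, which is how the rcmgr documentation describes it; the real logic in limit_defaults.go may differ in detail. The helper scaledLimit and the 4 GiB memory budget are hypothetical and exist only for illustration.

```go
package main

import "fmt"

const gib = int64(1) << 30

// scaledLimit is a hypothetical helper, not part of libp2p. It models a
// scaled limit as the base value plus the increase for every whole GiB
// of memory budget.
func scaledLimit(base, increasePerGiB int, memoryBytes int64) int {
	return base + increasePerGiB*int(memoryBytes/gib)
}

func main() {
	memory := 4 * gib // assumed memory budget, for illustration only

	// Defaults implied by the example above (128 is "double the default").
	fmt.Println("default:", scaledLimit(64, 64, memory)) // prints "default: 320"

	// Doubled base and doubled increase, as in the example above.
	fmt.Println("doubled:", scaledLimit(128, 128, memory)) // prints "doubled: 640"
}
```

Under this model, doubling both values doubles the resulting limit at every memory size, not only in the base case.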

This setup allows for some automatic scaling of resources based on the system's capabilities, but it doesn't include any custom limits or fine-tuning for our specific needs.
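
The error message above names the transient scope ("stream-57946: transient: cannot reserve outbound stream"), so the inbound-connection example may not touch the limit we actually hit. Below is a minimal sketch of the same approach aimed at transient outbound streams. The value 512 is illustrative, not a recommendation, and the field names are those of rcmgr's BaseLimit and BaseLimitIncrease structs as we understand them; they should be checked against the go-libp2p version pinned in our go.mod.

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
	// Start from a copy of the library defaults.
	scalingLimits := rcmgr.DefaultLimits

	// Illustrative values only: raise the transient-scope stream limits.
	// The total (Streams) is raised together with StreamsOutbound so the
	// total does not become the tighter limit.
	scalingLimits.TransientBaseLimit.StreamsOutbound = 512
	scalingLimits.TransientBaseLimit.Streams = 512
	scalingLimits.TransientLimitIncrease.StreamsOutbound = 512
	scalingLimits.TransientLimitIncrease.Streams = 512

	// Same construction steps as the current implementation.
	concreteLimits := scalingLimits.AutoScale()
	limiter := rcmgr.NewFixedLimiter(concreteLimits)
	resourceManager, err := rcmgr.NewResourceManager(limiter)
	if err != nil {
		panic(err)
	}

	// Hand the resource manager to the host at construction time.
	host, err := libp2p.New(libp2p.ResourceManager(resourceManager))
	if err != nil {
		panic(err)
	}
	defer host.Close()

	fmt.Println("host started with custom transient limits:", host.ID())
}
```

Because a node restart cleared the error, it is possible that streams are accumulating over time; if so, a higher limit would delay the failure but not remove it, which is why the usage analysis under Next Steps should come first.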

Impact

Failure to open streams to remote workers
Potential bottleneck in task distribution
Reduced network efficiency and responsiveness

Next Steps

Analyze current resource usage patterns
Implement and test custom resource limits
Set up monitoring for resource usage
Review and adjust cloud infrastructure limits
Implement connection management improvements
Develop and integrate a backoff and retry mechanism
