masa-finance / masa-oracle

Masa Oracle: Decentralized Data Protocol 🌐
https://developers.masa.ai/docs/masa-protocol/welcome
MIT License

bug(protocol): resource limit exceeded when opening outbound stream #528

Closed teslashibe closed 2 months ago

teslashibe commented 2 months ago

Bug: Resource Limit Exceeded When Opening Outbound Stream

Problem Description

When attempting to open an outbound stream to a remote worker, we're encountering a resource limit error. This is preventing successful communication with worker nodes and impacting the network's ability to distribute tasks.

This error was observed on our gold-1 node in AWS and was resolved with a node restart.

Error message:

error sending work to worker: : error opening stream: failed to open stream: stream-57946: transient: cannot reserve outbound stream: resource limit exceeded

Logs:

time="2024-08-24T19:24:14Z" level=info msg="WorkerType is related to Twitter"
time="2024-08-24T19:24:14Z" level=info msg="Checking connections to eligible workers"
time="2024-08-24T19:24:14Z" level=info msg="Worker selection took 0 milliseconds"
time="2024-08-24T19:24:14Z" level=info msg="Starting round-robin worker selection"
time="2024-08-24T19:24:14Z" level=info msg="Attempting remote worker 16Uiu2HAm9GNsYsuXkLGM8r7bzoh2nFxkocvuUi9oFeaMUqdVPTB4 (attempt 1/10)"
time="2024-08-24T19:24:14Z" level=error msg="error sending work to worker: : error opening stream: failed to open stream: stream-57946: transient: cannot reserve outbound stream: resource limit exceeded"
time="2024-08-24T19:24:14Z" level=info msg="Remote worker 16Uiu2HAm9GNsYsuXkLGM8r7bzoh2nFxkocvuUi9oFeaMUqdVPTB4 failed, moving to next worker"
time="2024-08-24T19:24:32Z" level=info msg="Node left: /ip4/47.157.92.220/udp/1028/quic-v1/p2p/16Uiu2HAm9GNsYsuXkLGM8r7bzoh2nFxkocvuUi9oFeaMUqdVPTB4"
time="2024-08-24T19:24:32Z" level=info msg="[+] Staked node joined: /ip4/47.157.92.220/udp/4001/quic-v1/p2p/16Uiu2HAm9GNsYsuXkLGM8r7bzoh2nFxkocvuUi9oFeaMUqdVPTB4"

Current Resource Management

Our Oracle Node currently uses libp2p's default resource management configuration with auto-scaling:

Implementation here (line 111): https://github.com/masa-finance/masa-oracle/blob/main/pkg/oracle_node.go

scalingLimits := rcmgr.DefaultLimits
concreteLimits := scalingLimits.AutoScale()
limiter := rcmgr.NewFixedLimiter(concreteLimits)
resourceManager, err := rcmgr.NewResourceManager(limiter)

This code uses libp2p's limit.go and limit_defaults.go from the github.com/libp2p/go-libp2p/p2p/host/resource-manager package.

  1. From limit_defaults.go:
scalingLimits := rcmgr.DefaultLimits

This line uses the DefaultLimits variable defined in limit_defaults.go. It is a ScalingLimitConfig that provides default values for various resource limits.

  2. Still from limit_defaults.go:
concreteLimits := scalingLimits.AutoScale()

This calls the AutoScale() method of ScalingLimitConfig, which in turn calls the Scale() method with automatically determined memory and file-descriptor values. The Scale() method applies the scaling logic to create a ConcreteLimitConfig (a sketch of calling Scale() directly follows this list).

  3. From limit.go:
limiter := rcmgr.NewFixedLimiter(concreteLimits)

This uses the NewFixedLimiter function defined in limit.go to create a new fixedLimiter with the concrete limits.

  4. Also from limit.go:
resourceManager, err := rcmgr.NewResourceManager(limiter)

This creates a new ResourceManager using the fixedLimiter we just created.
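
As a side note, AutoScale() is essentially Scale() fed with values probed from the host machine, so deterministic limits can be obtained by calling Scale() directly. A minimal sketch, using the same rcmgr import as above; the 2 GiB memory and 1024 file-descriptor budget are illustrative values, not recommendations:

scalingLimits := rcmgr.DefaultLimits

// Scale against an explicit memory budget (in bytes) and file-descriptor count
// instead of letting AutoScale() probe the machine.
concreteLimits := scalingLimits.Scale(2<<30, 1024) // 2 GiB, 1024 FDs

limiter := rcmgr.NewFixedLimiter(concreteLimits)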

The current configuration uses the default scaling limits provided by libp2p, which are then auto-scaled based on the system's available resources.

  1. DefaultLimits provides a base configuration for various resource limits (connections, streams, memory, etc.) and how they should scale with available resources.

  2. AutoScale() determines the available system resources (memory and file descriptors) and scales the limits accordingly.

  3. The resulting ConcreteLimitConfig is used to create a fixedLimiter, which enforces these limits.

  4. The ResourceManager is created with this limiter, which will then enforce these limits throughout the libp2p node's lifecycle.
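
Putting the four steps together, below is a minimal, self-contained sketch of how the limiter ends up governing a host. Passing the manager via the libp2p.ResourceManager option is the standard wiring; the actual option list in oracle_node.go includes more than what is shown here.

package main

import (
    "log"

    "github.com/libp2p/go-libp2p"
    rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
    // 1. Default scaling limits shipped with libp2p.
    scalingLimits := rcmgr.DefaultLimits

    // 2. Concrete limits, scaled to this machine's memory and file descriptors.
    concreteLimits := scalingLimits.AutoScale()

    // 3. A fixed limiter that simply enforces those concrete limits.
    limiter := rcmgr.NewFixedLimiter(concreteLimits)

    // 4. The resource manager wraps the limiter and is handed to the host,
    //    where it gates every connection and stream reservation.
    resourceManager, err := rcmgr.NewResourceManager(limiter)
    if err != nil {
        log.Fatal(err)
    }

    host, err := libp2p.New(libp2p.ResourceManager(resourceManager))
    if err != nil {
        log.Fatal(err)
    }
    defer host.Close()
}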

We might want to:

  1. Customize the DefaultLimits before calling AutoScale().
  2. Implement our own scaling logic instead of using AutoScale().
  3. Use a different type of limiter (e.g., a scaling limiter) instead of a fixed one.
  4. Directly configure a ConcreteLimitConfig with values tailored to our application's needs.

For example, if we want to increase the number of inbound connections allowed, we could do something like:

scalingLimits := rcmgr.DefaultLimits
scalingLimits.SystemBaseLimit.ConnsInbound = 128  // Double the default
scalingLimits.SystemLimitIncrease.ConnsInbound = 128  // Double the default increase
concreteLimits := scalingLimits.AutoScale()

This would allow more inbound connections both in the base case and as the system scales up. For the error above, which concerns outbound streams, the analogous knobs would be the StreamsOutbound fields, as in suggested solution 1 below.

This setup allows for some automatic scaling of resources based on the system's capabilities, but it doesn't include any custom limits or fine-tuning for our specific needs.

Impact

Stream reservations to remote workers fail, so the node cannot distribute work until it is restarted, as happened with gold-1 in AWS.

Suggested Solutions

  1. Custom Resource Limits: Implement custom resource limits tailored to our network's needs; since the error above names the transient scope, the TransientBaseLimit fields may need raising alongside the system-scope values. For example:

    scalingLimits.SystemBaseLimit.Streams = 8000
    scalingLimits.SystemBaseLimit.StreamsOutbound = 1000
    scalingLimits.SystemBaseLimit.Memory = 4 << 30 // 4 GB
  2. Dynamic Limiter: Replace the fixed limiter with a scaling limiter for more adaptive resource allocation:

    limiter := rcmgr.NewDefaultLimiterFromScalingLimits(scalingLimits)
  3. Resource Usage Monitoring: Implement a monitoring system to track resource usage and adjust limits dynamically.

  4. Increase Cloud Infrastructure Limits: Review and potentially increase the limits on our cloud infrastructure, particularly for network-related resources.

  5. Connection Management: Implement more aggressive connection management, including closing idle connections and limiting the maximum number of concurrent connections.

  6. Backoff and Retry Mechanism: Implement a backoff and retry mechanism when encountering resource limit errors to prevent immediate resource exhaustion (a sketch follows this list).
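
For the backoff and retry mechanism (item 6), here is a rough sketch of a retry wrapper around stream opening. Everything below is illustrative rather than taken from our codebase: sendWorkWithRetry, the attempt count, and the backoff values are placeholders, and matching the error with errors.Is(err, network.ErrResourceLimitExceeded) assumes the limit error is propagated in wrapped form; matching on the error string may be needed in practice.

import (
    "context"
    "errors"
    "fmt"
    "time"

    "github.com/libp2p/go-libp2p/core/host"
    "github.com/libp2p/go-libp2p/core/network"
    "github.com/libp2p/go-libp2p/core/peer"
    "github.com/libp2p/go-libp2p/core/protocol"
)

// sendWorkWithRetry (hypothetical helper) retries opening an outbound stream
// with exponential backoff when the resource manager rejects the reservation.
func sendWorkWithRetry(ctx context.Context, h host.Host, p peer.ID, proto protocol.ID, payload []byte) error {
    backoff := 250 * time.Millisecond
    const maxAttempts = 5

    for attempt := 1; attempt <= maxAttempts; attempt++ {
        stream, err := h.NewStream(ctx, p, proto)
        if err == nil {
            defer stream.Close()
            _, err = stream.Write(payload)
            return err
        }

        // Back off only on resource-limit errors; fail fast on everything else.
        if !errors.Is(err, network.ErrResourceLimitExceeded) {
            return err
        }

        select {
        case <-time.After(backoff):
            backoff *= 2 // exponential backoff between attempts
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return fmt.Errorf("giving up after %d attempts: resource limit exceeded", maxAttempts)
}

The round-robin loop in the log above already moves to the next worker on failure; a wrapper like this would just add a pause and a bounded number of retries before doing so.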

Next Steps

  1. Analyze current resource usage patterns
  2. Implement and test custom resource limits
  3. Set up monitoring for resource usage (see the sketch after this list)
  4. Review and adjust cloud infrastructure limits
  5. Implement connection management improvements
  6. Develop and integrate a backoff and retry mechanism
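
For the monitoring step, a low-effort starting point is polling the resource manager's own scope statistics: the network.ResourceManager interface exposes ViewSystem and ViewTransient for this. The helper below is a sketch under the assumption that the resourceManager built in oracle_node.go is available; the logging interval and the fields printed are arbitrary choices.

import (
    "log"
    "time"

    "github.com/libp2p/go-libp2p/core/network"
)

// logResourceUsage (hypothetical helper) periodically logs stream, connection
// and memory usage for the system and transient scopes.
func logResourceUsage(rm network.ResourceManager, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    for range ticker.C {
        _ = rm.ViewSystem(func(scope network.ResourceScope) error {
            st := scope.Stat()
            log.Printf("system scope: streams out=%d in=%d, conns out=%d in=%d, memory=%d",
                st.NumStreamsOutbound, st.NumStreamsInbound,
                st.NumConnsOutbound, st.NumConnsInbound, st.Memory)
            return nil
        })
        _ = rm.ViewTransient(func(scope network.ResourceScope) error {
            st := scope.Stat()
            log.Printf("transient scope: streams out=%d in=%d",
                st.NumStreamsOutbound, st.NumStreamsInbound)
            return nil
        })
    }
}

Started with something like go logResourceUsage(resourceManager, 30*time.Second), this would show how close the transient and system scopes are to their limits before the next stream-57946-style failure.
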
teslashibe commented 2 months ago

@restevens402 @mudler @Luka-Loncar add this to no-status. Found a bug here. The default action should be to increase the CPU and RAM on the AWS node before software optimizations.

@5u6r054 to assist on increasing machine specs on Monday 👍