pokt-network / poktroll

The official Shannon upgrade implementation of the Pocket Network Protocol implemented using the Cosmos SDK

[bug] Goroutine leaks on RelayMiner/AppGate #354

Closed okdas closed 5 months ago

okdas commented 9 months ago

Objective

Both AppGate and RelayMiner are creating goroutines without cleaning them up, leaking approximately 2 goroutines per relay.

Origin Document


Goals

How to reproduce

  1. Start LocalNet: make localnet_up
  2. Stake actors and run e2e tests to make sure LocalNet works: make supplier1_stake && make app1_stake && make test_e2e
  3. Get the current number of goroutines:

AppGate:

curl localhost:9093 | grep go_goroutines
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 77

RelayMiner:

curl localhost:9094 | grep go_goroutines
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 846

(This number already seems kind of high after running e2e tests)

  4. Now, run the basic load test: make load_test_simple
  5. Check the number of goroutines again:

AppGate:

curl localhost:9093 | grep go_goroutines
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 11743

RelayMiner:

curl localhost:9094 | grep go_goroutines
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 18219

General deliverables


Creator: @okdas Co-Owners: @red-0ne @okdas

Olshansk commented 9 months ago

CCing @ezeike @adshmh as well. This is the first (of many) tickets where protocol & backend work will start to overlap.

red-0ne commented 8 months ago

Goroutine leaks have been significantly reduced by #410. To compare results against the load test numbers above, the tests need to be run on the same hardware.

@okdas , could you please run load tests again on PR #410's branch with the same setup so we can compare numbers?

Olshansk commented 8 months ago

@red-0ne

could you please run load tests again on PR https://github.com/pokt-network/poktroll/pull/410's branch with the same setup so we can compare numbers?

Do we not have a way to reproduce this locally?

I recall reviewing & updating the instructions here: https://dev.poktroll.com/infrastructure/testing/load_testing

red-0ne commented 8 months ago

The leak count depends on the number of requests, and reference numbers would not be the same because the load testing script runs 100 VUs for 1 minute with as many requests as possible. So numbers may differ from one machine to another.

Olshansk commented 8 months ago

Got it.

My goal is not to have a specific number but understand how to reproduce it.

For example:

  1. If I wanted to check/observe/see if there are any leaks on my LocalNet, which doc do I read/follow to do this?
  2. If I wanted to check/observe/see if there are any leaks on my DevNet, which doc do I read/follow to do this?

red-0ne commented 8 months ago

  1. Have LocalNet up.
  2. Run curl localhost:PORT | grep go_goroutines (9003 for the AppGate server, 9004 for the RelayMiner).
  3. Run make send_relay.
  4. Repeat step 2 and compare the two numbers to see the difference for a single relay.

I will document this in Docusaurus and create a ticket to capture it.

Olshansk commented 7 months ago

@red-0ne will post an update next Friday on how many goroutine leaks we have. The goal is zero. If there is a VERY clear path to getting to zero, keep this open. If there isn't, close it out and new tickets will be created when necessary.

Olshansk commented 5 months ago

@okdas Do you think we have resolved enough goroutine leaks to close this one out?

okdas commented 5 months ago

@Olshansk looking at resource utilization on DevNets - we still saturate CPU when not serving requests. I'll get some pprof snapshots from these nodes so we can investigate.
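
For reference, one way to capture such a snapshot from inside a Go process is the standard `runtime/pprof` package. The sketch below returns the goroutine profile in its text form, which groups goroutines with identical stacks, so leaked ones stand out by count. `goroutineDump` is a hypothetical helper; how the DevNet nodes actually expose pprof is not stated in this thread.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/pprof"
)

// goroutineDump (hypothetical helper) returns the current goroutine profile
// as text. With debug=1, goroutines sharing a stack are aggregated under a
// single entry prefixed by their count.
func goroutineDump() (string, error) {
	p := pprof.Lookup("goroutine")
	if p == nil {
		return "", fmt.Errorf("goroutine profile not available")
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	dump, err := goroutineDump()
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Print(dump)
}
```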

Olshansk commented 5 months ago

Perfect, thanks for the update @okdas!

okdas commented 5 months ago

I think we're in a good position now with one exception we are going to work on - #551.