pokt-network / poktroll

The official Shannon upgrade implementation of the Pocket Network Protocol implemented using the Cosmos SDK
MIT License

[Living Ticket] Scalability related efforts #621

Open okdas opened 4 months ago

okdas commented 4 months ago

Objective

Ensure that Shannon scales both on-chain & off-chain.

Origin Document

This issue is intended to be a living document to keep track of all related efforts.

Identified issues and points of investigation

Things to investigate:


Creator: @okdas Co-Owners: @red-0ne @bryanchriswhite @Olshansk

Olshansk commented 4 months ago

@okdas Made some changes, updates & improvements to this ticket. PTAL

okdas commented 2 months ago

To investigate: ran into a panic. We are potentially not handling the error from the RPC gracefully:

Panic Error

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3b15c58]

Goroutine Stack Trace

goroutine 297 [running]:
github.com/pokt-network/poktroll/pkg/relayer/session.(*sessionTree).Delete(0x4003e33040)
    /Users/dk/pocket/poktroll/pkg/relayer/session/sessiontree.go:250 +0xc8

github.com/pokt-network/poktroll/pkg/relayer/session.(*relayerSessionsManager).deleteExpiredSessionTreesFn.func1({0x51e5e00, 0x4000c07830}, {0x4001675aa0, 0x1, 0x1})
    /Users/dk/pocket/poktroll/pkg/relayer/session/session.go:456 +0x278

github.com/pokt-network/poktroll/pkg/observable/channel.ForEach[...].func1({0x4001675aa0, 0x1, 0x1})
    /Users/dk/pocket/poktroll/pkg/observable/channel/map.go:103 +0x6c

github.com/pokt-network/poktroll/pkg/observable/channel.goMapTransformNotification[...]({0x51e5e00, 0x4000c07830}, {0x51df2b0, 0x400157b620}, 0x40012bd008, 0x40012bd050, 0x40012da480)
    /Users/dk/pocket/poktroll/pkg/observable/channel/map.go:125 +0xc4

created by github.com/pokt-network/poktroll/pkg/observable/channel.Map[...] in goroutine 1
    /Users/dk/pocket/poktroll/pkg/observable/channel/map.go:24 +0x318

Related Log Messages

2024-08-16 17:19:22.783    {"level":"debug","message":"deleting expired session"}

2024-08-16 17:19:22.781    {"level":"error","error":"with hash: a451156fe642c5f425af9bc1818ae423307789be0a4c581d26621f7fc698a419: error in json rpc client, with http response metadata: (Status: 200 OK, Protocol HTTP/1.1). RPC error -32603 - Internal error: tx (A451156FE642C5F425AF9BC1818AE423307789BE0A4C581D26621F7FC698A419) not found: error encountered while querying for tx","message":"failed to create claims"}

2024-08-16 17:19:22.783    {"level":"error","error":"with hash: a451156fe642c5f425af9bc1818ae423307789be0a4c581d26621f7fc698a419: error in json rpc client, with http response metadata: (Status: 200 OK, Protocol HTTP/1.1). RPC error -32603 - Internal error: tx (A451156FE642C5F425AF9BC1818AE423307789BE0A4C581D26621F7FC698A419) not found: error encountered while querying for tx"}
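
A minimal sketch of one possible mitigation, not a confirmed fix: the trace shows the nil dereference inside `(*sessionTree).Delete`, and the logs show a failed claim creation shortly before the expired session is deleted. A defensive `Delete` could return an error instead of panicking when the tree or its backing store is missing. The `sessionTree` and `kvStore` types and the `errNoStore` error below are simplified, hypothetical stand-ins, not the actual poktroll types, and the real cause at sessiontree.go:250 is still to be investigated.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// kvStore is a hypothetical stand-in for the store backing a session tree.
type kvStore struct{ stopped bool }

// Stop marks the store as stopped; a real store would release resources here.
func (s *kvStore) Stop() error {
	s.stopped = true
	return nil
}

// sessionTree is a simplified, hypothetical model (NOT the poktroll type).
type sessionTree struct {
	mu    sync.Mutex
	store *kvStore // assumed to be nil if setup failed or cleanup already ran
}

// errNoStore is a hypothetical sentinel error used only in this sketch.
var errNoStore = errors.New("session tree has no backing store")

// Delete releases the backing store. It returns an error rather than
// dereferencing a nil receiver or a nil store.
func (st *sessionTree) Delete() error {
	if st == nil {
		return errNoStore
	}
	st.mu.Lock()
	defer st.mu.Unlock()
	if st.store == nil {
		return errNoStore
	}
	err := st.store.Stop()
	st.store = nil // a second Delete call now returns errNoStore
	return err
}

func main() {
	var missing *sessionTree
	fmt.Println(missing.Delete())                           // nil receiver
	fmt.Println((&sessionTree{}).Delete())                  // nil store
	fmt.Println((&sessionTree{store: &kvStore{}}).Delete()) // normal path
}
```

With this shape, the caller that deletes expired session trees can log the error and move on instead of crashing the whole process.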
okdas commented 2 months ago

To investigate: given the nature of RelayMiner, it should try to recover first rather than stop.

RelayMiner stops on: {"level":"error","work_name":"goPublishEvents","error":"eventsqueryclient connection closed","message":"on retry: 1"}
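
As a rough illustration of "try to recover first", here is a minimal retry-with-backoff sketch. The function names (`retryWithBackoff`, `simulateFlakyConnection`), the retry budget, and the delays are all assumptions for illustration; this is not the existing poktroll retry logic, and `errConnClosed` only mirrors the message from the log line above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errConnClosed mirrors the error text from the log; hypothetical sentinel.
var errConnClosed = errors.New("eventsqueryclient connection closed")

// retryWithBackoff calls work until it succeeds, the context is cancelled,
// or maxRetries retries have been used. The delay doubles after each failure.
func retryWithBackoff(ctx context.Context, maxRetries int, initialDelay time.Duration, work func() error) error {
	delay := initialDelay
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = work(); err == nil {
			return nil
		}
		if attempt == maxRetries {
			break // no retries left, do not sleep again
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

// simulateFlakyConnection fails `failures` times before succeeding and
// reports how many times the work function was called.
func simulateFlakyConnection(failures, maxRetries int) (calls int, err error) {
	err = retryWithBackoff(context.Background(), maxRetries, time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return errConnClosed
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := simulateFlakyConnection(2, 5)
	fmt.Println(calls, err) // recovers on the third call

	calls, err = simulateFlakyConnection(10, 2)
	fmt.Println(calls, err) // retry budget exhausted
}
```

Whether RelayMiner should eventually give up (as here) or keep retrying indefinitely is a design choice this sketch does not settle.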

Olshansk commented 2 months ago

@okdas This is related to the observable, so I think we may be reaching a place where:

  1. A deadlock happens (or something mutex related)
  2. The observable is blocked on events (either empty or too many)

Do you mind creating a dedicated ticket for your comment here for @bryanchriswhite to tackle?
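
To make the second hypothesis above concrete, here is a small sketch of how one slow subscriber can stall a publisher, and one common way to avoid it. It is a generic Go channel pattern under assumed names (`publishNonBlocking` is hypothetical) and is not the `pkg/observable` implementation. Whether dropping events is acceptable for RelayMiner is a separate question.

```go
package main

import "fmt"

// publishNonBlocking tries to send value to every subscriber without
// blocking. A subscriber whose channel is full (or unbuffered with no
// reader) is skipped, and the number of skipped sends is returned.
// A plain `ch <- value` here would instead block the publisher forever
// on the first stuck subscriber.
func publishNonBlocking[V any](subscribers []chan V, value V) (dropped int) {
	for _, ch := range subscribers {
		select {
		case ch <- value:
		default:
			dropped++
		}
	}
	return dropped
}

func main() {
	fast := make(chan int, 2) // buffered: accepts two events before filling up
	slow := make(chan int)    // unbuffered with no reader: always "stuck"
	subs := []chan int{fast, slow}

	fmt.Println(publishNonBlocking(subs, 1)) // 1 (slow dropped)
	fmt.Println(publishNonBlocking(subs, 2)) // 1 (slow dropped)
	fmt.Println(publishNonBlocking(subs, 3)) // 2 (fast is now full too)
	fmt.Println(len(fast))                   // 2
}
```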

okdas commented 1 month ago

There are some new issues uncovered by #742 (more details in that ticket). So far nothing is super critical, and we are addressing issues as we find them.

okdas commented 6 days ago

Just to provide an update: we've been finding and resolving different issues lately, mostly within the scope of #742.