statechannels / go-nitro

Implementation of nitro-protocol in Go
Other
37 stars 20 forks source link

Consistent failures with rpc_test #1323

Closed NiloCK closed 1 year ago

NiloCK commented 1 year ago

Since PR #1315, CI has been extremely brittle. #1315 introduced 0-hop and 2-hop virtual funding operations, mediated by the golang rpc_client, to the test suite (and CI). Previously, only the 1-hop virtual fund was under test.

recent example failures:

https://github.com/statechannels/go-nitro/actions/runs/5177367315/jobs/9327375026#step:7:578

...
rpc_test.go:423: bobVirtual received {0x981e13c023e2da0069ed544cc82c4ac7db43d7d1931eda38cca49bc127a9ef2c Complete {0x0000000000000000000000000000000000000000 0x6813Eb9362372EEF6200f3b1dbC3f819671cBA69 0x7E5F4552091A69125d5DfCb7b8C2659029395Bdf 0x1 0x63}}+
panic: send on closed channel

goroutine 2649 [running]:
github.com/statechannels/go-nitro/client/notifier.(*ledgerChannelListeners).Notify(0xc000258d20, {{0xe3, 0xb1, 0xaf, 0x5b, 0xd3, 0x40, 0x15, 0xf9, 0x1f, ...}, ...})
    /home/runner/work/go-nitro/go-nitro/client/notifier/listeners.go:89 +0x173
github.com/statechannels/go-nitro/client/notifier.(*ChannelNotifier).NotifyLedgerUpdated(0xc000a37770, {{0xe3, 0xb1, 0xaf, 0x5b, 0xd3, 0x40, 0x15, 0xf9, 0x1f, ...}, ...})
    /home/runner/work/go-nitro/go-nitro/client/notifier/channel-notifier.go:61 +0x1f8
github.com/statechannels/go-nitro/client.(*Client).handleEngineEvents(0xc0004c7860)
    /home/runner/work/go-nitro/go-nitro/client/client.go:107 +0x418
created by github.com/statechannels/go-nitro/client.New
    /home/runner/work/go-nitro/go-nitro/client/client.go:76 +0xbd3
FAIL    github.com/statechannels/go-nitro/client_test   0.935s

Here, sending on a closed channel generally indicates a go-nitro client attempting to communicate a protocol update message after it has externally been triggered to shut down (the closed channel in question is part of the event stream between go-nitro rpc-server and rpc-clients).

https://github.com/statechannels/go-nitro/actions/runs/5159120522/jobs/9293632835#step:7:751

...
     rpc_test.go:428: bobVirtual received unexpected notification: {0x81fc42e3c4f670506246ede8ed0cc8ba2cac129ae2f76ec04bd2eb9b3172ba81 Complete {0x0000000000000000000000000000000000000000 0x6813Eb9362372EEF6200f3b1dbC3f819671cBA69 0x7E5F4552091A69125d5DfCb7b8C2659029395Bdf 0x0 0x64}}
--- FAIL: TestRpcWithWebsockets (0.24s)

Here, failing with an unexpected notification - indication that a (maybe severe?) message reordering has occurred. Should be a rare flicker, but happening frequently since #1315.

https://github.com/statechannels/go-nitro/pull/1318 is a stub PR that retains the refactors to rpc_test.go present in #1315, but removes the 0-hop and 2-hop test runs. It has a better, but still not great, success rate.

NiloCK commented 1 year ago

Follow-up:

The second error above is a bit of a head scratcher. The assertions in the test setup are:

        requiredVCNotifs := createPaychStory(
        vabCreateResponse.ChannelId, alice.Address(), bob.Address(),
        []channelStatusShorthand{
            {100, 0, query.Proposed},
            {100, 0, query.Open},
            {99, 1, query.Complete},
        },
    )

    optionalVCNotifs := createPaychStory(
        vabCreateResponse.ChannelId, alice.Address(), bob.Address(),
        []channelStatusShorthand{
            {99, 1, query.Closing},
            // TODO: Sometimes we see a closing notification with the original balance.
            // See https://github.com/statechannels/go-nitro/issues/1306
            {99, 1, query.Open},
            {100, 0, query.Closing},
        },
    )

    aliceVirtualNotifs := aliceClient.PaymentChannelUpdatesChan(vabCreateResponse.ChannelId)
    checkNotifications(t, "aliceVirtual", requiredVCNotifs, optionalVCNotifs, aliceVirtualNotifs, defaultTimeout)
    bobVirtualNotifs := bobClient.PaymentChannelUpdatesChan(vabCreateResponse.ChannelId)
    checkNotifications(t, "bobVirtual", requiredVCNotifs, optionalVCNotifs, bobVirtualNotifs, defaultTimeout)

Here, it's required that each of Alice and Bob's clients emit a Complete, 99,1 notification. The CI run in the second error fails on Bob's notification check, which means that the Alice client did emit that notification. But the test fails because Bob's client emits a Complete, 100, 0 notification. The two clients have reported completion with different balances, which may or may not be reflected in their respective ledger channels. In this case, they have robbed themselves for Irene's benefit - Alice is down a coin, Bob is unchanged, and Irene is up one.

From discussion in #1306, I would have expected that the out-of-order closure request to be resolved with a latest voucher update. IE, we aren't surprised for Bob to report a closure request at 100,0, but we didn't expect him to complete it at that balance.

NiloCK commented 1 year ago

https://github.com/statechannels/go-nitro/actions/runs/5190468207/jobs/9357036073#step:7:770

Here Bob's client emits complete notifications with both 99,1 and 100,0 final balances.

lalexgap commented 1 year ago

I've split of the first error into it's own issue: https://github.com/statechannels/go-nitro/issues/1330

geoknee commented 1 year ago

The second error is a duplicate of #1313 -- @NiloCK shall we close this ?

geoknee commented 1 year ago

Debugging session today with @NiloCK and @geoknee.

Bob get's the following 3 notifications in the context of virtual defunding:

    rpc_test.go:429: bobVirtual received {0x1384c293ddc9418e9cd20e8f1d794cc01e94340a2bc0e367acb559c1f540c6ed Closing {0x0000000000000000000000000000000000000000 0x6813Eb9362372EEF6200f3b1dbC3f819671cBA69 0x7E5F4552091A69125d5DfCb7b8C2659029395Bdf 0x0 0x64}}+
    rpc_test.go:429: bobVirtual received {0x1384c293ddc9418e9cd20e8f1d794cc01e94340a2bc0e367acb559c1f540c6ed Complete {0x0000000000000000000000000000000000000000 0x6813Eb9362372EEF6200f3b1dbC3f819671cBA69 0x7E5F4552091A69125d5DfCb7b8C2659029395Bdf 0x0 0x64}}+
    rpc_test.go:429: bobVirtual received {0x1384c293ddc9418e9cd20e8f1d794cc01e94340a2bc0e367acb559c1f540c6ed Complete {0x0000000000000000000000000000000000000000 0x6813Eb9362372EEF6200f3b1dbC3f819671cBA69 0x7E5F4552091A69125d5DfCb7b8C2659029395Bdf 0x1 0x63}}+

Looking at this diagram: https://github.com/statechannels/go-nitro/blob/0f743aa42a06f8af83d88c72f5894bf277921a60/protocols/virtualdefund/virtualdefund-sequence-diagram.svg

There are only two protocol cranks (triggered by receiving a message) each will trigger a notification. The other way to trigger a notification is via receiving a voucher.

Hypothesis:

  1. Alice sends a voucher with a payment of 1
  2. Alice sends a final state with a payment of 1 resolved into it
  3. Bob receives a final state first. This triggers the Closing notification
  4. Bob then receives another final state (could be from Irene, or Alice if it was Irene the first time). This triggers a Complete notification. At this point Bob has the correct balance of 1 resolved into the virtual channel, but his notification code** ignores that and uses his voucher store. Since he has no voucher yet, the notification has a payment of 0
  5. Bob gets the voucher, and another Complete notification fires, this time with the payment of 1.

** This is the offending line

https://github.com/statechannels/go-nitro/blob/bed7a897e3f99b484fceb6f064dcaf779e86cac1/client/engine/engine.go#L631-L636

We should consider whether we want to emit a payment channel update when receiving a voucher. Certainly when the channel is closed, that update is pretty meaningless:

https://github.com/statechannels/go-nitro/blob/bed7a897e3f99b484fceb6f064dcaf779e86cac1/client/engine/engine.go#L396