query: end completed queries in sendTransaction before timeout

chappjc commented 2 years ago

(*ChainService).sendTransaction always times out even if each peer does respond. This is because after receiving a getdata message in response to the outgoing inv, the checkResponse closure provided to queryAllPeers neglects to close the peerQuit channel. As a result, sendTransaction always blocks for the full configured BroadcastTimeout.

This change adds the missing close(peerQuit) when a matching response (either a getdata or reject) is received. I have tested this fix in our app, and sends no longer always time out when all peers respond with a getdata.

This change also makes a note in the sendTransaction function that the if numReplied == len(rejections) { test at the end is currently impossible to hit since numReplied would also need to be incremented when rejections are recorded. I did not fix this however since it would potentially start returning non-nil errors for harmless rejections, such as "already-in-mempool", "already have transaction", and "transaction already exists" (pushtx.Mempool and pushtx.Confirmed). There are existing TODO comments from @wpaulino (wilmer) to this effect. I did add a debug log though to help characterize what rejection messages are actually received in the wild.

Finally, this adds a comment in (*blockManager).fetchFilterFromAllPeers, which also appears to be lacking a close(peerQuit) in it's queryAllPeers check closure. I added a debug log instead. I'm pretty certain this fix is needed here too, but this code is only called from (*blockManager).detectBadPeers, and I haven't hit that in my tests yet today.

chappjc commented 2 years ago

Uh oh, random CI error after squashing. https://github.com/lightninglabs/neutrino/runs/4853905111?check_suite_focus=true#step:4:160

 --- FAIL: TestNeutrinoSync (80.86s)
    --- FAIL: TestNeutrinoSync/start_long-running_rescan (39.21s)
        sync_test.go:535: Wrong number of relevant transactions. Want: 2, got: 1

chappjc commented 2 years ago

Well darn I'm not sure what the test failure is about, but it must be related. I have been running the unit tests prior to pushing so it's unfortunately a transient failure:

--- PASS: TestNeutrinoSync (115.56s)
    --- PASS: TestNeutrinoSync/initial_sync (3.10s)
    --- PASS: TestNeutrinoSync/one-shot_rescan (3.10s)
    --- PASS: TestNeutrinoSync/start_long-running_rescan (48.75s)
    --- PASS: TestNeutrinoSync/test_blocks_and_filters_in_random_order (0.15s)
    --- PASS: TestNeutrinoSync/check_long-running_rescan_results (58.73s)
PASS
coverage: 63.1% of statements

Looking closer now, but I'm not familiar with this unit test and nothing is jumping out yet.

chappjc commented 2 years ago

Looks like the tests are assuming the SendTransaction will return after the MsgTx was actually sent. sp.QueueMessageWithEncoding is an async send, and if SendTransaction does not always time out, it's more likely that the subsequent harness.h1.Node.Generate(1) in the test will not include the transaction having not yet received it.

https://github.com/lightninglabs/neutrino/blob/53b628ce175698d112ddff3b60856902556c2fc2/sync_test.go#L516-L520

I don't think timeout of SendTransaction that we're fixing here was by-design, but it did introduce an artificial Sleep between queuing the MsgTx send and generating the next block that increased the likelihood of the mining node having received and accepted it for work prior to generating the next block. AFAICT, there was still no confirmation of the MsgTx actually being sent, only queued, but there was an artificial delay before returning that made this more likely.

To keep the test working reliably on slower single core machines like CI probably uses, a time.Sleep(5*time.Second) would need to be inserted between SendTransaction and Generate, to reproduce what SendTransaction previously did. Alternatively, a getrawtransaction loop on harness.h1.Node could be rigged. I hate to suggest a time.Sleep in a test, but that would emulate the delay that is now gone.

chappjc commented 2 years ago

The assumption in the sync tests was indeed that when SendTransaction returned without error that the miner node would be ready to include that in a block, which worked when it always timed out because it gave the peer's out handler time to send it and the node time to accept it (although it was never guaranteed). I hope this update to the test is acceptable. The transient failure without the test update is actually an indication that the fix in this PR is working as intended.

lightninglabs-deploy commented 2 years ago

@chappjc, remember to re-request review from reviewers when ready

Roasbeef commented 2 years ago

Needs a rebase!

lightninglabs / neutrino

query: end completed queries in sendTransaction before timeout #240