query: fix retry query case.

ziggie1984 commented 4 months ago

Fixes https://github.com/lightningnetwork/lnd/issues/8593

In general I will rework this code area in LND 19 because this needs an overhaul and relates to https://github.com/btcsuite/btcwallet/issues/904

In the current code there is an edge case which manifests itself when the used backend has very unstable connections and also not a lot of outbound peers which results in queryJobs waiting a long time in the queue. Then we would timeout the batch here:

https://github.com/lightninglabs/neutrino/blob/43f5a588eb4a53cb36d118420e186629154086da/query/workmanager.go#L420-L433

but the queryJob was already registered for a second try here:

https://github.com/lightninglabs/neutrino/blob/43f5a588eb4a53cb36d118420e186629154086da/query/workmanager.go#L388-L389

Now this query will be tried all over again and never purged from the queue and due to how the heap map works it will always retry the query with the lowest index number. So this will cause us to never try new queries but always retry the old ones.

ziggie1984 commented 4 months ago

Maybe we should also increase the timeouts/retries while we are here, so that we can mitigate cases like this: https://github.com/lightningnetwork/lnd/pull/8497. Where the default 2 retries with a timeout of 2 and then 4 seconds just aren't enough also taking the overall timeout of the batch of 30 seconds into account.

2 sec timeout: https://github.com/lightninglabs/neutrino/blob/43f5a588eb4a53cb36d118420e186629154086da/query/workmanager.go#L14

4 sec timeout after first failed retry:

https://github.com/lightninglabs/neutrino/blob/43f5a588eb4a53cb36d118420e186629154086da/query/workmanager.go#L375-L379

Overall batchtimeout 30 sec: https://github.com/lightninglabs/neutrino/blob/43f5a588eb4a53cb36d118420e186629154086da/query/interface.go#L10-L12

Retries 2:

https://github.com/lightninglabs/neutrino/blob/43f5a588eb4a53cb36d118420e186629154086da/query/interface.go#L18-L20

ziggie1984 commented 4 months ago

Ok I came up with a slightly different approach, we will now use a cancel channel and close it instead of looping through the queryJob array, maybe we should do both (close channel to remove active requests and remove all the one still queued, I think just the first one is enough) ?. Let me know if its even a more hacky way, tho have to think about a new test for this.

ProofOfKeags commented 4 months ago

Ok I came up with a slightly different approach, we will now use a cancel channel and close it instead of looping through the queryJob array, maybe we should do both (close channel to remove active requests and remove all the one still queued, I think just the first one is enough) ?. Let me know if its even a more hacky way, tho have to think about a new test for this.

Don't worry about it, this is fine enough for a change this small. I would like to see neutrino undergo some refactoring at some point but I don't think we should turn your 1-commit fix into that project.

ziggie1984 commented 4 months ago

Sounds good, I am writing a test for the new approach and then we ship it until we get the new refactor into the codebase

ziggie1984 commented 4 months ago

@ProofOfKeags actually the former version had a bug in it, it would crash in some circumstances which was only revealed to me while adding a unit test for this case. We cannot just remove the heap entries for the queries because they might already be registered with the workers. So the prior version would crash.

Good reminder to never ack something without tests... 😅

ProofOfKeags commented 4 months ago

Good reminder to never ack something without tests...

Good catch. I'll admit that the way we do testing makes it really hard for me to tell whether the tests are actually good or not. I'm actively trying to improve our library infrastructure. The better we factorize things, the smaller the tests will be and the easier they will be to evaluate too. It requires a lot of discipline though and time though.

Chinwendu20 commented 4 months ago

@ProofOfKeags actually the former version had a bug in it, it would crash in some circumstances which was only revealed to me while adding a unit test for this case. We cannot just remove the heap entries for the queries because they might already be registered with the workers. So the prior version would crash.

Interesting, I thought if they had already been registered with a worker, it would have been deleted from the heap, then and so won't be present in the purge you did earlier. https://github.com/lightninglabs/neutrino/blob/1ef869ffea1a88333dbb2c8e846e384ea6e29390/query/workmanager.go#L244

Good reminder to never ack something without tests... 😅

A learning point for me as well.

This approach works too, just that we would have to wait to schedule the job that has the batch cancelled already before it gets a cancel signal.

ziggie1984 commented 4 months ago

Interesting, I thought if they had already been registered with a worker, it would have been deleted from the heap, then and so won't be present in the purge you did earlier.

Yes so the main problem is that as soon as a job is Pop()-ed from the heap it is not immediately removed from the currentQueries map, so when the batch times out and registered queries are still ongoing with the workers I would try to delete queries which are already Pop()-ed from the heap. I think the way its now, is better because it uses channels in a clean way. But as said at the beginning and also underlined by @ProofOfKeags this code part needs a refactoring.

lightninglabs-deploy commented 4 months ago

@ellemouton: review reminder @ziggie1984, remember to re-request review from reviewers when ready

ProofOfKeags commented 4 months ago

I think we should instead have an internal-only way of canceling queries that is separate from the way that a caller cancels queries

Sounds like I should revive my work on the Async API. 😅

ziggie1984 commented 4 months ago

Nice input @ellemouton 🙏, introduced a new internalCancelChannel for the query to distinguish both cases.

ellemouton commented 4 months ago

@ziggie1984 - before we merge this, can you open an LND PR that points to this so we can make sure the CI passes?

ziggie1984 commented 3 months ago

@guggero I think we can merge this now, itests passed on the LND PR (https://github.com/lightningnetwork/lnd/pull/8621)

lightninglabs / neutrino

query: fix retry query case. #297