lightninglabs / neutrino

Privacy-Preserving Bitcoin Light Client
MIT License

query+neutrino: use query dispatcher for GetBlock and GetCFilter #273

Closed by ellemouton 1 year ago

ellemouton commented 1 year ago

This PR refactors the GetBlock and GetCFilter methods so that they use the work dispatcher for their queries instead of the old queryPeers function. That function is now removed, bringing us one step closer to removing all query logic from the main package.
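As a rough illustration of the refactor's shape (all types and names below are simplified stand-ins, not neutrino's actual API): callers build request objects and hand them to a shared dispatcher, then wait on an error channel, instead of each method looping over peers itself as the old queryPeers code did.

```go
package main

import (
	"errors"
	"fmt"
)

// Request pairs a message to send with a handler for responses.
// Illustrative only; the real package defines richer types.
type Request struct {
	Msg        string
	HandleResp func(resp string) bool // returns true once satisfied
}

// Dispatcher is a simplified stand-in for the work dispatcher: it
// schedules requests across peers centrally rather than having each
// caller manage peer iteration.
type Dispatcher interface {
	Query(reqs []*Request) chan error
}

type mockDispatcher struct{}

func (m *mockDispatcher) Query(reqs []*Request) chan error {
	errChan := make(chan error, 1)
	go func() {
		for _, req := range reqs {
			// Pretend a peer answered by echoing the message.
			if !req.HandleResp(req.Msg) {
				errChan <- errors.New("query not satisfied")
				return
			}
		}
		errChan <- nil
	}()
	return errChan
}

// getBlock shows the refactored shape: build a request, hand it to
// the dispatcher, and wait on the returned error channel.
func getBlock(d Dispatcher, hash string) (string, error) {
	var block string
	req := &Request{
		Msg: "getdata:" + hash,
		HandleResp: func(resp string) bool {
			block = resp
			return true
		},
	}
	if err := <-d.Query([]*Request{req}); err != nil {
		return "", err
	}
	return block, nil
}

func main() {
	block, err := getBlock(&mockDispatcher{}, "0000abcd")
	fmt.Println(block, err)
}
```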

ellemouton commented 1 year ago

> I assume we'll battle test all these improvements mostly indirectly through the lnd itests?

Ah, that's a good point - I should open an LND PR pointing to all these changes so that the neutrino itest suite can be run 👍

EDIT: cool, doing that here for now

ellemouton commented 1 year ago

hmmm - ok so I think the unit-race failure is a flake. Gonna re-run it...

ellemouton commented 1 year ago

I saw that the unit tests were failing in https://github.com/ellemouton/lnd/pull/69 but they all passed locally.

ah - let me open one that only points to this and not the others. Just to isolate it a bit.

ellemouton commented 1 year ago

ok, I've updated https://github.com/ellemouton/lnd/pull/69 to just point to this PR. fingers crossed for unit tests 🤞

ellemouton commented 1 year ago

erg - realised we will need to make an update in btcwallet as well, since it uses the old WorkManager struct, which is now an interface 🙃

ellemouton commented 1 year ago

ok the LND unit tests seem to consistently fail (at least on GH) when pointing neutrino to this PR. I will need to investigate. I will ping on here once things are working 👍

ellemouton commented 1 year ago

alrighty! after searching for waaaayyy too many hours - I found the issue 🙈

The issue is that the work manager, as it currently stands, doesn't have any notion of a maximum number of retries: if a query fails, it just reschedules the job to be tried again later. This is not the behaviour the rescan code expects, especially during a re-org. The rescan function wants the query to error out so that it can retry itself. We want the calling function to control retries because the query may have failed due to a re-org, in which case the query itself needs to change. So what was happening in the failing tests was: a query was started to fetch a filter for a certain block hash, a re-org happened, and the query manager just kept retrying the same query with the old block hash forever.

So now I have added a commit that allows the caller to set the maximum number of retries.

gonna re-request review since a new commit was added (it is the second commit)

LND things now passing: https://github.com/ellemouton/lnd/actions/runs/5046731469/jobs/9052661472?pr=69

ellemouton commented 1 year ago

have thought about it a bit more & I think I should change it a bit to allow a NumRetries of -1, which would mean no retry maximum. Just so that the behaviour of the blockManager's use of the query manager stays the same.

ellemouton commented 1 year ago

erg pushed to wrong branch. pls ignore for now. will ping when ready :)

ellemouton commented 1 year ago

Ok y'all - I think this is finally ready for another look :) The NumRetries and NoRetryMax functional options have been added.

One thing I'm not sure how to deal with and would love a second opinion on: the GetBlock and GetCFilter methods currently take in a set of neutrino.QueryOptions, whereas the query.Query method takes in a set of query.QueryOptions. There is not a 1:1 mapping between these two sets... So the question is: should we completely change the query.WorkManager interface in order to force a 1:1 mapping, or should we just add a comment saying which functional options no longer apply to GetBlock and GetCFilter?

Roasbeef commented 1 year ago

> the GetBlock and GetCFilter methods currently take in a set of neutrino.QueryOptions, whereas the query.Query method takes in a set of query.QueryOptions. There is not a 1:1 mapping between these two sets... So the question is: should we completely change the query.WorkManager interface in order to force a 1:1 mapping, or should we just add a comment saying which functional options no longer apply to GetBlock and GetCFilter?

Need to check out the updated diff, but is it not possible to create a 1:1 mapping between them?

Or is this about documenting a potential breaking API change?

Roasbeef commented 1 year ago

Ah ok, I see the distinction now:

https://github.com/lightninglabs/neutrino/blob/d209c5e8abdb6d10b03b088a1f0bf40540fcf097/query.go#L140-L151

vs

https://github.com/lightninglabs/neutrino/blob/d209c5e8abdb6d10b03b088a1f0bf40540fcf097/query/interface.go#L35-L48

ellemouton commented 1 year ago

@Roasbeef - re-requesting just to get your opinion on how to move forward with the two separate/conflicting query option sets

ellemouton commented 1 year ago

> Can we turn https://github.com/ellemouton/lnd/pull/69 into an upstream PR so we can track CI there as well?

Yeah! done here: https://github.com/lightningnetwork/lnd/pull/7788

lightninglabs-deploy commented 1 year ago

@roasbeef: review reminder
@positiveblue: review reminder

Roasbeef commented 1 year ago

> So I'm trying to determine whether we should do a whole thing to make the options mappable, or whether we should rather just add extra documentation around the functional options saying "This one won't apply to GetBlock" or w/e. And then the other option is to break the API completely and have a new set of functional options.

So I think we may be able to unify things by inserting a type param here: there'd be a set of "static" values (say the timeout, etc.), then a value T that's inserted to allow the specific package/call site to supply the extra information it needs. So something along the lines of: https://golang.design/research/generic-option/

I don't think this should block the rest of this PR series, though, as up front we've been able to identify some option abstraction leaks while ensuring the existing behaviour is properly captured as is.

Roasbeef commented 1 year ago

> So does this then mean we add a WithStaticQueryTimout option to the query dispatcher and lose the nice dynamic timeout feature there?

Agreed that in most cases we do want a dynamic timeout vs a static one. In the past, we had some hard-coded timeouts tuned by devs in SF; then those in South America needed to retune things, as certain timeouts were no longer valid in the higher-latency environment (this was mobile as well).