Light Client Refactor Review

ebuchman commented 4 years ago

Did a full review of the new light-client code. It looks really great, especially all the linking to the spec, the separation of components, the clarity of the sequential list of predicates, and the overall completeness of everything. That said I have some concerns and a variety of feedback, which I'll just braindump here, roughly by section/concern.

And apologies for having not been able to review more frequently. I've gotten into a habit recently of dropping large reviews all at once after months of work, which probably isn't the best approach. I'm going to try and fix that moving forward.

ADR

At a meta/process level, we probably shouldn't have done this without a more complete ADR. That's my bad, I explicitly dismissed it at the time, but I think I was anxious about making progress on this and we seem to have underestimated how long this work would take (classic). It definitely would have helped to have it during the review and to codify why we did this refactor and how it compares to the previous versions. Some of the below concerns may have been properly addressed in an ADR, including a possible iterative plan to address them. It might even be worth still writing one.

Coupling to Tendermint types

The new crate seems heavily coupled to the tendermint crate. One of the original design goals was for the light-client logic to be decoupled from the tendermint types, and the previous version pretty much achieved this via the header/commit/validator etc traits. Is there some plan for how to address this?

My understanding was that the predicates would be closures, and would close over the specific types, so they wouldn't need to take those concrete types as arguments. Is that still the plan?

This decoupling isn't critical in the short term, and right now we need to focus on shipping this rather than more refactors, but I still think it matters, so we should have a plan for addressing down the road.

The traits in operations expose more or less the same set of methods that were exposed by the traits in the prior version, which suggests they'd need to be mocked out similarly, but my understand was that we wanted to get away from mocking data like that. Iirc, a big motivation for the refactor was supposed to be for improved unit tests by reducing the need to mock these kinds of traits. Meanwhile, we don't have any unit tests yet, which might be a bit concerning. My impression is that given the amount of coupling, unit testing eg. the Verifier::verify method will require a lot of mocking, similar to the previous version of the code ...

Other High Level

We're using a lot of use::* - we should probably prefer explicit imports.

We should create issues for the TODO and FIXME in the code.

Bisection Tracing

Can we find a better name for trace_block? We're not really tracing here, we are marking something is on a trace. At least a better comment where it's called
Can we better document who get_trace is for (ie. IBC)

Store

What's with the secondary indices for the verification status? Don't we only have one header per height at a time?
If not, memory store doesnt seem to support more than one header per height?
In any case can we better document the need
What's the plan for implementing latest so we don't iterate over everything (ie. see the FIXME)

Light Client

In what case might get_or_fetch return an already verified header?
"If the trusted state is now at the height greater or equal to the target height," - how could we ever get to something greater than the target height?

Operations

Looks like we copied a lot of stuff in from the tendermint crate. Presumably these will get deduplicated?
voting_power_in: this function was already broken (#282), but in any case we may want/need two versions of it, one for when the validator set matches the commit (ie. when we're verifying +2/3 signatures for a block) and one for when we're checking the overlap between the signatures and our last trusted validators. This may become necessary especially if the ValidatorAddress is removed from the CommitSig, which is being contemplated upstream. Then we'll need access to both the current and last trusted validator sets for the verify_overlap computation.
On another note, I don't think these functions have ever had tests, which shows because voting_power_in is broken lol. We should open an issue for that if we don't have already.

Predicates

is_within_trust_period:
- I don't think this needs to be called each time, only once at the start of a bisection trace (the other calls should be redundant since we can't expire in there). Not a problem, just a note.
- the header.time < now check seems to be duplicated
- the header.time <= now check should probably use header.time < now + delta, ie #137
header_matches_commit:
- there was a recent discussion around possibly removing the commit.BlockID field (along with a bunch of other redundant data in Tendermint commits/blocks), since it's technically redundant. If that happens, this check would become implementation dependent.
verify_overlap
- this should just check has_sufficient_validators_overlap but not has_sufficient_signers_overlap
- the former is the actual overlap check we need for bisection. the latter is the full verification for the commit
has_sufficient_voting_power
- this should be the full +2/3 verification that the validators correctly committed the block.
- I believe it should be what is currently has_sufficient_signers_overlap

Verifier

the verify method is beautiful :)
but verify_overlap should only be called in the non-sequential case, ie. when light_block.height() != trusted_state_next_height. Otherwise verification would fail if the validator set changes significantly from one block to the next and we'd never be able to make progress. We should definitely have a JSON test for this, I'm a bit surprised if we don't already ...

Errors

InsufficientValidatorsOverlap - should we include the fraction and the threshold we were looking for? Maybe the fraction can just be part of the string display, but the threshold might be good to include.

brapse commented 4 years ago

Thanks for taking the time to do a thorough review. I agree with most of the concerns here. A few comments at the meta level before jumping into specific points. Even if I consider the refactor successful, I think we could have scoped it tighter. This have gotten it merged faster and made it easier to describe in a ADR. We had some key goals that were addressed but a lot of the value that we got from this work was peripheral and not negotiated in the original scope. We can definitely be more diligent here in the future. With that said I’ll focus of the process/architecture concerns.

ADR

Agree here and as mentioned above this would have been helped by better scoping. We kind of abused ADR-006 here to capture the evolution of our thinking but I sense that a lot of details got lost in the quick iterations. The process was dynamic and aligned with the English specification. Most of this alignment was done via live code reviews in our weekly meeting. There were so many ideas flying around and being experimented with that it was difficult to capture intermediate thinking in writing. The speed of the process was valuable but admit that it would be better/more inclusive if we kept a better record of our thinking and decisions. I think you’re right that it is worth describing the conclusion of this work in one or perhaps multiple ADRs.

Coupling Of Tendermint Types

Yes the intention is for the light client to use its own types which only have the fields the light client needs. But those types will most likely be constructed from the original Tendermint types for simplicity (or have From trait implementations). We haven’t gotten there yet as we wanted to merge this as soon as possible.

What we did de-couple is the crypto functions from their types by isolating the dependencies in the operations module as you mention. The idea here is to update Tendermint crate such that none of our core types depend on complex crypto functions. It will also allow us to inject simplified implementation of these operations during testing. So most likely the operations stuff will move into the main crate and provide mock implementations for testing. This way the simplified core types will not need to be mocked out in test, only the operations will. The current location of this code is an intermediate step in that direction.

ebuchman commented 4 years ago

Also a note about the scheduler. It's a bit hard to read the bisection logic out of it.

Can we use single word variable names like trusted, next, and target to make this more readable in cases like this where it's obvious we're dealing with heights and no other types?
The conditional block is hard to read. Can we clarify the cases somehow by relying on relationships that should always hold when verifying into the future, like (?) :
- trusted <= next
- trusted <= target
- next <= target
If the above conditions are true, it's hard to understand that last branch of the conditional (it would never trigger?)
How to reconcile previous point with need to schedule into the past? Split the schedule function in two ?

romac commented 4 years ago

@ebuchman Thanks for thorough review :)

I have answered some of your questions and tracking progress on your comments in https://gist.github.com/romac/fe0415aba0397bf14912b505d5eba785.

brapse commented 4 years ago

Can we confirm that the major points have been either addressed in (#284 #302 ) or broken out into follow up issues?

brapse commented 4 years ago

Closing for now. Feel free to open if something was missed.

informalsystems / tendermint-rs