hyperledger / besu

An enterprise-grade Java-based, Apache 2.0 licensed Ethereum client https://wiki.hyperledger.org/display/besu
https://www.hyperledger.org/projects/besu
Apache License 2.0
1.49k stars 815 forks source link

7311: Prevent PendingPeerRequest from switching to a new peer unless no peer is specified #7531

Closed Matilda-Clerke closed 1 month ago

Matilda-Clerke commented 1 month ago

PR description

Prevents PendingPeerRequest from switching to a new peer, allowing this to be controlled (and appropriately monitored) by higher level code

Fixed Issue(s)

https://github.com/hyperledger/besu/issues/7311

Matilda-Clerke commented 1 month ago

This change will have the affect that some tasks which would previously automatically switch to a new peer, if the supplied peer is disconnected, to instead fail with a PeerDisconnectedException.

siladu commented 1 month ago

This change will have the affect that some tasks which would previously automatically switch to a new peer, if the supplied peer is disconnected, to instead fail with a PeerDisconnectedException.

Is the idea to create a subsequent PR to handle this exception and "switch" the peer a higher level, or is the code already handling this?

If more code needs to be written, any reason the higher-up handling can't be merged in first? My concern is we could end up with a stuck pipeline, but I don't know this code too well so will follow your or @pinges lead :)

siladu commented 1 month ago

From the issue description, I was expecting to see something about "too many ongoing requests" - is that encapsulated in the "isDisconnected" check or going to be handled separately to this PR?

Matilda-Clerke commented 1 month ago

Is the idea to create a subsequent PR to handle this exception and "switch" the peer a higher level, or is the code already handling this?

For the most part, the code already handles this so long as the task extends AbstractRetryingSwitchingPeerTask or similar. There are several that don't, and I'm not sure if they should or if they should just be left to fail. The code is very complicated, with 4+ layers of inheritance mixing together peer selection, task execution, and retrying. It's a bit of a mess. Ideally, it would be great to refactor this into more clearly defined modules that can then work together (e.g. one for task execution, one for peer selection, etc), but that's a big job, and one I don't feel I understand the existing code well enough to undertake yet. I'm getting the feeling maybe no one really understands this code well enough to confidently refactor or rebuild.

From the issue description, I was expecting to see something about "too many ongoing requests" - is that encapsulated in the "isDisconnected" check or going to be handled separately to this PR?

This is actually already handled. In the case that the selected peer has too many requests currently, the task is put into a Collection for later retrying.

siladu commented 1 month ago

There are several that don't, and I'm not sure if they should or if they should just be left to fail.

That sounds a bit dangerous, and it's hard to know you've covered all scenarios (including different types of network) without extensive testing. Maybe there's a way to feature toggle it?

Given the central location of the change, and the recent instability we've had around peering, I'd err on the side of not merging into main until the impact is fully understood and/or thoroughly tested.

Maybe @pinges can help design a solution without the need for a large refactor.

Matilda-Clerke commented 1 month ago

After investigating and building this class diagram to better understand the structure of the code, I decided to also modify AbstractPeerRequestTask to retry with a fresh peer when the peer is disconnected.

Matilda-Clerke commented 1 month ago

Latest changes are causing a failing unit test which I can't find a way to fix without substantial refactoring. For now, I'm focusing on writing up a full analysis of the current code, and my ideas for how we might refactor or replace the code to improve readability, modularity, testability, and modifiability.