near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

Patch state non-deterministically failing on underpowered machines #8328

Open ChaoticTempest opened 1 year ago

ChaoticTempest commented 1 year ago

Describe the bug Patch state for sandbox fails when run concurrently with a large number of sandbox nodes. When the test fails, it reports the error message below. The test was patching the registrar account from testnet into the sandbox node.

Error: Failed to query access key: handler error: [Access key for public key ed25519:5WMgq6gKZbAr7xBZmXJHjnj4C3UZkNJ4F5odisUBFcRh has never been observed on the node]

This seems to happen on underpowered machines, such as those running a CI pipeline. Running the same tests locally (on an M1 MacBook) works fine.

This issue seems related to sharding, as the code for patching state appears to have been moved when sharding was added. The following code is a likely suspect:

https://github.com/near/nearcore/blob/52087fb24c8211f8c8770f31b6e3e427b7228799/chain/chain/src/chain.rs#L3738-L3740

To Reproduce This is a bit hard to reproduce, as it needs a somewhat underpowered machine running a lot of tests at once. It was first noticed in the PR tests of aurora-is-near/aurora-eth-connector. They have at least 36 tests, each of which spins up its own separate sandbox node and calls patch state at least once. Most of the tests pass, but one or two sometimes fail. It feels like there is data contention somewhere related to patching state.
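
One way to tell a timing problem from a lost patch might be to poll the failing query for a bounded time instead of failing on the first error. Below is a minimal sketch using only the Rust standard library. The helper name `retry_until_ok` and the closure standing in for the access-key query are hypothetical and are not part of workspaces-rs or nearcore.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: calls `op` until it returns `Ok` or `timeout` elapses.
/// If the deadline passes, the most recent error is returned.
fn retry_until_ok<T, E>(
    timeout: Duration,
    interval: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let deadline = Instant::now() + timeout;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of time: surface the last error to the caller.
            Err(err) if Instant::now() >= deadline => return Err(err),
            // Otherwise wait a little and try again.
            Err(_) => sleep(interval),
        }
    }
}

fn main() {
    // Stand-in for the access-key query: fails twice, then succeeds.
    let mut attempts = 0;
    let result: Result<u32, String> = retry_until_ok(
        Duration::from_secs(5),
        Duration::from_millis(10),
        || {
            attempts += 1;
            if attempts < 3 {
                Err("access key has never been observed on the node".to_string())
            } else {
                Ok(attempts)
            }
        },
    );
    println!("{:?}", result);
}
```

If the real query succeeds after a few retries, the patched state is probably just slow to become visible on a loaded machine. If it still fails at the deadline, the patch was more likely dropped.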

Expected behavior All tests pass as normal.

Additional context First reported in https://github.com/near/workspaces-rs/issues/253

frol commented 1 year ago

@ChaoticTempest It seems that fast-forwarding can get stuck: https://github.com/near/workspaces-rs/issues/266#issuecomment-1612735281. Do you think it could be related?