near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

🔷 [ProjectTracking] Forknet improvements #10542

Open posvyatokum opened 7 months ago

posvyatokum commented 7 months ago

Goals

Our toolbox for creating and managing mock test networks is large and versatile, with Forknet and the mirror tool standing out as the most powerful. The goal of this project is to continue the development of Forknet by unifying it with the mirror test infrastructure. The two have a lot in common, but some parts (e.g. the control plane) are done differently because of their different design philosophies. By merging them we will simplify the toolbox and make better use of it.

Background

We had several tests (the regular betanet test and the on-demand spoon test) that created a mocknet with simple predefined traffic. We wanted a way to make the traffic more meaningful, so we developed the mirror toolset: a way to test a binary (or binary release) via mocknet with a real traffic slice from mainnet. This tool was used only during releases, and only by the Node team, and it was not easy to use.

Independently, we also developed a toolbox to speed up creation of mainnet forks for testing (Forknet). Now, we want to combine fast test setup with working traffic mirroring, to create a better testing system. In order for it to be as useful as it can be, we are also focusing on user-friendliness.

Context

The project is split into three pillars: Correctness, Performance and Infrastructure.

Why should NEAR One work on this

Short term value:

Medium term value:

Long term value:

What needs to be accomplished

Correctness goal:

How we will do that:

Performance goals:

How we will do that:

Infrastructure goals:

How we will do that:

Links to external documentations and discussions

Assumptions

N/A

Pre-requisites

N/A

Out of scope

Custom test flow development is out of scope. If a feature needs specific network orchestration to be properly tested, we expect the feature engineer to write the orchestration script using the provided tools and examples.

Task list:

Roadmap

### Correctness
- [ ] Reliable Forknet v2 image creation using tools/fork-network
- [ ] Support use of RPC, legacy archival, and split storage archival nodes in the Forknet
- [ ] Ability to check that the transactions made it on chain with the desired outcome
### Performance
- [ ] Generate the desired TPS on forknet by mirroring transactions
- [ ] Allow additional locust traffic
- [ ] Monitoring for the issued traffic
### Infrastructure
- [ ] Simplify the network initialization
- [ ] Write guidelines for new test flow creation
- [ ] Add configurable Grafana dashboard creation to the test flow
- [ ] Create a script to compile Grafana dashboards and alerts into a test report
- [ ] Allow Nayduck tests to interact with an existing Forknet
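
One of the Correctness items above asks for a way to check that transactions made it on chain with the desired outcome. A minimal sketch of such a check follows, assuming the transaction result has the shape returned by the NEAR JSON-RPC `tx` method (a top-level `status` holding `SuccessValue` or `Failure`, plus `receipts_outcome` entries with their own `status`). The function name `tx_succeeded` and the sample data are hypothetical and not part of the existing forknet or mirror tooling.

```python
import base64


def tx_succeeded(tx_result, expected_value=None):
    """Return True if a transaction result looks successful.

    Assumption: `tx_result` is the `result` object of a NEAR JSON-RPC `tx`
    response. The exact shape should be verified against the node version
    under test.
    """
    status = tx_result.get("status")
    # A final status that is not a dict (e.g. still pending) is not a success.
    if not isinstance(status, dict) or "SuccessValue" not in status:
        return False

    # Any failed receipt means the transaction did not fully succeed.
    for receipt in tx_result.get("receipts_outcome", []):
        receipt_status = receipt.get("outcome", {}).get("status")
        if isinstance(receipt_status, dict) and "Failure" in receipt_status:
            return False

    # Optionally compare the base64-encoded return value with the expected bytes.
    if expected_value is not None:
        return base64.b64decode(status["SuccessValue"]) == expected_value
    return True


# Hypothetical sample result, for illustration only.
sample = {
    "status": {"SuccessValue": base64.b64encode(b"ok").decode()},
    "receipts_outcome": [{"outcome": {"status": {"SuccessValue": ""}}}],
}
print(tx_succeeded(sample, expected_value=b"ok"))
```

Keeping the check as a pure function over the decoded response makes it usable from any orchestration script, regardless of how the RPC call itself is issued.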

Real life progress

### Tasks
- [ ] #10581
- [ ] https://github.com/near/nearcore/issues/10642
- [ ] https://github.com/near/nearcore/issues/10922
- [ ] https://github.com/near/nearcore/issues/10957
- [ ] https://github.com/near/nearcore/issues/10959
- [ ] https://github.com/near/nearcore/issues/11086
### Bugs
- [ ] https://github.com/near/nearcore/issues/11694
- [ ] https://github.com/near/nearcore/issues/11730
### Backlog
- [ ] [Mirror] Add documentation on traffic mirroring

posvyatokum commented 7 months ago

2024-01-31 Meeting notes

CC: @marcelo-gonzalez @gmilescu @posvyatokum

We are again focusing our attention on the forknet. Forknet MVP would support:

We have developed two toolboxes for mocknet: Marcelo's mirror tools, and Vlad's forknet tools. We need to combine them into a working solution that hides away most of the complexity from the users, while providing enough flexibility.

To achieve this we will:

For a feature developer the flow of using forknet will look like this:

CI testing using forknet will use the same flow automatically. A new traffic slice image will be created monthly. A new forknet instance will be created and destroyed weekly. The test flow will simply start every node with the latest master binary.

Our first goal is a stable setup for forknet CI with one of the existing traffic slices.

posvyatokum commented 7 months ago

2024-02-07 Meeting notes

CC: @marcelo-gonzalez @gmilescu @posvyatokum

The first goal for the project is to test resharding on mocknet with split storage nodes (#10581). This effort will allow us to add split storage nodes to any testing setup in the future. We are allowing ourselves to build on top of the established mirror infrastructure, as it is fully working at the moment. Its downside is non-optimal performance, which leads to long waits between starting to set up the test and the start of the test itself. Right now it makes sense for us to gradually adjust the mirror infra to use improved tools (like https://github.com/near/nearcore/tree/master/tools/fork-network), rather than build new testing infrastructure from the ground up. In the end we are aiming for fully optimized performance, without ever losing the ability to do a complete test in the process.

posvyatokum commented 6 months ago

2024-02-14 Meeting notes

CC: @marcelo-gonzalez @gmilescu @posvyatokum

We are still working on supporting split storage, and have made some progress in issue #10581. At the same time we took time to create a better roadmap for the project and agree on the order of delivering features. For people with access, the full doc can be found in Google Drive.

Important previews:

Conclusions

As we limit the scope for the continuation of a forknet project, we hope to have a mocknet test that is:

We are actively trying to learn from our mistakes and avoid overengineering. We do not want to improve non-crucial tools, or tools that are not causing significant problems.

Our core values for this project can be summarised in two statements:

TLDR table

image

Roadmap table

image

Future plans

- @posvyatokum will focus on switching to the forknet approach in the mirror test setup.
- @marcelo-gonzalez will focus on supporting all types of nodes in the mirror test.

posvyatokum commented 6 months ago

2024-02-21 Meeting notes

CC: @marcelo-gonzalez @gmilescu @posvyatokum

We are focusing on pre-release testing of 1.37.0 (#10642). We will use results from #10581 to be sure that all mainnet nodes will be able to go through resharding without problems. This will close one of our 3 goals for the short term of this project:

Build confidence in resharding on mainnet

posvyatokum commented 6 months ago

2024-02-28 Update

CC: @marcelo-gonzalez @gmilescu @posvyatokum

Past week

For the past week we were focusing on testing the 1.37.0 release. We were actively developing tools for mocknet management:

- @marcelo-gonzalez discovered problems with resharding, using the restarting tool.
- @posvyatokum created pseudo-archival dbs for mirror testing.

Next week

- @marcelo-gonzalez will focus on testing the bug fixes for the resharding issue.
- @posvyatokum will continue working on a realistic node setup MVP. This includes:

Our first priority is to thoroughly test resharding before the mainnet release. We will lean towards incorporating forknet tools wherever possible, if it fits our timeline. We will document all steps taken, for easy test reproduction and future automation.

Progress overview

These tasks contribute to all Stage 1 goals:

At the end of the week we aim to complete 1.37.0 testing, and have an updated set of instructions for 1.38 testing. Testing improvements will come after we have transitioned to 5 shards on mainnet.

posvyatokum commented 6 months ago

2024-03-04 Update

CC: @gmilescu @marcelo-gonzalez @posvyatokum

Past week

For the past week we were focusing on testing the 1.37.0 release.

- @marcelo-gonzalez made sure that issues with resharding after node restart are fixed.
- @posvyatokum made sure that resharding works on split storage nodes.

Next week

- @marcelo-gonzalez will focus on helping @VanBarbascu test 1.38.0-rc.1 and on creating comprehensive documentation for the process.
- @posvyatokum will create a permanent mocknet for feature testing. This will make mocknet feature testing easier and faster for developers and decrease debugging time during release testing.

Progress overview

We achieved the goal of increasing confidence in the 1.37.0 release.

- @posvyatokum is in the process of making mocknet testing more accessible to engineers.
- @marcelo-gonzalez is in the process of creating a clearly established process for pre-release mocknet testing.

RoadMap adjustment

@khorolets raised the point of developing easy test result evaluation methods. Right now all of the work on automatic test evaluation is planned for Stage 3. Looking back, this seems like a very narrow timeline for an important feature. @posvyatokum will rethink the roadmap for test evaluation, with a focus on some POC in Stage 1. Accuracy of the POC solution is out of scope for Stage 1; we just need to enable developers to have some form of evaluation automation.

posvyatokum commented 5 months ago

2024-03-11 Update

CC: @marcelo-gonzalez @gmilescu @posvyatokum

Past week

@marcelo-gonzalez tested resharding of shard 2 in the 1.38 release.

Next week

- @marcelo-gonzalez will continue to test and support the 1.38 resharding release.
- @posvyatokum will hold a protocol discussion about forknet testing for developers. The goal of the discussion is to collect feedback and feature requests from engineers to adjust the project roadmap.
- @posvyatokum will create a new draft of the forknet testing instructions before the protocol discussion. It will not contain the commands that need to be executed, as they are subject to change, but will rather describe the process that engineers may go through when testing their feature. This document should help us have a more productive protocol discussion.

Progress overview

posvyatokum commented 5 months ago

2024-03-18 Update

CC: @marcelo-gonzalez @gmilescu @posvyatokum

Due to the high density of complicated releases, we have made no significant progress on this project. All plans from the previous update carry over to this week.

posvyatokum commented 5 months ago

2024-04-01 Stage 1 Update

CC: @marcelo-gonzalez @gmilescu @posvyatokum

Context

At the start of the project, we broke it down into three stages of increasing length. That allowed us to have concrete long term goals, while not over-focusing on implementing the final product right away. As a result, we prioritized immediate needs, and right now we feel like it is the right approach to this project. Thus, we will again restructure our roadmap in a way that creates some milestones and carves out some vision of the final product, but only gives full definition to the tasks for the next month.

Expectations for Stage 1 (March)

Goals

Tasks

Not done:

Ad-hoc implementation:

Conclusion

We focused on building the tools that we needed in the moment, and didn't prioritize polishing them and making them part of an established flow. This was mainly due to the extreme workload of the pre-release testing process, which required us to move fast and not break things.

Expectation for April

We were able to manually incorporate several new tools into the established mocknet flow. Now we need to make them a permanent part of the process. We will focus on creating an end-to-end MVP product tailored to testing the transition from stateful to stateless validation. We will decide on a concrete roadmap in the next Forknet meeting (planned for April 2nd). We should be mindful to distinguish between an MVP for the whole project and an MVP solution for this particular case, as the full solution requires a lot more automation and more areas of freedom for feature developers.

TLDR

In March we did a lot of ad-hoc things to support releases; in April we will create a mocknet CI for stateless validation.

marcelo-gonzalez commented 4 months ago

https://github.com/near/nearcore/pull/11034 implements the speedup of network startup. After this PR, we can make things faster by changing the way we set up the images so that ~/.near/setup/ contains a full NEAR home dir that has already had neard fork-network init and neard fork-network amend-access-keys run on it. The scripts should work the same way, with no big difference to the interface.
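
A rough sketch of the image preparation described in this comment, written as a helper that only builds the command lines rather than running them. The two `fork-network` subcommands and the `~/.near/setup/` path come from the comment; the helper name `fork_network_prep_commands` and passing the directory through `--home` are assumptions for illustration, so check them against the actual scripts before relying on this.

```python
import os


def fork_network_prep_commands(home="~/.near/setup", neard="neard"):
    """Build the commands that pre-process a NEAR home dir for an image.

    Assumption: the home dir is passed via `--home`. Only the two
    subcommands named in the comment are included.
    """
    home = os.path.expanduser(home)
    steps = ["init", "amend-access-keys"]
    return [[neard, "--home", home, "fork-network", step] for step in steps]


# Print the commands instead of executing them; a real script could hand
# each list to subprocess.run(cmd, check=True).
for cmd in fork_network_prep_commands():
    print(" ".join(cmd))
```

Running these steps once at image build time, rather than at every network start, is what would move the expensive work out of the test setup path.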