Open Sarah-NEAR opened 1 year ago
2023-11-27
2023-12-19
2024-01-26
The first release will work for the current state of mainnet, where nodes track all shards.
As mentioned in the previous update, we had no active progress on this project this week.
Over the past week we focused on enabling single shard tracking via GCS state sync, and DSS work was postponed.
We are now focusing on setting up a new connection to a peer for each part we need to download. We are also working on optimizing the state snapshot to reduce the strain on the memory due to compaction.
The rest of the plan until release includes:
Estimated effort is 1 month, so we continue planning to release it in 1.38.
Over the past week we decided that the node will manage the state sync connections in a new lightweight pool. These will be short-lived connections, closed after every part exchanged. On the requesting side, we will use raw connections without a handshake, and the first message will be a state sync header/part request.
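The connection lifecycle described above can be sketched as follows. This is a minimal illustration and not the nearcore implementation: `ShortLivedPool`, `FirstMessage`, and `Outcome` are hypothetical names, and networking is replaced by an in-memory map so the lifecycle rule is easy to see. The sketch assumes that a connection is dropped after its first message whether or not that message was a valid state sync request.

```rust
use std::collections::HashMap;

/// Hypothetical first message sent on a raw (handshake-free) connection.
#[derive(Debug, PartialEq)]
pub enum FirstMessage {
    StateHeaderRequest { shard_id: u64 },
    StatePartRequest { shard_id: u64, part_id: u64 },
    /// Anything that is not a state sync request.
    Other,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Served,
    Rejected,
}

/// Hypothetical lightweight pool that tracks short-lived connections.
pub struct ShortLivedPool {
    next_id: u64,
    open: HashMap<u64, String>, // connection id -> peer address
}

impl ShortLivedPool {
    pub fn new() -> Self {
        ShortLivedPool { next_id: 0, open: HashMap::new() }
    }

    /// Register an incoming connection; no handshake is performed.
    pub fn accept(&mut self, peer_addr: &str) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.open.insert(id, peer_addr.to_string());
        id
    }

    /// Handle the first (and only) message on a connection.
    /// The connection is removed from the pool in every case, so it never
    /// outlives a single exchange. Returns None for unknown connections.
    pub fn handle_first_message(&mut self, conn_id: u64, msg: &FirstMessage) -> Option<Outcome> {
        self.open.remove(&conn_id)?;
        Some(match msg {
            FirstMessage::StateHeaderRequest { .. } | FirstMessage::StatePartRequest { .. } => {
                Outcome::Served
            }
            FirstMessage::Other => Outcome::Rejected,
        })
    }

    pub fn open_count(&self) -> usize {
        self.open.len()
    }
}
```

In this sketch, calling `accept` followed by `handle_first_message` always leaves the pool empty again, which mirrors the "closed after every part exchanged" rule.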
We will increase the priority of fixing state snapshots, to make sure the fix lands in time to be shipped with DSS.
Implementing the state sync peer connection manager requires additional effort that we had not planned for. This will add two weeks to the timeline, translating into an estimated production date of end of March. The extended timeline may push this beyond the estimated 1.38 release date.
The rest of the plan remains the same:
During the previous week, our focus was on refining the connection between peers for requesting and serving parts. However, this work is still in progress as I was on call for the week.
While we haven't made any advancements in resolving the state snapshot compaction issue, we did identify another problem with the existing implementation. If compaction is enabled on the node, it will crash and fail to clear the generated state parts for the last epoch. Consequently, this leads to a gradual reduction in available storage space on the data partition over time.
As mentioned last week, implementing the state sync peer connection manager requires additional effort. We initially believed that raw connections are managed in a different pool, but this is not the case, so we also need to handle the lifetime of the connection on the serving side by implementing the TIER3 pool for short-lived connections. Based on alignment with Saketh, this will add two weeks to the timeline, translating into an estimated ready-for-production date of end of March. We are looking into the 1.38 release timeline to see what options we have, and we plan to allocate more engineering bandwidth in March to reduce the additional time.
The remaining work items are:
@VanBarbascu will this be included in the 1.38 release or not?
2024-03-03
Last week we completed implementation of the connection handling on the server side.
This week we plan to implement rate limiting for incoming connections and to establish connections for requesting parts from peers. Additionally, we will begin work on the fix for the memory leak that occurs during state snapshots.
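For readers unfamiliar with rate limiting of incoming connections, one common approach is a token bucket. The sketch below is only an illustration of that general technique; the update does not say which scheme will be used. `TokenBucket` and its parameters are hypothetical, and time is passed in explicitly as milliseconds so the behaviour is deterministic.

```rust
/// Hypothetical token bucket: each accepted connection consumes one token,
/// and one token is restored every `refill_interval_ms` milliseconds.
pub struct TokenBucket {
    capacity: u64,
    tokens: u64,
    refill_interval_ms: u64,
    last_refill_ms: u64,
}

impl TokenBucket {
    /// Create a full bucket. `refill_interval_ms` must be non-zero.
    pub fn new(capacity: u64, refill_interval_ms: u64, now_ms: u64) -> Self {
        assert!(refill_interval_ms > 0, "refill interval must be non-zero");
        TokenBucket { capacity, tokens: capacity, refill_interval_ms, last_refill_ms: now_ms }
    }

    /// Returns true if an incoming connection arriving at `now_ms` may be
    /// accepted, and false if it should be refused.
    pub fn try_accept(&mut self, now_ms: u64) -> bool {
        // Restore tokens for every whole interval elapsed since the last refill.
        let elapsed = now_ms.saturating_sub(self.last_refill_ms);
        let new_tokens = elapsed / self.refill_interval_ms;
        if new_tokens > 0 {
            self.tokens = self.tokens.saturating_add(new_tokens).min(self.capacity);
            self.last_refill_ms += new_tokens * self.refill_interval_ms;
        }
        if self.tokens > 0 {
            self.tokens -= 1;
            true
        } else {
            false
        }
    }
}
```

With a capacity of 2 and a 100 ms refill interval, two connections arriving together are accepted, a third arriving 50 ms later is refused, and one more is accepted once 100 ms have passed.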
DSS project timeline remains unchanged compared to last week: code complete and testing done by the end of March. Since the nearcore release schedule is moving back to strict timelines, we will miss 1.38 and we plan to release DSS on mainnet with version 1.39 (branch cut 2024-04-15).
Remaining work includes:
2024-03-11
Refactoring of the state sync components is currently in progress. We're in the process of integrating the new peer selection mechanism, which relies on state snapshot host gossip. Regarding the status of the state snapshot fix, it's being actively worked on.
The focus on DSS will be reduced this week until we roll out resharding on mainnet. As a result, the timeline is expected to be extended by approximately 1.5 weeks, pushing the estimated completion date to the first week of April.
Remaining work includes:
2024-04-01
We fixed the state snapshot size bug and now the snapshot is no longer prohibitively expensive to keep. Due to the release schedule changes, we reduced the focus on DSS and shifted toward addressing mainnet congestion.
We will resume work on DSS next week, with an estimated completion date of end of April.
Is this still accurate?
Related issue: #12004 Project doc: LINK
Goals
Background
State Sync equips validator nodes with State data they need in order to produce blocks. Without it, nodes need to get the state data from outside the chain (e.g. from an S3 snapshot) and constantly spend effort to keep the state up to date with the chain. Decentralised State Sync is the second part of the effort, building a data sharing overlay between the network nodes. It provides a scalable and decentralised way of transferring state parts between nodes.
Why should NEAR One work on this
State Sync unblocks two features:
What needs to be accomplished
Main use case
Links to external documentations and discussions
Additional resources will be added here when they become available.
Estimated effort
Engineers assigned: @VanBarbascu, @marcelo-gonzalez and @saketh-are.
The initial effort estimate is about 6-8 PM (person months). The remaining effort is tracked in the latest comment on this issue.
Assumptions
There are no specific assumptions that this project is making.
Pre-requisites
N/A
Out of scope
N/A
Task list: