Use master branch for development

near / nearcore

Reference client for NEAR Protocol

https://near.org

GNU General Public License v3.0

2.32k stars 623 forks source link

Use master branch for development #2165

Closed frol closed 4 years ago

frol commented 4 years ago

Currently, we keep master as our stable branch, so a user who wants to play with nearcore node, can join testnet without a need to learn about other branches. However, we have been very aggressive about evolving our protocol (i.e. adding new features and experiments, which touched the whole codebase, e.g. Nightshade, DoomSlug), and obviously, these efforts were not completed in a single day, so naturally, git conflicts would occur, so we decided to use staging branch, where we would merge code that worked good enough for development but had naturally been lacking backward compatibility and stress-testing. Thus, we ended up with the process of forking from staging and releasing to master from staging once in a while.

We had never intended to stick to this process forever, and as we are working on the proper release process right now (https://github.com/nearprotocol/nearcore/projects/3), we should eventually arrive into the state where we can fork from master and create separate branches for our releases (nightly / beta / stable).

NOTE: The current story around master and staging has already confused our external contributors: https://github.com/nearprotocol/nearcore/pull/2153, so we should not let this happen in the future.

damons commented 4 years ago

I agree that using master for stable releases is awkward. Personally, when I visit any given open source github repo, I expect master to be the bleeding edge of development. One of the first things I look for is recent activity. I want to know if the project is active or not. If I see that there hasn't been a commit in a few months, I immediately become wary that the project is not actively supported or developed.

I support the idea of using master as the place for active development and creating separate branches for our releases as nightly, beta, and stable. Production releases should also likely have specific branches and tags which allow anyone to check out a specific product release version (same for nightly and beta, too).

Obviously, we should strive for stable releases to not include regressions or backwards incompatibilities. Master should attempt to adhere to this policy as much as possible. Using master for active development will place the right kind of pressure on contributors to not break things simply because master will likely be the first thing anyone checks out.

I'd like to see the conversation here expand to include the naming convention we'd like to use for these branches and releases.

damons commented 4 years ago

Bo, Bowen, and I have been working through our release process which, today, is essentially rebuilding and lightly testing staging then merging it into master. In our live/offline discussions, happening in person, we are proposing the following:

For the next release (i.e., next week), we propose switching tip development (i.e., the default branch of nearcore) to be master and actually use master in that way.

Today, the default branch in github is indeed master, but any developer wanting to do active work that is not ready for testnet/beta/production will switch to the staging branch after doing a checkout. Once they are ready to check in/merge, their changes are checked into staging where we ideally would be testing everything, and then every once in a long while we merge those changes into master. This process is what will change.

Before this can happen, we will need to wait until master is stable using our latest code from staging. Once this happens, we will then create beta and stable branches. We will populate those branches with our master branch, therefore starting fresh.

There are several steps to make this real. Here's just a few off the top of my head:

Documentation on how everyone test drives NEAR must be updated for nearprotocol/nearcore to use the stable branch which will always work with a default network that has been tested.
Create a new network called devnet for development efforts. This will have a lower stability requirement than our current testnet. This will be the default network for the master branch.
We begin producing weekly beta builds (note that we change the name from staging to beta). We should add a new network called 'betanet'.
The name for the stable branch network will eventually be mainnet once we reach that level of stability; however, today, until mainnet is ready for launch, since our users are used to pointing at testnet, we propose we keep using testnet until mainnet is ready for production. We shouldn't claim stability for a network that isn't stable yet.
Ideally, when a NEAR server is started, it will check to see which branch it was built from and automatically select the correct network to use. However, this may require us updating our scripts and configuration tools to specify a network on the CLI and/or automatically configure multiple networks in the ~/.near directory. This should likely discussed in another issue, but it should happen quickly to make sure the proper networks are used out of the box after checkout.

vgrichina commented 4 years ago

As I understand there is some applayer part of transition:

Deploy new networks including corresponding web stuff (this might be good time to move all config to render.yml)
Have corresponding ci- networks for running integration tests
Update default configs in create-near-app and near-shell

@ailisp wonder if you want to look into these? Might be good idea to have someone other than me do it to increase the bus factor.

ilblackdragon commented 4 years ago

So here are few questions + thoughts to this and #2174:

Do we want to allow people who check out master to join TestNet / MainNet?
- If yes, there are two options:
  - (1) master is the most recent stable / MainNet release. And we only update it when we hard fork our network (current approach). Merging to master is major release. This should be tested ~14 days of replaying transactions, bombarding, security reviews, etc.
  - (2) master has latest code, but we maintain backward compatibility. Meaning we have CI that spins up new node, connects to the stable TestNet (that replicates MainNet), replays the chain and runs suite of tests. Without passing this, we can not merge into master. This should probably take ~1 day at least of testing.
- (3) If no, we can have bleeding edge in master with less strict rules on merging. We can ban running start_mainnet.py and start_testnet.py in the master branch (they bork and say "You are on a wrong branch, switch to blah branch"). Note, that if we don't do that, the risk is that someone will run slightly different protocol that will get them slashed on MainNet. Moving from master to beta can take ~1 day of testing toward beta network and ~14 days of testing on beta network replaying production and stress testing loads to stable.

Previously we were targeting to go with (1) I think @damons you suggesting to go with (3) -- I don't mind just want to make sure we understand all things coming with it.

E.g. we need to make sure then that even if someone tries to run binary from master toward stable / and later MainNet - it should always fail with helpful error.

Toward #2174: a lot of "features" will be hard to disable with feature flags.

For example state without trie #2050 - even with somewhat scoped change, it still changes some APIs across other usages. Or even something simple as 170 lines biasable randomness PR #2177 is might be possible but would require flags on the data structure to ignore new field (because otherwise that's a hard fork of a feature) and 16 files of flags with prob multiple places in each file. I'm not even talking about the unbiasable randomness PR that is 1.5k+ lines across probably all chain components.

We have discussed upgradability as well, and @nearmax was suggesting to use different versions of the same cargo package in the same binary and change which one is used via runtime. I'm not 100% sure how that would work, but we can explore that as an alternative to feature flags and also allow us to upgrade network without downtime.

@damons In your latest table I'm not 100% sure if master is latest pushed code or latest "beta" and dev is latest pushed code?

Not exactly sure from your comment also the flow from branch to branch, I was expecting something like this:

master - latest pushed code, can contain hard fork from current staging and stable, NEVER runs with TestNet and MainNet networks (refuses to do so to prevent accidental slashing and network splits). Runs suite of tests in background after PR submitted, and if there are regressions - we rollback that PR / PRs. Suite of tests, include spinning up large network of nodes, replaying production load of transactions and also stress testing with set of contracts & transactions.
every week we push upgrade latest fully tested master to staging network, which is our TestNet. This is binary that runs with TestNet. This has people testing it continuously: validators, developers, us, both because this is "free" network and all of our tooling supports it well. Additional testing / security reviews happen at this point as well. Tooling updates also "stage" here and test that everything works.
every month we have release into stable, which is MainNet release. This includes updates recorded in the spec release 1 month prior and any other relevant bug fixes and changes.
additional hot fixes and security fixes can be fast tracked into stable, but require additional process of approving & defining how that will be tested.

lexfrl commented 4 years ago

(3) If no, we can have bleeding edge in master with less strict rules on merging. We can ban running start_mainnet.py and start_testnet.py in the master branch (they bork and say "You are on a wrong branch, switch to blah branch").

To prevent accidents we can introduce protocol_version field in the genesis file (and if some feature changes protocol, protocol_version must be incremented). On a network level we have a check already witch prevents nodes with a different genesis to connect to each other.

SkidanovAlex commented 4 years ago

@nearmax how hard would it be to make macro magic that has annotations like this:

struct Version(u64); // could be something more sophisticated, as long as it can be compared

#[derive(BorshSerialize, BorshDeserialize)]
struct Approval {
    pub parent_hash: CryptoHash,
    pub reference_hash: Option<CryptoHash>,

    #[borsh_version_ge(8)] // <-- *something like this*
    pub some_new_field: CryptoHash,
}

And then have serialize / deserialize take a version as an argument?

ailisp commented 4 years ago

As I understand there is some applayer part of transition:
* Deploy new networks including corresponding web stuff (this might be good time to move all config to `render.yml`)

* Have corresponding `ci-` networks for running integration tests

* Update default configs in create-near-app and near-shell
@ailisp wonder if you want to look into these? Might be good idea to have someone other than me do it to increase the bus factor.

Sure, after switch to master for development I can help setup these

ailisp commented 4 years ago

@ilblackdragon

Runs suite of tests in background after PR submitted, and if there are regressions - we rollback that PR / PRs

Current issue: we have many failed nightly tests, I think need them all green for a few days to consider both code and tests stable enough, otherwise too many rollback Potential issue: rollback more than one PR possibly block several team members, I'd argue it's better to wait background test of previous merge success, before merge another PR

ailisp commented 4 years ago

(3) If no, we can have bleeding edge in master with less strict rules on merging. We can ban running start_mainnet.py and start_testnet.py in the master branch (they bork and say "You are on a wrong branch, switch to blah branch").

To prevent accidents we can introduce protocol_version field in the genesis file (and if some feature changes protocol, protocol_version must be incremented). On a network level we have a check already witch prevents nodes with a different genesis to connect to each other.

Good point. I think we have protocol_version, but didn't increase it for a while

damons commented 4 years ago

(3) If no, we can have bleeding edge in master with less strict rules on merging. We can ban running start_mainnet.py and start_testnet.py in the master branch (they bork and say "You are on a wrong branch, switch to blah branch"). Note, that if we don't do that, the risk is that someone will run slightly different protocol that will get them slashed on MainNet. Moving from master to beta can take ~1 day of testing toward beta network and ~14 days of testing on beta network replaying production and stress testing loads to stable.

Previously we were targeting to go with (1) I think @damons you suggesting to go with (3) -- I don't mind just want to make sure we understand all things coming with it.

Yes. (3) is what I'm proposing.

@damons In your latest table I'm not 100% sure if master is latest pushed code or latest "beta" and dev is latest pushed code?

Not exactly sure from your comment also the flow from branch to branch, I was expecting something like this:

master - latest pushed code, can contain hard fork from current staging and stable, NEVER runs with TestNet and MainNet networks (refuses to do so to prevent accidental slashing and network splits). Runs suite of tests in background after PR submitted, and if there are regressions - we rollback that PR / PRs. Suite of tests, include spinning up large network of nodes, replaying production load of transactions and also stress testing with set of contracts & transactions.

every week we push upgrade latest fully tested master to staging network, which is our TestNet. This is binary that runs with TestNet. This has people testing it continuously: validators, developers, us, both because this is "free" network and all of our tooling supports it well. Additional testing / security reviews happen at this point as well. Tooling updates also "stage" here and test that everything works.

every month we have release into stable, which is MainNet release. This includes updates recorded in the spec release 1 month prior and any other relevant bug fixes and changes.

additional hot fixes and security fixes can be fast tracked into stable, but require additional process of approving & defining how that will be tested.

Yes, with a few slight modifications:

master is lastest pushed code. We're creating a new network for development work: devnet. Should be done by EOD tomorrow (Wed. Feb 26th).

beta branch is code promoted weekly (Wednesdays) from master. We are creating a betanetfor this branch. We will no longer use staging_testnet. EOD tomorrow.

stable branch is code promoted monthly from beta branch. Stable uses testnet until mainnet is launched. Stable will then begin using mainnet after launch.

Documentation and scripts are being updated so that anyone who wants to run a node will receive instructions to checkout the stable branch first before doing a build or running a node. Today, the documentation instructs anyone wanting to run a node to checkout the staging branch first (which often fails today, See: #2044 ). We are updating the documentation to instruct them to use stable instead, which will point to testnet per above. See: #2192.

ailisp commented 4 years ago

master is lastest pushed code. We're creating a new network for development work: devnet. Should be done by EOD tomorrow (Wed. Feb 26th). beta branch is code promoted weekly (Wednesdays) from master. We are creating a betanet for this branch. We will no longer use staging_testnet. EOD tomorrow. stable branch is code promoted monthly from beta branch. Stable uses testnet until mainnet is launched. Stable will then begin using mainnet after launch.

Sounds good to me. We have stable&master branch now. Tomorrow will create beta branch and first release from that :)

bowenwang1996 commented 4 years ago

Beta branch is already created by the way.

ilblackdragon commented 4 years ago

@ailisp @bowenwang1996 we just had discussion about this (see more details https://commonwealth.im/near/proposal/discussion/360-consistency-between-downloaded-state-github-branches-and-the-actual-networks) and there are few questions that popped up:

How actual release is happening, specifically steps to go from PR that merges into beta, to state dump, to updated genesis state to everyone in the network upgraded to the new code and new participants join the same network without issues. Same for stable.
- How are this release qualified. Specifically, before merge into beta is qualified, we should run all "nightly" tests and they should pass. If there is regression, we should be fall back into latest stable "nightly" commit.
- But this also includes how we are going to reset state from active network and restart all the nodes at the same time.

Didn't find where this is spec-ed out.

frol commented 4 years ago

@ilbackdragon these are essential questions, but I feel that they do not belong to this particular issue. Let’s have another tracking issue for the raised concerns

bowenwang1996 commented 4 years ago

@ilblackdragon I thought we will run devnet and betanet internally. If we allow external people to join it is much harder to automate things. For testnet release yes there needs to be some coordination. I somehow forgot about this and will think more about it. For how releases qualify, please checkout https://commonwealth.im/near/proposal/discussion/344-release-process

ilblackdragon commented 4 years ago

Closing this issue as @frol is right. Let's keep discussion to commonwealth.