The Fate of Medalla

As discussed on our latest call, we need to decide what to do with Medalla.

Two primary options:

1. Let it die as we approach mainnet, replacing it with a new v1.0 testnet
2. Keep it running, and upgrade it to the v1.0 release (BLSv4, discv5.1, maybe fork the state to "mainnet" v1.0 constants)

Arguments for (1)

With respect to validator breakdown, Medalla is primarily run by the community. As mainnet approaches, I expect most community members to have little appetite to consistently run two nodes, and thus the largely community composed testnet is in for a number of rough leak periods. Validators cannot be activated during a leak so this might diminish the quality of service the testnet provides for the community. (We are currently seeing such a leak and expect finality at end of October. That said, we could easily see another leak or two in the coming months).
Forking can be difficult. This is not only wrt the underlying technicals, but also wrt community/user coordination. Given mainnet is imminent, asking existing Medalla users to coordinate software upgrades might (1) distract from their mainnet preparations and (2) might result in a low percentage actually upgrading to the fork. This would lead to another leak. Additionally, there is technical overhead to upgrade each of the fork components. While good practice, it will take significant developer resources.
A clean reset when most of the community migrates to mainnet will give us a chance to run the majority of validators on the new network to provide a much higher quality of service to those that need a testnet for testing. Additionally, we can consider "testnet" features that might make keeping the quality of service higher in the first place. For example, we can have a set of master exit keys that are allowed to submit exits for any validator. We can then daily sweep validators that haven't been online in some time period and submit exits for them.
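The daily sweep described in the last argument could look roughly like the sketch below. It only covers the selection step (which validators to exit), not the signing or submission with a master exit key. The names `ValidatorStatus`, `select_for_exit` and the one-week threshold are hypothetical and not part of any spec or client.

```python
from dataclasses import dataclass

# 32 slots per epoch * 12 seconds per slot = 384 s, so 225 epochs per day.
EPOCHS_PER_DAY = 225
# Hypothetical threshold: exit validators that have been offline for a week.
MAX_OFFLINE_EPOCHS = EPOCHS_PER_DAY * 7


@dataclass
class ValidatorStatus:
    index: int
    last_attested_epoch: int
    exited: bool = False


def select_for_exit(validators, current_epoch, max_offline=MAX_OFFLINE_EPOCHS):
    """Return indices of non-exited validators that have been offline too long."""
    return [
        v.index
        for v in validators
        if not v.exited and current_epoch - v.last_attested_epoch > max_offline
    ]


validators = [
    ValidatorStatus(index=0, last_attested_epoch=9_990),   # recently online
    ValidatorStatus(index=1, last_attested_epoch=5_000),   # long offline
    ValidatorStatus(index=2, last_attested_epoch=1_000, exited=True),
]
# The sweep job would then submit an exit for each returned index.
to_exit = select_for_exit(validators, current_epoch=10_000)
```

How "online" is measured (last attestation, last proposal, balance trend) is an open design choice; last attestation is used here only because it is simple to state.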

Arguments for (2)

We're going to have to do forking upgrades sooner rather than later so it could be good practice for devs and community. This is a practice in both technicals and coordination. The technical upgrade can either be just discv5.1 and BLSv4, or it can also include a migration to v1.0 configuration (e.g. alter some constants, expand some arrays in beacon state, etc). If we do this part of the upgrade, I would recommend doing a "re-genesis" at this fork state and not serve blocks/states prior to the fork. If we did that, it would mean that Medalla does not satisfy the following goal
Medalla has a bunch of syncing data so we can have a solid place to test more stressful syncing when mainnet goes live. It has been expressed by some devs that this is an asset to the development and release process and starting mainnet without the asset in place

My thoughts

I probably showed my bias in the arguments above. I personally lean toward not upgrading Medalla and letting it degrade in mid to late November.

In the next couple of months, there will be a high coordination event (mainnet genesis) going on that I think a testnet fork will distract from, on both the technical and community fronts. Additionally, if we upgrade to v1.0 configuration, I would recommend doing a "regenesis" event with a single script migration rather than maintaining legacy states/code paths for the testnet. If we do that, it defeats the syncing data argument.

Instead, we could keep the Medalla config around and just upgrade to discv5.1 and BLSv4. BLSv4 could be quietly upgraded anytime because 0 pubkey is not currently on that testnet, but this could change anytime. As for discv5.1, there are two options: (a) do a catdog-style dual table or (b) a migration at a discrete epoch. (a) would help reduce the coordination overhead on the upgrade but would require significantly more development and maintenance. (b) would require a single epoch of coordination and the majority of the network to upgrade their nodes.
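As a rough illustration of option (b), a node would pick its discovery protocol purely from the current epoch, so every upgraded node switches at the same boundary. The fork epoch value and the `"legacy"` label for the pre-upgrade protocol are placeholders, not agreed parameters.

```python
# Hypothetical fork epoch; the real value would be agreed on by client teams.
DISCOVERY_FORK_EPOCH = 20_000


def discovery_version(current_epoch, fork_epoch=DISCOVERY_FORK_EPOCH):
    """Option (b): switch discovery protocol at a single, fixed epoch."""
    return "discv5.1" if current_epoch >= fork_epoch else "legacy"


before = discovery_version(DISCOVERY_FORK_EPOCH - 1)
after = discovery_version(DISCOVERY_FORK_EPOCH)
```

Nodes that have not upgraded by the fork epoch keep speaking the old protocol and stop finding upgraded peers, which is why (b) depends on most of the network upgrading in time. Option (a) avoids that cliff by serving both tables at once, at the cost of extra code to maintain.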

My ideal: Starting in November, we should begin standing up some private v1.0 testnets to make sure everything plays nicely. If, in mid-November, we have a net we are happy with, we can keep it running and open it up to the public to serve as a new public testnet (but with less fanfare pushing the community to join). Instead, it would just be a static resource for when people need it. We could keep Medalla around until the end of the year to have a place to test longer syncs, but eventually let Medalla die in favor of the new testnet that would then have more than a month of sync data.

A lot of the decision on what to do probably comes down to dev resourcing. The various paths have different engineering/time requirements and also have implications for longer term maintenance.

Client teams care to chime in?

From Prysmatic Labs: we are leaning towards abandoning Medalla for a new testnet with v1.0.0 and dropping legacy pre-v1.0.0 code/params.

Arguments for (1)

With respect to validator breakdown, Medalla is primarily run by the community. As mainnet approaches, I expect most community members to have little appetite to consistently run two nodes, and thus the largely community composed testnet is in for a number of rough leak periods. Validators cannot be activated during a leak so this might diminish the quality of service the testnet provides for the community. (We are currently seeing such a leak and expect finality at end of October. That said, we could easily see another leak or two in the coming months).

Client teams are only running a small percentage. If the community is less interested in running validators, then client teams ought to run more validators. At Prysmatic Labs, we only run 1038 validators but would be willing to run a larger quantity of keys. Besides, leaks are good data to stress test clients and find bugs that would otherwise never be discovered in a testnet that only takes the happy path.

Forking can be difficult. This is not only wrt the underlying technicals, but also wrt community/user coordination. Given mainnet is imminent, asking existing Medalla users to coordinate software upgrades might (1) distract from their mainnet preparations and (2) might result in a low percentage actually upgrading to the fork. This would lead to another leak. Additionally, there is technical overhead to upgrade each of the fork components. While good practice, it will take significant developer resources.

Clients will need to support forks. In fact, it has been brought up as a community concern that ETH2 does not know how to conduct a hard fork / upgrade. See the background section of this first draft fork proposal document: https://hackmd.prylabs.network/L_HINu8iRrOkwh7I2Zf15Q?view

A clean reset when most of the community migrates to mainnet will give us a chance to run the majority of validators on the new network to provide a much higher quality of service to those that need a testnet for testing. Additionally, we can consider "testnet" features that might make keeping the quality of service higher in the first place. For example, we can have a set of master exit keys that are allowed to submit exits for any validator. We can then daily sweep validators that haven't been online in some time period and submit exits for them.

We'd need some clarification on this idea. With some assumptions, here are some thoughts:

In my opinion, the test network should mimic the production environment. Functionality that supports master keys to submit exits for any validator has the potential to leak into production and does not represent a mainnet environment. If Medalla finality stops due to lack of interest, the uninterested validators will eventually be kicked out such that the testnet regains finality. Then client teams can send as many deposits as they would like to run and therefore run a higher percentage of the network. In retrospect, client teams should have maintained a higher percentage of the network anyway, sending more deposits as the network scaled up.

Arguments for (2)

We're going to have to do forking upgrades sooner rather than later so it could be good practice for devs and community. This is a practice in both technicals and coordination. The technical upgrade can either be just discv5.1 and BLSv4, or it can also include a migration to v1.0 configuration (e.g. alter some constants, expand some arrays in beacon state, etc). If we do this part of the upgrade, I would recommend doing a "re-genesis" at this fork state and not serve blocks/states prior to the fork. If we did that, it would mean that Medalla does not satisfy the following goal

Prysm has already upgraded to discv5.1 in Medalla with the help of Proto's catdog bootnode. BLSv4 should be no factor as well. However, the v1.0 configuration changes would require much more careful consideration. Having the code ready for a hardfork makes clients more prepared for mainnet, although we do see this as a potential risk for delay of an imminent phase 0 release.

Medalla has a bunch of syncing data so we can have a solid place to test more stressful syncing when mainnet goes live. It has been expressed by some devs that this is an asset to the development and release process and starting mainnet without the asset in place

I think syncing data is critical to mainnet success. Imagine a weak subjectivity sync bug that doesn't manifest until there is data larger than the weak subjectivity sync period. If the testnet does not have a lead time on mainnet, the bug will be discovered in mainnet first with little opportunity to correct it before users face outages or losses.

Additional thoughts

Perhaps there is a path forward where Medalla can be kept alive and hardforked after mainnet, if the risk of launch delay is too great.

Support Medalla with discv5.1, BLSv4, and everything except spec config changes.
Start another testnet to test spec v1.0.0 testnet changes. A la "dress rehearsal", maybe run longer than a few days.
Fork / upgrade Medalla at client teams' earliest convenience, even post mainnet launch.

This proposal still requires technical effort to implement and more resources to support/monitor two testnets for a period of time. The trade off that we propose is to delay Medalla hard fork functionality until after mainnet genesis, but also to keep Medalla running indefinitely.

Another argument for (1): maintaining Medalla requires keeping pre-v1.0.0 code / config around indefinitely. The benefits of option 2 may not outweigh this technical burden in the long term, i.e. option 1 allows us to delete legacy code before mainnet.

This post was written with support from @terencechain and @rauljordan.

As an attempt to represent the community perspective: There is some attachment to Medalla and the desire to see it finalize again, but in the long run it's a battle that isn't worth continuing to fight. While I preferred keeping Medalla alive before reading today's issue, this argument completely changed my mind:

For example, we can have a set of master exit keys that are allowed to submit exits for any validator. We can then daily sweep validators that haven't been online in some time period and submit exits for them.

This essentially means that interested participants from the community can use the new testnet as a true sandbox without any concern for causing loss of finality. With this suggestion I feel like the community can be very excited about joining a new testnet.

I think we should do both^(*).

We should keep Medalla around but degrade it to a devnet. We should not further encourage users to deposit Goerli ETH to Medalla to become validators. However, we should keep it to further monitor the inactivity leak, and eventually we can use it to test a first protocol upgrade, here: the v1.0.0-rc.0 spec. Upgrading Medalla to v1.0 allows client developers to have full spec compatibility between Medalla, an eventual mainnet candidate and any further devnets that will follow.

That said, I believe that as clients and the spec stabilize, we should still consider abandoning Medalla, even if we fork it to v1.0, to have a smaller, cleaner, long-standing testnet after mainnet launch. Such a network could be launched in parallel or slightly after the mainnet genesis with a similar spec.

^(*) OK, the above is only what I would recommend. However, I also see the need for a functioning testnet for the community (testing staking pools etc.) as well as the limited availability of client teams to maintain Medalla just long enough to even attempt a protocol upgrade. So the path of least resistance would be (1).

Experience with ETH1 is that coordinating a hard fork on a testnet takes a lot of time and effort from the client development teams and others like the cat herders to pull off. ETH2's limited experience suggests that will hold true as well with issues in the Spadina testnet because client teams didn't put enough coordination effort in upfront and confusion in the community with Zinken and Medalla. So I don't see that it's viable to hard fork Medalla and get a MainNet launch this year - both will require too much effort and communication so will need to happen further apart.

Medalla isn't really useful for users to test with at the moment because the lack of finalisation means new validators can't be onboarded. If we did a hard fork, we'd almost certainly lose some more validators and likely lose finalisation again, so it's unlikely that Medalla can provide a stable 1.0 spec testnet anytime soon.

And Medalla as currently configured is far too susceptible to people registering large numbers of validators and not running them. ETH1 PoW testnets have been plagued with similar issues with hash power coming and going constantly leading to Clique being developed to provide a more controlled, reliable testnet. So even if we got Medalla back to finalising it would likely go through these long non-finalization periods repeatedly because of lack of interest or deliberate griefing.

So I don't see option 2 as likely to result in a usable testnet anyway, and it puts the MainNet release in jeopardy. Option 1 seems like the best option we have, but it will be difficult to time a new testnet so that it does not cause confusion with the MainNet launch.

And Medalla as currently configured is far too susceptible to people registering large numbers of validators and not running them. ETH1 PoW testnets have been plagued with similar issues with hash power coming and going constantly leading to Clique being developed to provide a more controlled, reliable testnet.

Been considering this. The easiest equivalent I can think of would be a modified deposit contract that contains a per-sender allowance of deposits (with empty value being 0 to avoid sybil attacks). If someone wants to join the set of validators they can request (somewhere; web page, discord bot?) and be given a small allowance, which could be increased over time if the user is operational long-term. This would give us some control over who is validating for the testnet, and not require any changes to the Ethereum 2 codebase to achieve.
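To make the allowance idea concrete, here is a small Python model of the bookkeeping such a contract might do. It is a sketch of the logic only, not contract code, and the class, method and account names are all made up for illustration.

```python
class AllowanceDepositContract:
    """Toy model of a deposit contract with a per-sender deposit allowance."""

    def __init__(self, admin):
        self.admin = admin
        self.allowance = {}  # sender -> remaining deposits; missing means 0
        self.deposits = []   # accepted (sender, pubkey) pairs

    def set_allowance(self, caller, sender, count):
        # Only the admin (e.g. a faucet or bot account) may grant allowances.
        if caller != self.admin:
            raise PermissionError("only the admin may set allowances")
        if count < 0:
            raise ValueError("allowance must be non-negative")
        self.allowance[sender] = count

    def deposit(self, sender, pubkey):
        # Unknown senders default to 0, so fresh addresses cannot deposit.
        remaining = self.allowance.get(sender, 0)
        if remaining == 0:
            raise ValueError("sender has no deposit allowance")
        self.allowance[sender] = remaining - 1
        self.deposits.append((sender, pubkey))


contract = AllowanceDepositContract(admin="faucet")
contract.set_allowance("faucet", "alice", 2)
contract.deposit("alice", "pubkey-a1")
contract.deposit("alice", "pubkey-a2")

try:
    contract.deposit("alice", "pubkey-a3")  # allowance is used up
    rejected = False
except ValueError:
    rejected = True
```

Because the default allowance is 0, creating many new sender addresses gains nothing, which is the sybil protection mentioned above.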

Yeah I had something similar in mind in terms of the deposit contract. In terms of the approval side, an automated system could allow any validator that provided a pre-signed voluntary exit message. The exit message would just be kept on file but could be sent if the validator was offline for some period (or manually if someone spammed it). I just worry if it varies the validator on-boarding process too much (though having people know how to exit isn't a bad thing).

I just worry if it varies the validator on-boarding process too much (though having people know how to exit isn't a bad thing).

Tricky. Perhaps we can compromise, so that anyone spinning up a small number of validators (<8?) doesn't have to supply the signed exit messages. That would allow users who aren't likely to be able to influence the network a much more familiar path through validating (we could tie the allowance setting transparently into the Goerli Ether faucet, for example) whilst still protecting the network from those who can do it harm.
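The compromise above amounts to a simple approval rule, sketched below. The limit of 8 comes from the tentative "<8?" in this thread and is not a settled value; `may_onboard` is a hypothetical helper name.

```python
# Tentative limit taken from the "<8?" suggestion; not a settled value.
EXIT_FREE_LIMIT = 8


def may_onboard(requested_validators, signed_exits_supplied):
    """Small operators need no exits; larger ones need one per validator."""
    if requested_validators < EXIT_FREE_LIMIT:
        return True
    return signed_exits_supplied >= requested_validators


small_operator_ok = may_onboard(3, 0)
large_without_exits_ok = may_onboard(8, 0)
large_with_exits_ok = may_onboard(8, 8)
```

Someone could split a large request across several small ones, so this rule only helps if the allowance is tracked per operator rather than per request.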

How can I get involved?

goerli / medalla