mito-ds / mito

The mitosheet package, trymito.io, and other public Mito code.
https://trymito.io

More granular deployment #112

Closed aarondr77 closed 1 year ago

aarondr77 commented 2 years ago

Describe the request

Currently, we have a mix of big technical changes that have a higher probability of introducing bugs. For these PRs we do thorough testing that takes almost a week to complete: doing sanity checks, hiring Upwork users, reviewing their videos, doing workflows ourselves, and addressing bugs.

At the same time, we have a lot of high-impact, small fixes that we don't need to test so thoroughly - for example, handling new file encodings and fixing sheet-crashing errors.

I suspect this trend will continue as we focus on new features + special deployments while also consistently staying on top of robustness. In fact, it might intensify as we iterate on final UI implementations with feedback from active users and Upwork users testing Mito through test-pypi.

With the new retention tracking, it's really important that we get as many users as possible each month onto the best version of Mito. Waiting a week to deploy a bug fix means that hundreds more users have the potential to be impacted by that bug, and those users are part of our retention goal.

All of that is to say, it would be nice if we designed our deployment procedure to allow us to deploy low-risk PRs quickly, while also allowing us the time required to thoroughly test high-risk PRs.

Potential Solutions

naterush commented 2 years ago

Thanks for the issue! I think this is a great thing to be thinking about - improving the process by which we get code to our users is always beneficial, so really happy to hear your thoughts here.

I've been thinking about our deployment a lot recently. Here is a fun, wacky model I developed (because I like developing formalisms) that helps explain my perspective.

A Toy Model of Deployment

We have an application we deploy changes for. There are two types of changes:

  1. Big. Big requires a lot of testing across the entire app, as some fundamental assumptions have changed.
  2. Small. Small doesn't require much testing, because the changes are small.

For any specific change, we notate it as Big_10_13. The first subscript 10 is the day the change was "finished" (aka, merged into dev), and the second subscript 13 is the day it was deployed to our users (aka, merged into main). We can get these days with the helper functions get_day_finished and get_day_deployed respectively.
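
To make the notation concrete, here's a minimal Python sketch of what a change and the two helper functions could look like (the class and field names are my own shorthand for this comment, not anything in our codebase):

from dataclasses import dataclass
from typing import List  # used by the helpers further down

@dataclass
class Change:
    size: str             # "Big" or "Small"
    day_finished: float   # day the change was merged into dev
    day_deployed: float   # day the change was merged into main

def get_day_finished(change: Change) -> float:
    return change.day_finished

def get_day_deployed(change: Change) -> float:
    return change.day_deployed

# Big_10_13 in the notation above would be:
big_10_13 = Change("Big", 10, 13)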

Change sequences

One fundamental aspect of these changes is that they are constantly rolling out. Since we're interested in matters of velocity, we want some notion of a sequence of changes. Let's define an arbitrary Changes sequence as:

Changes = Big_0_1, Small_10_15, Small_14_17, ...

We order a change sequence by when the changes were deployed.
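
With the sketch above, that example sequence is just a list sorted by deploy day (again, purely illustrative):

changes = sorted(
    [Change("Big", 0, 1), Change("Small", 10, 15), Change("Small", 14, 17)],
    key=get_day_deployed,
)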

Total product velocity in a change sequence

Consider an arbitrary change sequence Changes. What does velocity mean in this context? We can define it pretty much as the inverse of the following function:

def waiting_time(Changes: List[Change]) -> float:
    return sum(get_day_deployed(change) for change in Changes) / len(Changes)

A few notes about this function:

  1. You can intuitively understand this function as calculating the average number of days into the process at which a change was deployed. Minimizing this time is good (changes came out faster). The inverse of this function is pretty much product velocity! (See the quick example after this list.)
  2. Important: we don't weight Big and Small changes differently from a "value" perspective. We've had small changes that have delivered a ton of value to users (stopping the sheet from crashing), and big changes that have delivered 0 value to users (internal state refactors). We generally don't want to optimize for big changes - we want to optimize for big improvements!
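
For what it's worth, running waiting_time on the example sequence above (with the illustrative types from earlier) looks like:

# (1 + 15 + 17) / 3 = 11.0 days, on average, until a change reached users
print(waiting_time(changes))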

Your proposals in the above context

For any Small change, you're looking to minimize get_day_deployed(change) - get_day_finished(change). As we're not changing the day finished [at least by base assumptions], this necessarily decreases the day deployed, which in turn reduces waiting_time (and thus maximizes our velocity).

So - what's the deal? It doesn't seem like there are any issues with this proposal in the above model - but that is notably because we left out any notion of cost! Let us now add costs to this model. Specifically, we will be considering costs to developers and costs to users.

Costs to developers

For any Big change: we pay a large deployment cost, Cost_Deploy_Big (1.5 days in the worked example below), and with some probability (.2 below) a bug still slips through, in which case we also pay Cost_Bug_Big to track it down and fix it.

For any Small change: we pay a much smaller deployment cost, Cost_Deploy_Small (.2 days below), and with a smaller probability (.05 below) we pay Cost_Bug_Small.

Change-induced cost sequence

Thus, for a set of Changes, we have an induced cost C, that in expectation looks like:

Changes = Big_0_1, Small_10_15, Small_14_17, ...
Expected_Cost = (Cost_Deploy_Big * 1 + Cost_Bug_Big * .2)
              + (Cost_Deploy_Small * 1 + Cost_Bug_Small * .05)
              + (Cost_Deploy_Small * 1 + Cost_Bug_Small * .05)
              = (1.5 + .1) + (.2 + .025) + (.2 + .025)
              = 2.05

Thus, with this specific release sequence: in 17 days of work, we spend ~2 of them paying the cost of deployment.
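
As a sanity check, here's that expected-cost arithmetic as a small helper. The deploy costs and bug probabilities are the illustrative constants assumed above; the 0.5 bug costs are just backed out of the .1 and .025 terms, not measured numbers.

# Illustrative constants, in developer-days
COSTS = {
    "Big":   {"deploy": 1.5, "bug": 0.5, "p_bug": 0.2},
    "Small": {"deploy": 0.2, "bug": 0.5, "p_bug": 0.05},
}

def expected_cost(changes: List[Change]) -> float:
    return sum(
        COSTS[change.size]["deploy"]
        + COSTS[change.size]["bug"] * COSTS[change.size]["p_bug"]
        for change in changes
    )

print(expected_cost(changes))  # (1.5 + .1) + (.2 + .025) + (.2 + .025) ~ 2.05 developer-days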

This "static cost" calculation is the obvious way to think about the cost of more deployments, and it is misleading. The main benefit of this model is that it makes clear the limitations of just calculating just this single number - see below!

Bringing costs back into product velocity

So, how do we include these costs in our calculation of velocity generally, given that these costs are borne in developer time? First, let us note the simplifying assumption that there is one developer, who can work on one thing at a time (2 devs just shifts things by a constant factor, which we can include in our constants, which are off by some amount anyways).

Here's how we do it.

  1. Start with the "idealized" change sequence - how things would go out if deployment was free.

    Changes = Big_0_1, Small_10_15, Small_14_17, ...
  2. Adjust each change by the cost of deployment, in real time

    Changes_adj = Big_0_1, Small_10_15 + (1.5 + .1), Small_14_17 + (1.5 + .1) + (.2 + .025), ... = Big_0_1, Small_11.6_16.6, Small_15.825_18.825

    That is, the cost of deployment bumps the release date of each later change by some amount - and notably this cost is borne by all later releases - which in turn bumps up waiting time! The more releases in the future, the more an extra deploy today adds to waiting time.
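
Concretely, the adjustment step could be sketched like this (reusing the illustrative types and constants from above):

def adjust_for_deploy_costs(changes: List[Change]) -> List[Change]:
    # Each deploy's expected cost pushes back every *later* change's
    # finish and deploy day, since the same developer pays it.
    delay = 0.0
    adjusted = []
    for change in changes:
        adjusted.append(Change(change.size,
                               change.day_finished + delay,
                               change.day_deployed + delay))
        c = COSTS[change.size]
        delay += c["deploy"] + c["bug"] * c["p_bug"]
    return adjusted

# Big_0_1 stays put; Small_10_15 -> Small_11.6_16.6; Small_14_17 -> Small_15.825_18.825
print(adjust_for_deploy_costs(changes))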

Some conclusions from this simple model

So, what can we take away from all the above? A few things, some of which are obvious and some of which are actually really goddamn interesting:

Ways to increase velocity

First, this model points us to a few ways of increasing velocity:

Minimizing undeployed WIP

Minimize the difference get_day_deployed(change) - get_day_finished(change). This captures your suggestions - and generally can be understood as minimizing undeployed WIP!

Minimize costs.

Cost_Deploy_Big is the most expensive, and could be optimized so it is more efficient (e.g. we could hire permanent testers, so we don't have to go through the process of rehiring people; we could also ask them to list bugs instead of having to watch videos - which could cut down the cost by a few hours). Minimizing the cost of bugs is much harder.

Minimize prob of costs

Minimize the probability of having to pay these costs. This is impossible for the deployment cost itself (we always pay it), but not so for the probability of bugs occurring. Better automated testing, etc. all reduce this probability.

Do fewer deploys

One really interesting feature that this model illuminates that isn't immediately obvious: an extra deploy delays all future deploys by the time of that extra deploy.

This is actually pretty massive - because if we just use the waiting_time metric that we use currently, and if we assume that our company is gonna exist for an unbounded time [an unbounded number of deploys in the future], then a single extra deploy increases total waiting time, summed over all those future deploys, by an unbounded amount.

Pathologically, this means that the best option to minimize waiting time is to never deploy ever until all-the-changes-ever are finished, and then do one massive deploy once they are all done. Of course, this is a failure of the model, and not a real conclusion. So how might we fix it? There are a few ways.

First, we could introduce a discount factor - and say that deploys close to now are better than deploys later in the future. This is interesting, as it turns the waiting time calculation into a sum of future discounted cash flows. This is the most reasonable path forward here, in terms of fixing the model.
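
One way to write that down, as a sketch only: treat each deploy as delivering some value, and discount that value by how far out the deploy lands (the per-change value and daily discount rate below are made-up placeholders). We would then maximize this instead of minimizing waiting_time - and "never deploy anything" now scores zero instead of looking optimal.

def discounted_value(changes: List[Change],
                     value_per_change: float = 1.0,
                     daily_discount: float = 0.99) -> float:
    # Value delivered on day d is worth value * daily_discount ** d today,
    # so deploying sooner is strictly better than deploying later.
    return sum(
        value_per_change * daily_discount ** get_day_deployed(change)
        for change in changes
    )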

In any case, we still end up with a similar conclusion: a single extra deploy leads to costs for many future deploys - which might make this extra deploy not worth it. Notably, whether it is worth it depends entirely on constants that are quite finicky, so we could probably make the numbers say whatever we want here - but the point stands: it's not clear that getting smaller deploys out faster will lead to an overall increase in velocity, due to the cost of the extra releases it will cause.

But more deploys might lead to more efficient deploys

Ah, you say, but this is another failing of the model! It misses that if we do lots of releases, we will get really good at doing releases, and invest in/optimize the process. This is the theory of continuous deployment - if we just release a ton, all the time, it will become a non-event!

One really important thing that I left out above, but that is in fact a major consideration for how our release process currently works, is that our releases have goals other than just testing our software!

Namely, we have both explicitly agreed that we want and need to spend more time:

  1. Using our software.
  2. Watching other people use our software.

Notably, we both agree this is something that we struggle to do well. Neither of these things is super fun, so IMO forcing ourselves to do them is actually a good thing, and I think we should continue to use the release process to do that.

This is a fundamental point IMO: our software process should not just be optimized for speed. It should be optimized to allow us to deliver the highest-quality software, which in turn requires us spending time doing things that aren't necessarily fun or sexy.

To be clear, this isn't related to your suggestion. I just needed to make this point so I can talk about what type of changes to the deployment process I do want.

My conclusions

As noted above, it is unclear from the model whether it is worth having more deploys to minimize the waiting time for small changes. I personally feel this uncertainty as well - I don't think making these smaller releases faster would be worth it, but I am also very uncertain about this. The constants are very fiddly, and we can make them say whatever we want anyway.

If you forced me to, I'd actually say that we would get more productive as a team (and minimize waiting time) if we:

  1. Synced up product and development.
  2. Got on a super regular release cycle of X weeks. Note that this does not mean we change the incremental approach to writing code (e.g. the plan of attack) - but just how we're releasing it.
  3. Made it so our workflows actually dogfood the new product.

I think these changes are out of scope of this issue, and notably come with large costs [mostly of coordination] of their own.

What I would like to do in terms of the release process is make changes that undeniably decrease waiting time (e.g. changes that don't require the constants to work out to make sense), while also not conflicting with our other goals of forcing ourselves to use our software more and watch others use our software.

What would that look like? I am not exactly sure, and I want to think about it. Let me know if you have thoughts on those undeniable changes, though!

jake-stack commented 2 years ago

This is cool. If you think the model is broadly useful, let's turn it into a blog post.

Also, my note is that testing seems to take a large amount of time.

"For these PR's we ensure to do thorough testing that takes almost a week to complete "

One idea is that we release a "non-tested" version to a group of power users and let them use the tool for a week -- we would catch the large bugs and probably miss some small ones, but we would also save a lot of time on testing that could be used to develop other features. YC seems to have some guidance on how it's okay to release some bugs.