jupyterhub / team-compass

A repository for team interaction, syncing, and handling meeting notes across the JupyterHub ecosystem.
http://jupyterhub-team-compass.readthedocs.io

Identify credits for the next year of `gke.mybinder.org` #463

Closed · choldgraf closed this issue 2 years ago

choldgraf commented 3 years ago

Proposed change

Our annual allotment of credits for gke.mybinder.org runs out in late December (I believe December 22nd, 2021). We won't have spent the credits down to zero by then, but they will expire on that date.

We need to identify where another round of funding for gke.mybinder.org will come from.

Draft of two-pager

See here for a draft two-pager to send to Karan.

Action plan

We have 3 known options to power gke.mybinder.org:

  1. GCP credits, if they extend them by another year
  2. ~Jupyter Meets the Earth grant, if we can get approval by @fperez depending on whether they're in scope~ - this is likely not an option because it only runs on AWS
  3. Pangeo cloud credits

Here's the current plan:

Regardless of all this, we need to publish a blog post about the current situation, because it is clearly unsustainable (at least for me it is, and I assume for others as well).

Tasks to complete

choldgraf commented 2 years ago

Update: conversation with Karan and the GCP Research team

@consideRatio and I had a conversation with Karan from Google Cloud. He said that he was hopeful they'd be able to fund gke.mybinder.org for another round of cloud credits. In order to explore this, he'd need a 2-pager style document that demonstrated the impact of mybinder.org as well as the costs over time.

In particular, they care about things that demonstrate diverse and worldwide impact, like:

I've updated the top comment with some next steps about putting together this 2-pager

choldgraf commented 2 years ago

I am going to try putting together a 2-pager ASAP that we can send to Karan, because 1.5 months is not that much time for us to get another round of funding. I would really appreciate any suggestions or help from others! Here are a few things that would be useful:

minrk commented 2 years ago

I'll work on gathering some analytics

MridulS commented 2 years ago

@minrk I did some work here https://gist.github.com/MridulS/5accc696311c4f381c05cb70922d3624

(screenshot of the launch analysis output, 2021-11-17)
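
A minimal sketch of this kind of launch-count analysis, using mybinder.org's public events archive (the URL pattern and field names below are assumptions on my part; the gist above is the authoritative version):

```python
# Minimal sketch (not the gist above): pull daily launch events from the
# public mybinder.org events archive and count launches per provider.
# The archive URL pattern and field names are assumptions; adjust as needed.
import pandas as pd

days = pd.date_range("2021-10-01", "2021-11-15", freq="D")
frames = []
for day in days:
    url = f"https://archive.analytics.mybinder.org/events-{day:%Y-%m-%d}.jsonl"
    try:
        frames.append(pd.read_json(url, lines=True))
    except Exception:
        continue  # skip days that are missing or fail to download

events = pd.concat(frames, ignore_index=True)
print("total launches:", len(events))
print(events.groupby("provider").size().sort_values(ascending=False))
```
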
minrk commented 2 years ago

Nice! I'll look at getting region data from matomo

sgibson91 commented 2 years ago

I wonder if we can deploy this to the federation too, to make info gathering easier in the future? https://github.com/bitnik/binder-launches (Unfortunately, the link to the instance at GESIS no longer seems to be up)

minrk commented 2 years ago

Still working on analytics, since I've never really dug into matomo before, but here are monthly visits to GKE by continent:

(chart: monthly GKE visits by continent)

and summing all non-NA-EU together shows it's about 50% NA, 30% EU, 20% rest-of-world:

(chart: visit share by region)

https://gist.github.com/28e6c3aeb9e7a208e0986a67892e912d
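
For anyone who wants to reproduce plots like these, a rough sketch of the kind of Matomo Reporting API query involved (the host, site id, and token below are placeholders; the actual queries behind the plots may differ):

```python
# Rough sketch: monthly visits by continent from the Matomo HTTP Reporting
# API. MATOMO_URL and idSite are placeholders for the real deployment.
import os
import requests

MATOMO_URL = "https://<matomo-host>/index.php"  # placeholder
params = {
    "module": "API",
    "method": "UserCountry.getContinent",
    "idSite": 1,                                 # placeholder site id
    "period": "month",
    "date": "2021-01-01,2021-11-30",
    "format": "JSON",
    "token_auth": os.environ["MATOMO_TOKEN"],    # read-access API token
}
resp = requests.get(MATOMO_URL, params=params, timeout=30)
resp.raise_for_status()
for month, rows in resp.json().items():
    print(month, {row["label"]: row["nb_visits"] for row in rows})
```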

choldgraf commented 2 years ago

Hey all - thanks for these very helpful graphs! I tried to reproduce some of them but I cannot figure out how to get the Matomo secret to access that data (here's an issue I opened as a result: https://github.com/jupyterhub/team-compass/issues/473#issuecomment-974558681).

Can anybody help me get access to the Matomo data so that we can include country information in this report?

choldgraf commented 2 years ago

Update: draft is ready

Hey all - I took some of the plots here (some directly, some as inspiration) and put together the 2-pager at the link below:

https://docs.google.com/document/d/1DvW8TYgEVWYvsgZKlr4JrmuhLQoYC-jTie0okgnIjp0/edit?usp=sharing

I'd love feedback from folks if they think this looks OK. The goal of the 2-pager is to demonstrate the impact and usage of Binder, but it doesn't need to go into a ton of detail. Also note: I couldn't figure out how to get Matomo data myself, so I just went with copy/pasting Google Analytics images, but I'm happy (and would prefer) to use Matomo data if somebody can help me get access to it.

I've uploaded some archive launch data + the notebooks to visualize it here: https://github.com/choldgraf/binder-meta

If people would like to make any changes etc to those notebooks, PRs are welcome!

choldgraf commented 2 years ago

Update: sent to Karan for feedback

I know that this is a short turnaround, but we only have about a month before gke.mybinder.org runs out of credits, so I have sent the two-pager above to Karan for feedback, to see if we need to add anything before he submits it internally. I've cc'ed @minrk (as team lead) and @consideRatio (since he's been helping with the GCP Binder move lately) on the email. Will report back with relevant information.

betatim commented 2 years ago

Thanks a lot for putting the numbers together and adding words! I think it is good enough that we could send it already, so now we have a bit of time to make it even better.

I read the draft and left a few comments. Most of them are suggestions/nitpicks.

One thing I was wondering is whether we can show/say something about an exciting/new area that is growing in terms of mybinder.org usage. The prime example that comes to mind is executable books, where we provide a crucial bit of infrastructure for courses and educational books from around the world, letting them do something that is otherwise super hard to do (executable sections in a textbook). But I am not deep enough into the executable books project/user base to know whether there are a handful of neat projects we could point to. Not in detail, but as a "this is a new area that is growing and super cool!"

choldgraf commented 2 years ago

Hey all - I have still not heard any specific response from Google, and so I want to start contingency planning for what to do if we do not get new credits in time. Here is what I propose:

Timeline for running out of credits

This will reduce mybinder.org's capacity by about 75%, but I think that's just the reality that we face. Unless somebody has access to a large amount of Google Cloud credits that we can hook into gke.mybinder.org, I'm not sure what else we can do.

betatim commented 2 years ago

Sad times.

We will also need to find a new host for https://github.com/jupyterhub/mybinder.org-deploy/tree/master/images/federation-redirect and tweak its configuration so it will continue to work without GKE as the "prime". I think it makes sense to use the OVH deployment as the new prime site. I think these tasks need to happen to make the move:

I think we can run two instances of the federation proxy in parallel without weird stuff happening. This means it shouldn't be a huge interruption to users.

Where/how should I give feedback on the blogpost draft?

Should we now tweet about the upcoming change? If we do, we give people (who read the tweet) only about 48h notice, which isn't a lot. But hopefully there aren't too many people who rely on gke.mybinder.org explicitly or were planning big demos or some such. I think it would be a good idea to do so.

choldgraf commented 2 years ago

We are scrambling a bit to see if we can make up any extra funding from a different source. I also hope to have a more definitive answer from GCP by the end of day US/Pacific. There are two potential other funding sources we might be able to use in a stop-gap fashion.

Neither is a long-term solution; both are more like a 1-month stopgap to keep the lights on.

Here's my proposed plan:

REMOVED here and added to the top comment above

I'll update the top comment with this plan for visibility

betatim commented 2 years ago

What does "deploy with Pangeo funding" mean? Switching billing accounts or deploying to a new cluster or something third?

For anything beyond "Switch billing accounts" I think we should start moving the federation proxy as it will be good to have that somewhere else in either case. And it is something we can start doing instead of waiting for the clock to tick down. The closer we get to the lights going out the more hectic things will get, the more hectic things get the more mistakes we will make, the more mistakes we make the more hectic it will get, etc :D So I think starting to move now is worth it.

choldgraf commented 2 years ago

@betatim yep, Pangeo has some grant funds parked at Columbia which are earmarked for a Binder deployment, and we can realistically say it is in scope for that grant to pay for a short period of mybinder.org. However, it'd require setting up a new project under the Columbia.edu cloud org and re-deploying gke.mybinder.org there, which is why it is the least-preferred option.

betatim commented 2 years ago

(sorry I edited my last comment above for a long time without clicking "save")

betatim commented 2 years ago

Has anyone asked the current members of the federation how much spare capacity we have there? Maybe we can increase our allocations there to make up for the lost capacity at GKE.

cc @MridulS for gesis, @sgibson91 for Turing (can you tag the right new person please?) and @mael-le-gal for OVH

choldgraf commented 2 years ago

re: action plan, my thinking was to wait until 6PM US/Pacific to gather information before making a decision/taking next steps, since most (all?) people that would be taking action are already in time zones where it is nighttime anyway, and this would allow us to spring into action in people's morning if necessary.

re: other federation members, we've asked the Turing, but apparently they're close to running out of credits themselves :-/ Not sure about GESIS or OVH.

For some quick numbers from status.mybinder.org, it seems like we are talking about a pretty hefty increase in capacity to make up for it:

(screenshot of launch counts from status.mybinder.org)

betatim commented 2 years ago

Fine with me to wait until tomorrow morning to take action. If possible we should try to decide before then what it is we will do, to give as many people as possible a chance to 👍 or 👎. For example, if we don't try to decide ~now/soon, then tomorrow morning in Asia and then Europe we would need to either decide without considering US people or wait a long time for them to wake up.

I think the distribution of user pods across the clusters is biased by two things: the quota we set, and the fact that until the quota is hit we send all launch requests for one repo to the same federation member. This means whoever gets a popular repo (like the top ones from try.jupyter.org) will get a lot of traffic in one block, and if you don't get that you might be missing a big block.

Something else I don't remember us ever putting much thought into is the actual quotas. Should we rethink them? For example, GESIS is a big cluster and we set a quota of 20. Typo? Decision? Possible to increase by a factor of 5 or 10? I don't know, so it would be cool to discuss that.

choldgraf commented 2 years ago

I agree with your assessment, though I think realistically I'm the only US-based person who is likely to be paying close attention here, and I don't understand the current cloud setup well enough to know what the right technical decisions are. I trust whatever the team decides if we get to that point. My focus is going to be on trying to get a definitive answer from GCP and the JMTE grant, since those would be our "least work" solutions if they are possible

sgibson91 commented 2 years ago

@betatim Turing currently only has $800 😕 @callummole is following up (I don't know Luke's GitHub handle)

manics commented 2 years ago

Could we redirect the jupyter/try repos to JupyterLite https://discourse.jupyter.org/t/https-jupyter-org-try-so-slow-why/11288/7 ?

choldgraf commented 2 years ago

Update: The JMTE grant likely won't work

I just heard from @consideRatio that the JMTE grant is only earmarked for AWS, not GCP, so this is likely not a reasonable short-term solution.

choldgraf commented 2 years ago

@manics sure that sounds great to me, provided the environment still works the same

minrk commented 2 years ago

Could we redirect the jupyter/try repos to JupyterLite

That would be a huge load off, and an interesting approach to take. Even if they're not the same, shifting that traffic, even as a temporary measure, might help, since it would greatly reduce the short-term cost of operating the main cluster. We can decide after things are stable whether we want to keep it that way.

I should be around to work on this during the day in Europe tomorrow, whatever the decision ends up being. If I have the necessary permissions!

callummole commented 2 years ago

Turing only has £813.36 remaining. I may be able to get about £5k approved quickly (i.e. before Weds, I think), but bigger chunks need a panel and I'm not sure whether the relevant people have already started their Christmas breaks.

minrk commented 2 years ago

The main truly GKE things (I think) are:

  1. the events archive, and
  2. the matomo managed SQL backend

If we lost those, even temporarily, moving to another federation member as the 'lead' would keep things from falling down, even if we had vastly diminished capacity. That shouldn't be too hard to change in the repo config.

Our event data 'going dark' isn't the best, but it's not the worst thing in the world, especially if it's temporary. Plus, the current time of year is also our lowest traffic, typically.

manics commented 2 years ago

@bollwyvl What would it take to replace some of https://jupyter.org/try with JupyterLite?

TLDR: The main mybinder.org federation member may run out of credits on Wednesday! If it does, we want to reduce traffic as much as possible.

minrk commented 2 years ago

I opened https://github.com/jupyter/jupyter.github.io/issues/513 for shifting links to JupyterLite

bollwyvl commented 2 years ago

Chimed in over on the issue: happy to help with whatever we need.

minrk commented 2 years ago

I'm going to run https://github.com/jupyterhub/mybinder.org-deploy/pull/2012 to delete very-old images (before 2021-02-01) because our storage costs are currently $1600/mo and it takes a very long time to delete thousands of images from an image registry. So I want to have a lot of the super old stuff out of the way before the clock is really ticking and our only choice to stop bleeding money is to shut it all down. Plus, then we'll have a better sense of how long it will take to clear it all out.
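
For illustration only, the approach amounts to something like the sketch below (this is not the script in the PR above; the repository path is a placeholder):

```python
# Rough illustration: delete registry images last pushed before a cutoff
# date, by shelling out to gcloud. Not the actual script from the PR.
import subprocess

REPOSITORY = "gcr.io/<project>"   # placeholder
CUTOFF = "2021-02-01"

def run(*args):
    """Run a gcloud command and return its stdout split into tokens."""
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout.split()

images = run("gcloud", "container", "images", "list",
             f"--repository={REPOSITORY}", "--format=value(name)")
for image in images:
    digests = run("gcloud", "container", "images", "list-tags", image,
                  f"--filter=timestamp.datetime < '{CUTOFF}'",
                  "--format=get(digest)")
    for digest in digests:
        # --force-delete-tags removes any tags still pointing at the digest
        subprocess.run(["gcloud", "container", "images", "delete",
                        f"{image}@{digest}", "--force-delete-tags", "--quiet"],
                       check=True)
    print(f"{image}: deleted {len(digests)} old digests")
```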

MridulS commented 2 years ago

For the GESIS deployment, yes we can take much more than 20 pods (it's 80 now; I'll up it to 100 later if everything is stable). But the way the k8s cluster is currently designed/VM'd (one huge server + a couple of small VMs, so all the Binder pods end up running on one node!), the cluster hits issues (specifically with creating and deploying containers) when too many concurrent requests are sent GESIS's way.

I'll try to come up with something tomorrow EU morning (maybe add a couple of test servers to the cluster to increase the capacity, if possible).

choldgraf commented 2 years ago

Hey all - just wanted to give another update since I am signing off in a sec.

I still don't have confirmation of credits from anybody at GCP. I do have a few leads and am having some more conversations tomorrow. I continue to get generally positive signals, but nothing definitive and no timeline.

Since I'll likely be asleep while folks are discussing this, I'm fine with whatever decision folks take to move forward. If we can get the GKE deployment down to as low a monthly rate as possible, I can foot the bill for a few weeks and try to reimburse myself from a grant somewhere. That might be a reasonable stopgap to keep us from having to move over all the matomo/archive/etc stuff. As long as we get the GKE cost to less than, say, $500 a month, I am OK with that as a short-term solution.

There are two other things we'd discussed that aren't encoded in the top comment:

  1. It sounds like gesis can have a modest capacity boost. @MridulS maybe you can coordinate with folks about a reasonable limit to set?
  2. It sounds like we can try directing people to JupyterLite for the top links on try.mybinder.org. I think we should try this tomorrow, as it would cut our cloud usage by about half. I am happy to spend some time tomorrow working on notebook documentation for those repositories, if any changes need to be made. Maybe that's something that @bollwyvl or @jtpio could help with too?

Perhaps some combination of the above two things + finding a small pot of money somewhere will give us enough breathing room to move forward for a bit.

I'll check my email again before bedtime in case I hear back from anybody. If anybody else knows of an easy way to get quick cloud credits on gcp, please chime in - at this point even a few thousand would help us have a little time so we don't have to stress over the holidays (assuming we could just connect it to our project)

minrk commented 2 years ago

I'm also working on some leads for the small pot of funding that would let us keep everything as-is for the short term, and on reducing costs to make that an easier goal.

FWIW, the Docker registry API image deletion was going to take at least a few days, but I found a simpler way to do it via the gcloud container images API that seems to be going way faster.

callummole commented 2 years ago

Turing now has £5547.87 ($7352) to help with the short term capacity. I can apply for more in the new year.

minrk commented 2 years ago

Terrific, thanks! Should we increase the Turing quota? What do you think is a reasonable new capacity? 150 or 200? Do you know what the monthly cost of Turing is now?

minrk commented 2 years ago

Simula can cover up to a month of GKE if we don't come through with any last minute credits, so we don't need to do any drastic migrations. Just need to hook up a new billing account.

For that, I just need to know if I have permission to do that, and exactly what time I should make the switch. Related: are the current credits on the project or the billing account (i.e. would we lose credits if I switched early?)

callummole commented 2 years ago

For the last couple of months the cost has been around $2k per month. So if the current capacity is 80, then perhaps we could cover 200 for a month while we get more funds?
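
A quick back-of-the-envelope check of that, assuming cost scales roughly linearly with the pod quota:

```python
# Back-of-the-envelope check, assuming cost scales roughly linearly with quota.
current_monthly_cost = 2000   # USD, recent Turing spend at quota 80
current_quota = 80
proposed_quota = 200
available = 7352              # USD, roughly £5547.87

projected_monthly = current_monthly_cost * proposed_quota / current_quota
print(f"projected monthly cost at quota {proposed_quota}: ${projected_monthly:.0f}")
print(f"runway: {available / projected_monthly:.1f} months")
# -> about $5000/month, i.e. roughly 1.5 months of runway
```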

sgibson91 commented 2 years ago

Would love for Turing in general to think about how we can increase that cluster. The 80 quota was an arbitrary decision I made when setting it up

callummole commented 2 years ago

@sgibson91 I'll investigate.

minrk commented 2 years ago

@callummole great! Feel free to make a PR like https://github.com/jupyterhub/mybinder.org-deploy/pull/2091 for turing when you come up with a reasonable number. It's okay to bump incrementally, too.

choldgraf commented 2 years ago

@minrk's plan sounds great to me!

Do we still wish to try making a switch to jupyterlite today? I am happy to spend time trying to get the right docs set up to make this happen if we think it'd be a big help

(Slowly waking up and doing kid stuff so will be more online in an hour or two)

minrk commented 2 years ago

I think it's still worth doing. Main downside I see is that if there are issues, we don't want to be scrambling to respond to them over the holiday. Upside: lower traffic over the holiday!

I will be focusing on reducing the cost of keeping GKE as-is alive, without needing to squeeze too hard.

As I see it right now, our costs break down roughly 20% each:

The first 3 are basically 'node costs', so if we reduce the number of nodes running (which is mostly proportional to the number of pods), we reduce all of those. Many of the 'other' fees would similarly scale down as well, but not all. The events archive itself takes up about as much space as 1-2 images, so it's not worth thinking much about.

Right now, I'm looking at whether our current 1TB SSD per node is useful, or if we should turn that down, and the perennial question of a more cost-effective node flavor.

minrk commented 2 years ago

Our core nodes have 250G SSD boot disks, of which the most heavily used has only 8G in use, so that's a bit of a waste. But it pales in comparison to the user nodes, which have two SSDs each: a 1TB main disk and a 400GB disk just for dind builds. Our intermediate-age user node (gke-prod-user-202009-b9c03ca0-0hbq) has used 85% of the main disk and 50% of the dind disk, and it's only 4 days old.

Note: these disks are caches, so they will always fill up eventually. Being full doesn't mean we need them as big as they are; it just means that when they fill up, things like ImageGC kick in to free space. A smaller disk means those will run more often.

minrk commented 2 years ago

*note: Local SSDs are fixed size, and billed as a separate SKU. Only ~$250/mo compared to $2k for the PD-SSDs.

Reducing the boot SSDs to 500GB per node would save ~$1k/mo. Increasing the node size would also save on storage, but we are already running near the limit of ~100 pods per node, so that wouldn't help.
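
A sanity check on that ~$1k/mo figure; the per-GB price and node count below are assumptions rather than numbers from this thread:

```python
# Sanity check on the ~$1k/mo estimate. Assumptions: pd-ssd list price of
# roughly $0.17/GB-month (region-dependent) and about a dozen user nodes
# (the actual node count isn't stated in this thread).
pd_ssd_usd_per_gb_month = 0.17
user_nodes = 12
saved_gb_per_node = 1000 - 500   # shrink the 1TB boot SSD to 500GB

monthly_savings = pd_ssd_usd_per_gb_month * user_nodes * saved_gb_per_node
print(f"~${monthly_savings:.0f}/month saved")   # ~$1020/month
```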

betatim commented 2 years ago

Making the disks smaller sounds like a good plan. One thing I have at the back of my mind is that IO performance is linked to disk size, so maybe that was the reason for having such large disks (and we seem to end up with spare credits at the end of the year anyway -> the time limit of the credits is a bigger factor than the amount). Worse performance is better than no performance though, so yay to smaller disks.

We could also ditch the "two disk" approach and use only the main disk to save even more money. I think OVH has been running in that mode for a while now. It needs a bit of a reconfiguration of the image GC to use an absolute size and not an inode-based threshold.

minrk commented 2 years ago

The local SSD is a relatively small cost, so I'm not sure it's an optimization worth making right now. Since the same capacity on the PD SSD is 5x as expensive, merging the two probably doesn't make sense.

It would be interesting if we could get the host docker onto another local SSD and lose the PD-SSD altogether. That would save tons, at the cost of fixed capacity per node. Not sure if that's possible, though.

choldgraf commented 2 years ago

I just got off the phone with the Google OSPO office. They are working to find stop-gap funding (maybe 6 weeks or so) in order to keep Binder running through January (though, if we can bring down the costs, we might be able to extend this a few months). That would buy us some time to work out a longer-term solution that is more sustainable for us (and for them) than the "every year we frantically email people we know at Google" approach we've taken thus far. No promises from them, but I'm hopeful we'll work something out and will report back here as I learn more.

minrk commented 2 years ago

New quota increases from federation members have greatly reduced load on GKE prod. I've helped encourage scale-down a little with some cordoning. But I think between (ongoing) stale image deletions and load redistribution, we're looking at at least a few thousand dollars saved today on the monthly bill.