gitpod-io / gitpod

The developer platform for on-demand cloud development environments to create software faster and more securely.
https://www.gitpod.io
GNU Affero General Public License v3.0

Recommend a workspace-class that is smaller than g1-standard and test feasibility #12708

Closed kylos101 closed 2 years ago

kylos101 commented 2 years ago

Is your feature request related to a problem? Please describe

From a harvester preview VM or workspace-preview, we want to see whether a small workspace class (smaller than g1-standard) performs well enough for simple workloads. If the experience is good, we'd want to amend webapp and workspace to support a new "small" workspace class before enabling UBP and workspace classes for individuals.

Internal context More internal context

cc: @mbrevoort @atduarte

Describe the behaviour you'd like

Idea: alter the ws-manager and ws-daemon configs, and test a regular workspace with a class that uses a CPU limit of 1 and a CPU burst of 2 (ws-daemon), a memory request of 2Gi and a memory limit of 4Gi (Kubernetes), 15Gi of storage, and 5Gi of ephemeral storage. Also, please be sure to enable disk IO limiting on the related ws-daemon, so that the workspace is limited similarly to how it would be in production.

Then, once you're able to start workspaces using the above config, test that the configured small workspace class works well enough as a regular workspace to develop our website. For example:

  1. Make code changes
  2. How does the file watcher pick them up?
  3. Run tests
  4. Commit code
  5. Use extensions associated with the project

What about our Gitpod repo? How does it behave from a development standpoint? The assumption is that it will not be a great experience.

Describe alternatives you've considered

Some extensions are resource intense. @akosyakov , can you think of any extensions that are common, but hungry, that we should include as part of this test?

Additional context

Will this be useful enough for JetBrains? @akosyakov wdyt? I assume no, because this workspace will not meet minimum requirements.

Definition of done

If feasible for users, a smaller workspace-class recommendation is shared and agreed with Product and Finance teams, and related issues are added to groundwork to support the related deployment.

mbrevoort commented 2 years ago

Overall, we are trying to assess whether there is a viable workspace resource size below g1-standard and, if so, what types of projects it could support. The resources described above seem like a great place to start.

mbrevoort commented 2 years ago

In addition, how would such a resource profile affect pod scheduling, resource utilization and noisy neighbor conditions?

akosyakov commented 2 years ago

Some extensions are resource intense. @akosyakov , can you think of any extensions that are common, but hungry, that we should include as part of this test?

java, golang, typescript. Keep in mind that how hungry they are depends on the size and complexity of the project, not on the extensions themselves.

Furisto commented 2 years ago

Tested this with the gitpod-io/website repository. Running tests and making code changes works well, but trying to build the website leads to OOM kills or network disconnects because not enough memory is available.

kylos101 commented 2 years ago

@mbrevoort good questions!

In addition...

how would such a resource profile affect pod scheduling

The pods would be scheduled to a dedicated node pool for the small workspace class, so, they wouldn't impact pods being scheduled to other node pools (like standard and large).

Initially we'd control density via memory requests, but, may want to consider CPU requests, too. For example, memory requests would be 2Gi, and CPU might be 1 core, or .5 cores, etc.

I do wonder how many small workspaces we could fit on a node? But, first we need to find a size that works for one workspace. And then we can talk about achieving a desired density. For example, if we run more than 18 small workspaces, perhaps 36, we'd have to reduce the related disk IO bandwidth.
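To make the density question concrete, here is a minimal back-of-the-envelope sketch. It assumes memory requests are the only thing limiting how many workspaces land on a node and that disk IO bandwidth is split evenly between them. The node size (40 GiB allocatable) and node bandwidth (720 MiB/s) are hypothetical placeholders, not values from our node pools.

```python
GIB = 1024 ** 3


def workspaces_per_node(allocatable_memory_bytes: int, memory_request_bytes: int) -> int:
    """Upper bound on workspaces per node when only memory requests limit density."""
    if memory_request_bytes <= 0:
        raise ValueError("memory request must be positive")
    return allocatable_memory_bytes // memory_request_bytes


def disk_bandwidth_share(node_bandwidth_mib_s: float, workspace_count: int) -> float:
    """Per-workspace disk bandwidth if the node's bandwidth is split evenly."""
    if workspace_count <= 0:
        raise ValueError("workspace count must be positive")
    return node_bandwidth_mib_s / workspace_count


# Hypothetical node: 40 GiB allocatable memory, 720 MiB/s of disk bandwidth.
print(workspaces_per_node(40 * GIB, 2 * GIB))  # 20 workspaces at a 2Gi request
print(disk_bandwidth_share(720, 18))           # 40.0 MiB/s each at 18 workspaces
print(disk_bandwidth_share(720, 36))           # 20.0 MiB/s each at 36 workspaces
```

Doubling the density from 18 to 36 halves each workspace's share, which is why the disk IO limit would need to come down with it.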

resource utilization

The way this is written now, you'd get 1 CPU, and be able to burst to 2 (controlled by ws-daemon), and you wouldn't be able to use more than 4Gi of memory (controlled by Kubernetes).

and noisy neighbor conditions?

@Furisto can you think of anything new that we might have from a risk perspective, by having a higher density of workspaces?

kylos101 commented 2 years ago

Tested this with the gitpod-io/website repository. Running tests and making code changes works well, but trying to build the website leads to OOM kills or network disconnects because not enough memory is available.

Okay, good to know, @Furisto ! Bummed to hear, though. :wink:

Furisto commented 2 years ago

@Furisto can you think of anything new that we might have from a risk perspective, by having a higher density of workspaces?

Higher chance of WorkspaceStuckInStopping alerts because more workspaces mean more backups and we have a concurrency limit for backups. On the other hand the backups should be smaller, so they will go faster.
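A rough way to reason about that trade-off is sketched below. It is a deliberately simplified model, not how ws-daemon schedules backups: it assumes every workspace on the node stops at once, backups run in fixed-size waves up to the concurrency limit, and each backup takes the same time. The limit of 6 and the per-backup durations are hypothetical.

```python
import math


def backup_drain_seconds(workspace_count: int, concurrency_limit: int,
                         seconds_per_backup: float) -> float:
    """Time to finish all backups when at most `concurrency_limit` run at once."""
    if concurrency_limit <= 0:
        raise ValueError("concurrency limit must be positive")
    waves = math.ceil(workspace_count / concurrency_limit)
    return waves * seconds_per_backup


# Hypothetical: 18 workspaces with 60s backups vs. 36 workspaces with 30s backups.
print(backup_drain_seconds(18, 6, 60))  # 180
print(backup_drain_seconds(36, 6, 30))  # 180
```

In this model, twice as many workspaces with backups half the size drain in the same time, so the alert risk depends on how much smaller the backups really are.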

kylos101 commented 2 years ago

@Furisto can you share a link to the branch where you made these related changes? It would be good to get a second pair of eyes 👀 on the related configuration for the experiment.

@mbrevoort would you like any further investigation into this resource permutation, or other resource permutations? The small configuration we originally socialized does not look promising.

jldec commented 2 years ago

Tested this with the gitpod-io/website repository. Running tests and making code changes works well, but trying to build the website leads to OOM kills or network disconnects because not enough memory is available.

@Furisto @kylos101 - Would you be able to share more details about which processes are consuming more memory and causing problems when building the website?
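One way to gather that detail would be to rank processes by resident memory while the build runs. The sketch below parses text in the format produced by `ps -eo rss,comm` on Linux (RSS in KiB) and sums it per command name. The sample input is made up purely to show the format; it is not a measurement from the website build.

```python
def top_memory_processes(ps_output: str, limit: int = 5) -> list:
    """Return (command, rss_kib) pairs, largest first, summed per command name."""
    totals = {}
    for line in ps_output.strip().splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip the header row and anything malformed
        rss_kib, command = int(parts[0]), parts[1].strip()
        totals[command] = totals.get(command, 0) + rss_kib
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Illustrative input only (invented numbers, not real measurements).
SAMPLE = """\
   RSS COMMAND
812340 node
 20480 bash
403200 node
 51200 gopls
"""

for command, rss_kib in top_memory_processes(SAMPLE, limit=2):
    print(f"{command}: {rss_kib / 1024:.0f} MiB")
```

Running it periodically during `npm run build` would show whether a single process or many small ones push the workspace toward the 4Gi limit.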

kylos101 commented 2 years ago

👋 @Furisto in hindsight, in talking with @aledbf , let's change the test as follows:

  1. Make a new branch in the ops repo with these related changes, and share with us here for reference
  2. Deploy an ephemeral cluster
  3. Share cluster access using new-workspace-cluster with @mbrevoort and @jldec , so they can experience the related performance/behavior.

Furisto commented 2 years ago

@kylos101 @jldec Tested this on an ephemeral cluster and the performance looks better there (presumably due to swap). See https://www.loom.com/share/fc6fb07dd05b4621841d90f5a0f41dc8. I have given you access to the ephemeral cluster @jldec

kylos101 commented 2 years ago

@jldec let us know what you think? 🙏

Wow, nice, @Furisto ! As a next step, I recommend you:

  1. Share a link to the branch containing your configuration changes, so we can review them.
  2. Do a test with loadgen and a regular workspace you actively use. For example, fill up a node (leaving enough room for 1 regular workspace), and then connect with a regular workspace to see what the performance is like. I ask because of related limiting/tuning that could be needed.
  3. Think about whether any other tests are needed. 🤔

Depending on output from you both, we'll need to create some issues:

  1. add the class configuration to webapp and workspace
  2. update logic in server, so we can have two pools, one for PVC and one without
  3. update configuration in the packer image, to account for the new pools
  4. update configuration in the ops repo, to build the related instance groups

CC: @atduarte for awareness

jldec commented 2 years ago

Initial (single workspace) evaluation with the gitpod-io/website repo did not reveal any issues. I was able to npm run build and edit content while watching the dev server live-reload. Memory and CPU both crept up toward 100% during the build, but I did not observe any errors.

Furisto commented 2 years ago

This is the configuration that I am using. I just edited the configmap of ws-manager directly:

"g1-standard": {
  "name": "",
  "container": {
    "requests": {
      "cpu": "1m",
      "memory": "3328Mi",
      "ephemeral-storage": "5Gi"
    },
    "limits": {
      "cpu": {
        "min": "1",
        "burst": "2"
      },
      "memory": "4Gi",
      "ephemeral-storage": "5Gi",
      "storage": "15Gi"
    }
  },
  "templates": {
    "defaultPath": "/workspace-templates/g1-standard-default.yaml",
    "regularPath": "/workspace-templates/g1-standard-regular.yaml",
    "prebuildPath": "/workspace-templates/g1-standard-prebuild.yaml",
    "imagebuildPath": "/workspace-templates/g1-standard-imagebuild.yaml"
  },
  "pvc": {
    "size": "15Gi",
    "storageClass": "csi-gce-pd-g1-standard",
    "snapshotClass": "csi-gce-pd-snapshot-class"
  }
},
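For anyone reviewing the values above, a small sanity check can confirm that the memory request stays within the limit and show how close the two are. This is a minimal sketch: `to_bytes` and `memory_request_ratio` are hypothetical helpers, and the parser only understands the binary suffixes used in this fragment (Ki, Mi, Gi), not the full Kubernetes quantity syntax.

```python
import json

_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '3328Mi' or '4Gi' to bytes (binary suffixes only)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count without a suffix


def memory_request_ratio(class_config: dict) -> float:
    """Return memory request / memory limit, rejecting a request above the limit."""
    container = class_config["container"]
    request = to_bytes(container["requests"]["memory"])
    limit = to_bytes(container["limits"]["memory"])
    if request > limit:
        raise ValueError("memory request exceeds memory limit")
    return request / limit


# Only the memory fields from the configuration above are reproduced here.
fragment = json.loads(
    '{"container": {"requests": {"memory": "3328Mi"}, "limits": {"memory": "4Gi"}}}'
)
print(memory_request_ratio(fragment))  # 0.8125
```

With these values the scheduler reserves about 81% of the memory a workspace is allowed to use.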

Here is a video of me using the workspace while the loadtest is running. I am building the project while editing the code and just navigating around. You can see that code completion is not working very well, but otherwise it is fine. Once the build was complete, code completion was ok again.

Furisto commented 2 years ago

@jldec @mbrevoort @atduarte Given the above information, would you like to proceed with adding a smaller workspace class?

kylos101 commented 2 years ago

@Furisto thank you for this bit of info:

You can see that code completion is not working very well, but otherwise it is fine. Once the build was complete, code completion was ok again.

I think that is acceptable given the small workspace class. Appreciate you sharing the result! :+1: Let's wait to get feedback from @jldec @mbrevoort and @atduarte before proceeding.

kylos101 commented 2 years ago

Thanks for your insight and effort, @Furisto ! 💪

@jldec @atduarte we'll close this for now. Let us know if you'd like a small workspace class to be created. As a heads up, we'd have to schedule and ship changes in ~3 repos to introduce a small workspace class.

jldec commented 1 year ago

@kylos101 - I did not see the same long delays as in the video when I tested the website repo with the small workspaces myself. Maybe I had a better connection to the environment (I was in Dublin) or maybe the editor caches in my workspace were more warmed up.

I think another test would be helpful to better understand the behavior, but in general I would not hold up the introduction of small workspaces for this reason alone.