Automatically choose workspace-cluster based on lowest latency.

meysholdt commented 3 years ago

Context: https://github.com/gitpod-io/gitpod/issues/5534#issuecomment-914967098

Problem Statement

We currently have workspace clusters in one region in the EU and one region in the US. To offer service at a good latency (e.g. < 100ms), we will need more clusters, maybe as many as one or two per continent. See https://gcping.com/ for your personal latency to every Google Cloud region. See the GCP network map for available regions and the connections between them.

Prior Art

Collect 'gcping' data from the dashboard by @jankeromnes .

Proposed Solution

The user's web browser should measure the latency for every available workspace cluster and send the measurements to the gitpod-server, so that the server can make an informed decision about what workspace-cluster is best for the user.

Considerations

latency measurement should not slow down workspace startup.
the decision about which workspace cluster to choose should remain with the gitpod-server, because in the future other factors besides latency (for example, cluster health) may influence the decision.

Proposed Design Choices:

to keep workspace startup fast, the latency measurement should be cached, for example in a cookie in the web browser.
to keep workspace startup fast, the latency measurement should preferably not be done when a workspace starts, but when the user visits any Gitpod web page.
every workspace cluster should have a public endpoint that can be "pinged" from the web browser for latency measurement.
the server should make a cache-key and the ws-cluster-endpoints available to the user. The cache-key should encode the public IP address of the user, so that the latency is measured again if the user changes their network.
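
A minimal sketch of how such a cache-key could behave, assuming the server hashes the user's public IP together with the list of cluster names. The function names and the use of FNV-1a are assumptions made for illustration only; a real implementation would more likely use a keyed cryptographic hash so that the IP cannot be recovered by guessing.

```typescript
// Hypothetical sketch: none of these names exist in Gitpod.
// FNV-1a (32-bit) is used only to keep the example dependency-free.
function fnv1a(input: string): string {
    let hash = 0x811c9dc5; // FNV offset basis
    for (let i = 0; i < input.length; i++) {
        hash ^= input.charCodeAt(i);
        hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime (32-bit)
    }
    // Render as 8 hex digits, left-padded with zeros.
    return ("00000000" + (hash >>> 0).toString(16)).slice(-8);
}

// The key changes when the user's public IP or the set of clusters changes.
function makeCacheKey(publicIp: string, clusterNames: string[]): string {
    const sorted = clusterNames.slice().sort();
    return fnv1a(publicIp + "|" + sorted.join(","));
}

// The browser re-measures when the key it stored differs from the server's key.
function needsRemeasurement(storedKey: string | undefined, serverKey: string): boolean {
    return storedKey !== serverKey;
}
```

With this shape, the browser only has to compare two opaque strings and never needs to know its own public IP.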

Example Flow 1:

the user visits gitpod.io/workspaces.
the user's browser receives {"cache-key": "FJJDSKD", "clusters": {"us07": "https://us07.gitpod.io/ping", "sing01": "https://sing01.gitpod.io/ping"}}
the user's browser measures the latency to all clusters in the background and stores the result in a cookie: {"us07": 230, "sing01": 60}
when the user opens a workspace, the cookie will be sent to the gitpod-server and the server will use the latency measurement to choose the best workspace cluster.
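
The browser side of this flow could be sketched roughly as follows. The cookie name `gitpod-latency`, the one-day `Max-Age`, and all function names are assumptions, not part of the proposal. The `ping` function is injected so that the sketch does not depend on a particular runtime; in a browser it might wrap `fetch` against the cluster's ping URL.

```typescript
// Hypothetical sketch of the background measurement in Example Flow 1.
type ClusterEndpoints = Record<string, string>; // cluster name -> ping URL
type Latencies = Record<string, number>;        // cluster name -> round trip in ms
type Ping = (url: string) => Promise<void>;     // resolves once the endpoint answered

const COOKIE_NAME = "gitpod-latency"; // assumed name

// Measure all clusters concurrently; unreachable clusters are left out.
async function measureLatencies(
    clusters: ClusterEndpoints,
    ping: Ping,
    now: () => number = () => Date.now(),
): Promise<Latencies> {
    const result: Latencies = {};
    await Promise.all(
        Object.keys(clusters).map(async (name) => {
            const start = now();
            try {
                await ping(clusters[name]);
                result[name] = Math.round(now() - start);
            } catch {
                // Ignore failures: a missing entry means "no measurement".
            }
        }),
    );
    return result;
}

// Build a string suitable for assigning to document.cookie.
function toCookie(cacheKey: string, latencies: Latencies): string {
    const value = encodeURIComponent(JSON.stringify({ cacheKey, latencies }));
    return COOKIE_NAME + "=" + value + "; Path=/; Max-Age=86400; SameSite=Lax";
}

// Parse the measurement back out of a Cookie header (server side).
function fromCookie(header: string): { cacheKey: string; latencies: Latencies } | undefined {
    for (const part of header.split(";")) {
        const trimmed = part.trim();
        if (trimmed.indexOf(COOKIE_NAME + "=") === 0) {
            try {
                return JSON.parse(decodeURIComponent(trimmed.slice(COOKIE_NAME.length + 1)));
            } catch {
                return undefined; // malformed cookie: treat as absent
            }
        }
    }
    return undefined;
}
```

Because the cookie is supplied by the client, the server would still need to validate its contents before acting on them.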

Example Flow 2:

the user opens a workspace. The cookie is already there. No delay during workspace-start.

Example Flow 3:

the user opens a workspace. The cookie is not there yet. This is the case we want to avoid, but I don't think it can be avoided all the time.
measure the latency. Maybe the measurement can be aborted as soon as the first workspace cluster responds, because the first to respond will also be the one with the lowest latency (duh!). There is a risk that a single measurement is slightly inaccurate and that repeated measurements would give more accurate results, but it seems like a good compromise to preserve fast workspace startup. This way, if no cookie is present, 15 to ~200 ms will be added to the workspace startup time.
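
The "first responder wins" idea could be sketched with `Promise.race`, as below. The function names and the timeout parameter are assumptions; `ping` is again injected rather than tied to a specific browser API.

```typescript
// Hypothetical sketch for Example Flow 3: ping all clusters at once and
// take whichever answers first. Failed pings never win the race.
type PingFn = (url: string) => Promise<void>;

function firstResponder(
    clusters: Record<string, string>, // cluster name -> ping URL
    ping: PingFn,
    timeoutMs: number,
): Promise<string | undefined> {
    // A promise that never settles, used so a failed ping cannot win.
    const never = new Promise<string>(() => { /* intentionally empty */ });
    const attempts = Object.keys(clusters).map((name) =>
        ping(clusters[name]).then(() => name, () => never),
    );
    // Upper bound on the extra startup delay: resolve to undefined on timeout.
    const timeout = new Promise<undefined>((resolve) => {
        setTimeout(() => resolve(undefined), timeoutMs);
    });
    const racers: Promise<string | undefined>[] = [...attempts, timeout];
    return Promise.race(racers);
}
```

A result of `undefined` means no cluster answered in time, in which case the server would fall back to its default choice. In a browser, the remaining requests could additionally be cancelled with an `AbortController` once a winner is known.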

csweichel commented 3 years ago

Excellent idea - but we really don't have time for this right now. We'll want to revisit workspace cluster selection once we make a decision on multi-meta.

jankeromnes commented 3 years ago

Prior Art

Collect 'gcping' data from the dashboard by @jankeromnes .

FYI, that proposal is to temporarily gather ping times to all possible GCP regions, in order to decide "where should we create a brand new cluster next?" (and then stop collecting ping times, make a decision, and create the cluster)

The proposal was not to collect ping times in order to decide "which workspace cluster should be used right now?" -- doesn't GCP's load balancer already do that automatically? How does the US vs EU selection work right now? (I assume it's not some custom code we wrote, but GCP selecting a reasonable cluster automatically -- I would hope this would also work with 3 or more clusters without requiring us to write custom code for this)

bigint commented 3 years ago

I think the selection algorithm is broken. I'm from India, where the nearest location is the EU, but whenever I start a new workspace it gets created in the US region.

I also tried with a VPN from Vienna, and that time it was created in the EU region.

🤔

jankeromnes commented 3 years ago

⚠️ Just to re-iterate: This issue suspiciously sounds like we want to re-implement something as standard as a load balancer.

I don't think we want to implement and maintain custom code that measures latency, caches it, and acts upon this data.

If possible, it would be much preferable to let Google Cloud pick the best workspace cluster automatically(!)

Inspiration: Best practices for Compute Engine regions selection > Use Cloud Load Balancing and Cloud CDN:

Cloud Load Balancing, such as HTTP(S) load balancing, TCP, and SSL proxy load balancing, let you automatically redirect users to the closest region where there are backends with available capacity.

csweichel commented 3 years ago

I don't think we want to implement and maintain custom code that measures latency, caches it, and acts upon this data.

If possible, it would be much preferable to let Google Cloud pick the best workspace cluster automatically(!)

Cloud Load Balancing, such as HTTP(S) load balancing, TCP, and SSL proxy load balancing, let you automatically redirect users to the closest region where there are backends with available capacity.

The reason we need to build/maintain something ourselves is that the StartWorkspace request which would need to be regional does not go through a regional load balancer, because it's issued from server to ws-manager, and not from the (regional) user's browser.

csweichel commented 3 years ago

The minimal steps to make automatic cluster choices would be:

add a kind of "ping" endpoint to ws-proxy, so that e.g. ws-eu18.gitpod.io does not answer with 404
add a getAllRegions function to WorkspaceManagerClientProvider which returns a list of ping URLs and names.
make the dashboard execute the RTT pings as outlined above.
extend the createWorkspace and startWorkspace calls on server so that they take a cluster preference, which would then be passed in via the ExtendedUser and become an admission preference. Note, this way the cluster preference plays nicely with the score and cluster status.
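
As a rough sketch of the last step, the server-side choice could combine the client's preference with cluster status and score like this. The types below are hypothetical and do not mirror Gitpod's actual ws-manager-bridge or ExtendedUser types, and the ordering (status first, then latency, then score) is only one possible policy.

```typescript
// Hypothetical types, for illustration only.
interface ClusterInfo {
    name: string;
    state: "available" | "cordoned" | "draining";
    score: number; // higher is better
}

// Pick a cluster: only available clusters qualify, lower latency is
// preferred, and the score breaks ties or decides when no latency is known.
function chooseCluster(
    clusters: ClusterInfo[],
    latencies: Record<string, unknown> | undefined,
): ClusterInfo | undefined {
    const candidates = clusters.filter((c) => c.state === "available");
    // Client-supplied values are untrusted: anything odd counts as "unknown".
    const latencyOf = (c: ClusterInfo): number => {
        const v = latencies ? latencies[c.name] : undefined;
        return typeof v === "number" && isFinite(v) && v >= 0 ? v : Infinity;
    };
    const ranked = candidates.slice().sort((a, b) => {
        const la = latencyOf(a);
        const lb = latencyOf(b);
        if (la !== lb) return la < lb ? -1 : 1;
        return b.score - a.score;
    });
    return ranked[0];
}
```

Keeping the final decision in one server-side function means the latency preference stays a hint: a cordoned cluster is never chosen, however low its latency.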

Offline we discussed the option of making the workspace cluster (or region) choice explicit on the dashboard. By default we'd select the cluster with the lowest RTT (as outlined above).

However, focusing on the individual cluster instead of a region has several drawbacks:

it's noisy on the dashboard because clusters change very often (with every new workspace deployment)
we need to measure often because of the many cluster changes

Instead, we could introduce a region to clusters. We'd introduce a new region field as admission constraint and on the ws-manager-bridge API. New cluster registrations could provide the region when they're registered. We'd assume that from a latency perspective all regional clusters are equivalent, i.e. a measurement to one cluster is equivalent to that of another within the same region.
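
A minimal sketch of the region idea, assuming each registered cluster carries a `region` string and a ping URL (both field names are assumptions): the dashboard pings one cluster per region, and the measured region latency then applies to every cluster in that region.

```typescript
// Hypothetical shape of a cluster registration that includes a region.
interface RegisteredCluster {
    name: string;
    region: string;
    pingUrl: string;
}

// One ping target per region: the first registered cluster represents it.
function pingTargetsByRegion(clusters: RegisteredCluster[]): Record<string, string> {
    const targets: Record<string, string> = {};
    for (const c of clusters) {
        if (!(c.region in targets)) {
            targets[c.region] = c.pingUrl;
        }
    }
    return targets;
}

// All clusters in the lowest-latency region are equally good candidates.
// Without any measurement, every cluster remains a candidate.
function clustersInBestRegion(
    clusters: RegisteredCluster[],
    regionLatencies: Record<string, number>,
): RegisteredCluster[] {
    let best: string | undefined;
    for (const region of Object.keys(regionLatencies)) {
        if (best === undefined || regionLatencies[region] < regionLatencies[best]) {
            best = region;
        }
    }
    if (best === undefined) {
        return clusters.slice();
    }
    return clusters.filter((c) => c.region === best);
}
```

With this split, a new workspace deployment changes the cluster list but not the set of regions, so cached measurements stay valid across deployments.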

meysholdt commented 2 years ago

Not sure why this got labeled "platform". The enhancements would mostly need to happen in components owned by the meta team.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

bigint commented 2 years ago

This is still not yet fixed 🤔

From India it always chooses US clusters instead of EU.

chientrm commented 2 years ago

Nah. Just get the coordinates of the user via their IP address and pick the nearest server. Every server should be located in a city. AFAIK Gitpod is running on GCP. Moreover, many cloud providers, like Cloudflare Pages/Workers, already add the IP and lat/long to the HTTP request headers 🤭.
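
For comparison, the geographic approach could be sketched with the haversine great-circle distance. The region coordinates in the usage example are rough illustrative values, not Gitpod's actual cluster locations, and how the user's coordinates are obtained (GeoIP database, proxy headers) is left open.

```typescript
// Hypothetical sketch: pick the geographically nearest region.
interface GeoPoint {
    lat: number; // degrees, north positive
    lon: number; // degrees, east positive
}

// Great-circle distance in kilometres using the haversine formula.
function haversineKm(a: GeoPoint, b: GeoPoint): number {
    const R = 6371; // mean Earth radius in km
    const toRad = (deg: number) => (deg * Math.PI) / 180;
    const sinLat = Math.sin(toRad(b.lat - a.lat) / 2);
    const sinLon = Math.sin(toRad(b.lon - a.lon) / 2);
    const h = sinLat * sinLat
        + Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * sinLon * sinLon;
    return 2 * R * Math.asin(Math.sqrt(h));
}

// Return the name of the region closest to the user, if any.
function nearestRegion(user: GeoPoint, regions: Record<string, GeoPoint>): string | undefined {
    let best: string | undefined;
    let bestKm = Infinity;
    for (const name of Object.keys(regions)) {
        const km = haversineKm(user, regions[name]);
        if (km < bestKm) {
            bestKm = km;
            best = name;
        }
    }
    return best;
}

// Usage with illustrative coordinates (western Europe vs. central US):
const exampleRegions: Record<string, GeoPoint> = {
    eu: { lat: 50.45, lon: 3.82 },
    us: { lat: 41.26, lon: -95.86 },
};
const fromMumbai = nearestRegion({ lat: 19.08, lon: 72.88 }, exampleRegions); // "eu"
```

Geographic distance is only a proxy for network latency, since routes do not follow great circles, so this would trade measurement accuracy for simplicity.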

kylos101 commented 1 year ago

👋 @geropl reopening, perhaps something we can discuss to see if it can be included in an iteration early next year?
