readersclub / linkerd-lf

Introduction to Service Mesh with Linkerd by Linux Foundation - MOOC, EdX
Apache License 2.0

Chapter 11. Canary and Blue-Green Deployments #12

Open anitsh opened 3 years ago

anitsh commented 3 years ago

Chapter Overview

This chapter expands on Linkerd's reliability features to cover how traffic splitting can be used to deploy new versions of code in a way that minimizes the potential impact on user experience, in the form of "canary" or "blue/green" deployments.

We'll start by defining our terms. Then we'll learn about how Linkerd's traffic splitting functionality is designed and implemented. After that, we'll jump straight into the hands-on exercise sections where you'll use these features to roll out new code while lowering the potential impact of deploying bugs.

anitsh commented 3 years ago

Learning Objectives

By the end of this chapter, you should be able to:

anitsh commented 3 years ago

What Is a Blue-Green/Canary Deployment?

First, let's define our terminology. The phrases "canary" and "blue-green" (sometimes also called "red-black") all refer to the same general principle: a class of deployment strategies that decouple the deploy phase for new code, when the code is placed on the production servers, from the release phase, when that code actually serves user traffic. There are two benefits to doing this. First, it provides mechanisms for limiting the risk of accidentally shipping bad code, since the release process can be gradual (e.g. starting with just 1% of requests). Second, it eases the mitigation strategy if the new release is bad: the old code is still "there", and undoing the release can be nearly immediate, whereas a deploy can take significant time.

Unfortunately, the terminology can vary wildly between people and organizations in different ways. For example, some organizations define a "blue-green deploy" as an immediate 100% switch from old code to new code, and a "canary deploy" as any number other than 100%. Other definitions insist that the difference between canary and blue-green deploys is around the level of trust that you, the human, have in the new code. Still others distinguish "red-black" deploys from "blue-green" ones.

In this chapter, we will simply refer to the entire class of techniques as "canary deploys" and leave the debate over definitions to others. In other words, regardless of whether the new code is seeing 0%, 1%, 99.9%, or 100% of traffic, or is going through a smooth, gradual transition from 0% to 100% over the course of several minutes or hours, we will simply call this a canary deploy. The goal is the same, after all: to reduce the risk of new code by gradually introducing it to production traffic.

anitsh commented 3 years ago

How Does Traffic Splitting Work in Linkerd?

Canary releases are managed in Linkerd via traffic splitting. This feature allows you to distribute requests to different Kubernetes Service objects based on dynamically-configurable weights. While traffic splitting can work with arbitrary Service objects, the primary use case is to divide incoming traffic for a service between different versions of that service.

This traffic split functionality is controlled by Linkerd's TrafficSplit CRD. (The TrafficSplit CRD follows the specification defined in the Service Mesh Interface (SMI), which we saw in Chapter 2 of this course; it is one of several SMI APIs that Linkerd implements.) Creating a TrafficSplit resource allows us to control how Linkerd proxies traffic to the Services that the TrafficSplit references.

The TrafficSplit CRD is written in terms of Kubernetes Service objects. A TrafficSplit describes a central root or "apex" service, to which traffic is sent, and one or more backend services, which actually receive it, in proportion to the weights also specified in the TrafficSplit (note that the term "apex" is not used in the Linkerd or SMI docs, but we'll use it in this course to clarify which service we're referring to).

Note also that Service objects in Kubernetes do not necessarily have backing workloads. While this is rare for "normal" services, we'll make use of this feature quite a lot for the apex service of TrafficSplits—since the TrafficSplit causes traffic destined for the apex to actually be sent to the backend services, there is no reason for the apex to actually have a Deployment of its own!
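To make this concrete, here is a minimal sketch of what a selector-only apex Service with no backing Deployment could look like. This is an illustrative assumption, not the actual manifest from the training repo; the label and port values in particular are guesses.

    apiVersion: v1
    kind: Service
    metadata:
      name: web-apex              # hypothetical apex service
      namespace: emojivoto
    spec:
      selector:
        app: web-apex             # matches no pods, so the service has no endpoints
      ports:
      - name: http
        port: 80
        targetPort: 8080          # assumed port; the real manifest may differ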

anitsh commented 3 years ago

Using TrafficSplit: Part 1

In this first exercise, we will use the Emojivoto application and create two new Service resources. The apex service will not have an associated Deployment resource. The second service will be an "updated" version of Emojivoto's web service that adds some text to the top of the page.

When those two services are created, we will create a TrafficSplit resource that causes traffic sent to the apex service to be split between the original version of the web service and the updated version of the web service. (A diagram of these services appears in the course at this point; it is not reproduced here.)

In order to deploy the updated version of the web service, run:

kubectl apply -f https://raw.githubusercontent.com/BuoyantIO/emojivoto/linux-training/training/traffic-split/web-svc-2.yml

The file referenced in the kubectl command contains both the Service and Deployment resources for web-svc-2. First, we'll verify that the resources were deployed properly. To verify that the web-svc-2 service is running:

        List the Pod: kubectl get po --selector app=web-svc-2
        List the Service: kubectl get svc web-svc-2
        View the page: kubectl port-forward svc/web-svc-2 8080:80
        - Visit http://localhost:8080

If the Service and Deployment are running on the server, then the browser will show the Emojivoto home page, with some out of place text at the very top of the page.
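For reference, the manifest applied above contains roughly the following. This is a simplified sketch rather than the exact contents of web-svc-2.yml; the image tag, ports, and labels are assumptions, and the real file in the BuoyantIO/emojivoto repo may differ.

    apiVersion: v1
    kind: Service
    metadata:
      name: web-svc-2
      namespace: emojivoto
    spec:
      selector:
        app: web-svc-2
      ports:
      - name: http
        port: 80
        targetPort: 8080                       # assumed container port
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web-2                              # assumed deployment name
      namespace: emojivoto
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: web-svc-2
      template:
        metadata:
          labels:
            app: web-svc-2
        spec:
          containers:
          - name: web-svc
            image: buoyantio/emojivoto-web:v2  # placeholder tag for the "updated" version
            ports:
            - containerPort: 8080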

The next service that we create is the apex service, called simply web-apex. This time there will be no pods running and we won't be able to send any requests to the service because there are no endpoints:

        Run: kubectl apply -f https://raw.githubusercontent.com/BuoyantIO/emojivoto/linux-training/training/traffic-split/web-apex.yml
        Run: kubectl get svc -n emojivoto -o wide

In the output from the second command you will see all the services listed. Here, the web-apex service looks like a normal service, so let's dig a little deeper and look at its endpoints:

        Run: kubectl get ep

In this output, you should see that there are endpoints defined for all the services except the web-apex service. This confirms that there are no pods currently backing this service, which is consistent with the description of the apex service above.
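The output will look something like the sample below (the service names come from the Emojivoto application; the IP addresses and ages are made up for illustration):

    NAME         ENDPOINTS                           AGE
    emoji-svc    10.42.0.14:8080,10.42.0.14:8801     5d
    voting-svc   10.42.0.15:8080,10.42.0.15:8801     5d
    web-svc      10.42.0.16:8080                     5d
    web-svc-2    10.42.0.17:8080                     10m
    web-apex     <none>                              2m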

Before moving on to the next section, let's look at the stats for the services:

        Run: linkerd stat po -n emojivoto

While there is a pod that has been deployed for the web-svc-2, we can see that it is currently getting no traffic.

Now that the services are running, we'll create the TrafficSplit in the next section so that we can see how the traffic is routed to each service!

anitsh commented 3 years ago

Using TrafficSplit: Part 2

This section is focused on the TrafficSplit resource that we will deploy to the Emojivoto namespace in order to configure the Linkerd proxies to distribute traffic between web-svc and web-svc-2.

First, let's take a look at the TrafficSplit YAML and go through each of the important configuration pieces:

    apiVersion: split.smi-spec.io/v1alpha1
    kind: TrafficSplit
    metadata:
      name: web-svc-ts
      namespace: emojivoto
    spec:
      service: web-apex
      backends:
      - service: web-svc
        weight: 500m
      - service: web-svc-2
        weight: 500m

The relevant part of the configuration above is in the spec section:

        service: The apex or "root" service that clients use to connect to the destination application.
        backends: Services inside the namespace with their own selectors, endpoints and configuration (we'll call these "leaf" services in this chapter).
        - service: The name of the concrete service associated with a pod that can handle requests.
        - weight: The proportion of overall traffic that is distributed to this backend. The maximum value is 1000m, which you can also write simply as 1. In this configuration, roughly 50% of the traffic will go to each service; as you will see in the exercise (and in the sketch just below), a value of 750m is roughly 75% and 250m is roughly 25%. This "milli" notation follows the Kubernetes convention used for CPU and memory resource quantities.
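For example, shifting 75% of traffic to the new version and 25% to the old version would use a backends section like this (an illustrative sketch; the 75-25.yml file applied later in the exercise should be equivalent in spirit, though it may differ in detail):

    backends:
    - service: web-svc
      weight: 250m
    - service: web-svc-2
      weight: 750m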

Now that you understand each of the fields of the TrafficSplit definition, let's apply it and see it in action!

First, use kubectl to apply the TrafficSplit definition:

        Run: kubectl apply -f https://raw.githubusercontent.com/BuoyantIO/emojivoto/linux-training/training/traffic-split/web-svc-ts.yml

The linkerd stat command has a subcommand named trafficsplit that shows the stats for all the traffic splits that it is aware of. You can shorten the trafficsplit subcommand to ts:

        Run: linkerd stat ts

Since the vote-bot deployment is configured to send traffic to web-svc.emojivoto:80, we don't see any metrics for the traffic split yet. So, let's update the vote-bot deployment to send traffic to the web-apex service rather than web-svc. The file used in the kubectl command below changes the WEB_HOST environment variable in the vote-bot deployment so that traffic is sent to the web-apex service and the TrafficSplit configuration takes effect.
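Conceptually, the change is a single environment variable in the vote-bot container spec, roughly like this (a sketch of the relevant excerpt; the exact format of the value in vote-bot-update.yml is an assumption based on the web-svc.emojivoto:80 value mentioned above):

    # excerpt from the vote-bot Deployment's container spec
    env:
    - name: WEB_HOST
      value: web-apex.emojivoto:80   # was web-svc.emojivoto:80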

Next, update the vote-bot deployment:

        Run: kubectl apply -f https://raw.githubusercontent.com/BuoyantIO/emojivoto/linux-training/training/traffic-split/vote-bot-update.yml

After this change, the vote-bot pod will be replaced and the new pod will make requests to the web-apex service. We can verify this in a couple of ways. First, you can use the trafficsplit (ts) subcommand for linkerd stat that you just learned:

        Run: linkerd stat ts

The output will look similar to the table below and you can see that the web-apex service is the APEX service for the web-svc and web-svc-2 web services, which are LEAF services. The output also shows the weight distribution to each of the services. Be sure to remember this command because you are going to use it again.

    NAME        APEX      LEAF       WEIGHT  SUCCESS  RPS     LATENCY_P50  LATENCY_P95  LATENCY_P99
    web-svc-ts  web-apex  web-svc    500m    100.00%  0.8rps  9ms          46ms         49ms
    web-svc-ts  web-apex  web-svc-2  500m    100.00%  0.9rps  4ms          10ms         10ms


The second way to view the traffic is through the plain linkerd stat command. When we last ran this command, the pod associated with web-svc-2 was not receiving any traffic. Let's see the output now that the traffic split has been applied:

        Run: linkerd stat po -n emojivoto

When you ran this command in the last section there were no metrics for web-svc-2, and this time you can see that both pods associated with web-svc and web-svc-2 are handling requests.

The TrafficSplit definition set the weight to 500m for each of the services to evenly distribute the traffic. In the real world, you would start with a much lower weight like 1m or 100m for web-svc-2 to make sure that there are no errors. Then, as you gain confidence that the new code is running as expected, you can adjust the weights for each service so that, eventually, web-svc gets no traffic and web-svc-2 gets 100% of the traffic.
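In this exercise we apply pre-built YAML files for each weight change, but you can also adjust the weights in place by editing the TrafficSplit resource directly, for example:

        Run: kubectl edit trafficsplit web-svc-ts -n emojivoto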

Let's manually adjust the weights of both services by editing the TrafficSplit definition. Send 75% of the traffic to web-svc-2 and 25% of the traffic to web-svc:

        Run: kubectl apply -f https://raw.githubusercontent.com/BuoyantIO/emojivoto/linux-training/training/traffic-split/75-25.yml

Now, check the traffic split stats again:

        Run: linkerd stat ts -n emojivoto

In the output, you will see that the WEIGHT column matches the changes that you made, 750m for web-svc-2 and 250m for web-svc.

    NAME        APEX      LEAF       WEIGHT  SUCCESS  RPS     LATENCY_P50  LATENCY_P95  LATENCY_P99
    web-svc-ts  web-apex  web-svc    250m    100.00%  0.5rps  8ms          10ms         10ms
    web-svc-ts  web-apex  web-svc-2  750m    100.00%  1.2rps  7ms          12ms         19ms

Now, make one final change to the TrafficSplit definition to send all the traffic to web-svc-2 and none of the traffic to web-svc:

        Run: kubectl apply -f https://raw.githubusercontent.com/BuoyantIO/emojivoto/linux-training/training/traffic-split/100-0.yml
        Run: linkerd stat ts -n emojivoto

This time, you will see in the output that the WEIGHT for web-svc-2 is 1 and, eventually, all the stats for web-svc will have no value.

At this point, you should be familiar with the basics of splitting traffic in Linkerd. There is one more very important thing you should know: for simplicity, we've used a separate web-apex service in all of our examples. However, the apex service can also be one of the backends, and in fact this is a common usage! A TrafficSplit that names the same service as both the apex and one of the backends will route traffic destined for that service back to it, but only in the proportion given by its weight relative to the other backends. And this can be done dynamically, allowing you to "insert" a TrafficSplit on top of an existing service.

For example, rather than using web-apex, we could simply have used web-svc as the apex (and continued to use it, as well as web-svc-2, as a backend). The moment the TrafficSplit was created, existing traffic to web-svc would have followed the TrafficSplit's rules; and the moment it was removed, traffic to web-svc would resume as normal. Try experimenting with this idea with your existing TrafficSplit.
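A sketch of what that variation might look like (illustrative only; the 900m/100m weights are just an example):

    apiVersion: split.smi-spec.io/v1alpha1
    kind: TrafficSplit
    metadata:
      name: web-svc-ts
      namespace: emojivoto
    spec:
      service: web-svc          # the apex is also one of the backends
      backends:
      - service: web-svc
        weight: 900m            # existing version keeps most of the traffic
      - service: web-svc-2
        weight: 100m            # new version receives a small share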

Congratulations, you have successfully configured a TrafficSplit resource and rolled out a new version of the web service. In practice, Linkerd's traffic split functionality can be integrated with continuous integration and continuous deployment systems to automate this rollout process. We'll introduce you to those concepts in the next section.

anitsh commented 3 years ago

Day 2 Operations: Automation

In this chapter, you've learned how the concept of canary releases can take the risk out of introducing new versions of your services to production environments. You've also learned the powerful primitive—the traffic split—that forms the basis of how canary releases can be implemented with Linkerd.

Traffic split gives us control, and Linkerd's metrics give us visibility. Wouldn't it be nice to tie these two things together, in an automated way? After all, the whole point of a canary release is to incrementally expose the new code to production traffic and validate that everything is working (and roll back if it's not)! Does that "incremental" aspect really require a human in the loop?

This is the idea behind progressive delivery: that by tying metrics and traffic splitting together, the release of new code can be done incrementally, safely, and in a fully automated way. Progressive delivery is a broad topic that we can only introduce in this course, but you are encouraged to explore projects like Flagger that build on top of Linkerd's metrics and traffic splitting features to perform progressive delivery (there is even a tutorial in the Linkerd documentation to help you out!).
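To give a flavor of what this looks like in practice, here is a rough sketch of a Flagger Canary resource driving a Linkerd traffic split for the Emojivoto web deployment. The field names follow Flagger's documented API, but treat this as an illustrative assumption rather than a tested configuration, and refer to the Flagger and Linkerd documentation for a working setup.

    apiVersion: flagger.app/v1beta1
    kind: Canary
    metadata:
      name: web
      namespace: emojivoto
    spec:
      provider: linkerd           # use Linkerd's SMI TrafficSplit for weight shifting
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web                 # assumed name of the web deployment
      service:
        port: 80
      analysis:
        interval: 30s             # how often Flagger checks metrics and shifts weight
        stepWeight: 10            # move 10% of traffic per step
        maxWeight: 50             # promote once the canary handles 50% successfully
        threshold: 5              # roll back after 5 failed checks
        metrics:
        - name: request-success-rate
          thresholdRange:
            min: 99               # require at least 99% success rate
          interval: 1m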

anitsh commented 3 years ago

Summary

In this chapter you learned about the concept of canary deployments and how they can be used to reduce the risk of releasing new code.

The exercises taught you how to implement a TrafficSplit resource in Linkerd by deploying a new version of the Emojivoto web service and gradually shifting more traffic to it until the original web service no longer received any traffic. In a real-world system, the TrafficSplit functionality can be integrated with a continuous integration or continuous deployment system to automate the gradual shift of traffic from one or more services to another.

In the next chapter, we'll review all that you have learned in this course and talk about where you can go with your new skills!