readersclub / linkerd-lf

Introduction to Service Mesh with Linkerd by Linux Foundation - MOOC, EdX
Apache License 2.0

Chapter 7. Getting "Golden Metrics" for Your Applications #8

Open anitsh opened 3 years ago

anitsh commented 3 years ago

Chapter Overview

In the last chapter, you learned how easy it is to deploy the Linkerd control plane using the CLI and collect metrics, not to mention secure your services, in just a few minutes.

In this chapter, we're going to take a detailed look at those metrics and learn what they mean using a sample application called Emojivoto.

In the first section, we'll review the concept of "golden metrics" that we discussed in earlier chapters of this course. Then, we'll go back to the hands-on work and deploy the Emojivoto application that we'll use for the remainder of the course. We'll use the Linkerd CLI and dashboard from the previous chapter to dig into the Emojivoto metrics offered by Linkerd. We'll also explore commands like tap and top that give us insight into live request streams.

Finally, we'll end the chapter by looking at how Linkerd's metrics can be used to extend the capabilities of your current observability framework or build the base for a new one.

anitsh commented 3 years ago

Learning Objectives

By the end of this chapter, you should be able to:

anitsh commented 3 years ago

What Are the "Golden Metrics" and Where Do They Come From?

Early in this course, we introduced the "golden metrics" (or "signals") as part of the conversation around the observability features that Linkerd offers. To review, the classic golden metrics for service health are:

        Latency
        Error rate
        Traffic volume
        Saturation

The value of Linkerd is not simply that it can provide metrics like these—after all, you could simply instrument the application code directly. Rather, the value of Linkerd is that it can provide these metrics in a way that is uniform across your application, and that requires no change to application code. In other words, no matter who wrote it, what framework it used, what language it was written in, and what it does, Linkerd can provide these metrics for your service (at least, if it speaks HTTP or gRPC!).

Let's examine the golden metrics in turn, and how Linkerd measures them.

anitsh commented 3 years ago

Latency

Latency is the time it takes to respond to a request. For Linkerd, this is measured as the time elapsed between the Linkerd proxy sending a request to an application and receiving a response. Because it can vary wildly across requests, the latency for a given time period is typically measured as a statistical distribution, and reported as the percentiles of this distribution. A full discussion of latency percentiles is outside the scope of this course, but Linkerd is able to report commonly used latency metrics such as p50, p95, p99, and p999, corresponding to the 50th, 95th, 99th, and 99.9th percentiles of the latency distribution of requests. The higher percentiles are called the "tail latencies", and are typically the salient metrics for describing the behavior of a system at scale.

anitsh commented 3 years ago

Error Rate

The error rate, as you might imagine, is the percentage of responses that are considered error responses. For Linkerd, this is measured by the HTTP status code: 2xx and 4xx responses are considered successes, and 5xx responses are considered failures. Perhaps optimistically, Linkerd reports success rate rather than error rate.

Note one subtlety here: while 4xx HTTP response codes signal client-side problems such as "the resource you requested was not found", they are correct responses on the server's part, not erroneous ones. Thus, Linkerd considers these requests successful: the server did as it was asked.

anitsh commented 3 years ago

Traffic Volume

Traffic volume is a measure of demand that is placed on a system. In the context of Linkerd, this is measured as a rate of requests, e.g. requests per second (RPS). Linkerd calculates this simply by counting the requests that it proxies to an application.

There is a subtlety here as well: since Linkerd can automatically retry requests, it provides two measures of traffic volume: actual (counting every request sent, including retries) and effective (counting requests as issued, without retries). If a client is issuing requests to a server with Linkerd in between, the effective count is the number of requests issued by the client; the actual count is the number of requests received by the server. For example, if the client issues 100 requests and Linkerd retries 10 of them once each, the effective count is 100 and the actual count is 110.

anitsh commented 3 years ago

Saturation

Saturation is a measure of the consumption of the total resources available to a service, e.g. CPU or memory. Like all service meshes, Linkerd has no direct mechanism to measure saturation. However, latency is often a good approximation. The Google SRE book says:

"Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation."

Thus, for the remainder of this course, we'll restrict ourselves to the three salient golden metrics: success rate, request rate, and latency.

One final note is that, while Linkerd can proxy any TCP traffic, these golden metrics are only available for services that speak HTTP or gRPC. This is because these metrics require "Layer 7", or protocol-level, understanding to compute. An HTTP stream has a notion of successful and unsuccessful requests; an arbitrary TCP byte stream does not (later in this chapter we'll see that Linkerd does provide TCP-level metrics, but they are necessarily much more limited).

With metrics under our belt, let's deploy the Emojivoto application and take a look at these metrics in action.

anitsh commented 3 years ago

Deploying Emojivoto

Everybody loves using emojis, but just which one is their favorite? That's what the Emojivoto application is for: it allows you and your friends (or enemies!) to vote on your favorite emojis.

For this course, we're going to use Emojivoto to learn about Linkerd. The code and application are available in the Emojivoto GitHub repository.

Emojivoto is a gRPC application that has three services:

        web: the frontend that users interact with
        emoji: provides API endpoints to list emoji
        voting: provides API endpoints to vote for emoji

There is also a fourth component, vote-bot, that generates simulated traffic to the application.

Let's first deploy the Emojivoto application to your cluster:

kubectl apply -f https://run.linkerd.io/emojivoto.yml

Note that we are not using Linkerd yet! After deploying, you should have four Deployment and three Service resources in the emojivoto namespace. List them with this command:

kubectl get all -n emojivoto

Now, open the Linkerd dashboard to see that the emojivoto namespace is displayed:

linkerd dashboard

Finally, let's make sure that the application works by voting for your favorite emoji. We'll need to forward the port from the host to the cluster:

kubectl -n emojivoto port-forward svc/web-svc 8080:80

Open http://localhost:8080 in your browser, click around, and vote for your favorite emoji!
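If you prefer to check from the terminal first, here's a quick sketch (it assumes the port-forward above is still running in another terminal):

# Expect the Emojivoto frontend to answer with a 200 status code.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080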

Note that this application has an intentional bug. Did you find it? One of the emojis will always return a 404 page. Hint: It's a delicious one. More on that later.

With Emojivoto running, let's add Linkerd to the mix.

anitsh commented 3 years ago

Adding Linkerd to Emojivoto

In Chapter 4, which covered the data plane and the Linkerd proxy, you learned that Linkerd is added to an application by injecting Linkerd's data plane proxies into the pods as sidecar containers. You also learned different ways to inject proxies into a service.
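As a quick reminder, one of those ways is the manual approach with linkerd inject. A sketch is shown below; we won't run it here, because this chapter uses the annotation-based approach instead:

# Manual alternative: pipe the existing manifests through linkerd inject
# (which adds the proxy-injection annotation) and re-apply them.
kubectl get deploy -n emojivoto -o yaml | linkerd inject - | kubectl apply -f -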

Let's put that knowledge into action by injecting the Emojivoto deployments with the Linkerd proxy.

First, annotate the emojivoto namespace:

kubectl annotate ns emojivoto linkerd.io/inject=enabled

This annotation is all we need to inform Linkerd to inject the proxies into pods in this namespace. However, simply adding the annotation won't affect existing resources. We'll also need to restart the Emojivoto deployments:

kubectl rollout restart deploy -n emojivoto

This command will restart the deployments in the emojivoto namespace, and, if all goes well, Linkerd will inject its data plane proxies into the pods. Let's verify this by listing the pods to make sure they are injected with the Linkerd proxy:

kubectl get po -n emojivoto

You should see 2/2 under the "READY" header. This means that each pod contains 2 running containers—the Linkerd proxy, and the Emojivoto component itself.
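If you want a more explicit confirmation than the READY column, here is a small sketch (it assumes the proxy container keeps its default name, linkerd-proxy):

# Print the container names in each pod; every pod should list linkerd-proxy.
kubectl -n emojivoto get po -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'

# Run Linkerd's data plane checks against the namespace.
linkerd check --proxy --namespace emojivoto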

At this point you can go back to your Emojivoto dashboard and click around just as before. Notice anything different? You shouldn't! Linkerd should have no functional effect on the application.

Let's take a moment to appreciate what we just did. We took a functioning application and, with a few simple commands and without any configuration or code changes, we added a service mesh to it. That's pretty amazing! And that's the power of Linkerd: no other service mesh makes it anywhere near this easy.

anitsh commented 3 years ago

Viewing Metrics in the Dashboard

With Linkerd injected, we should be able to view the metrics for the Emojivoto application in the Linkerd dashboard. Let's open the dashboard again with the command linkerd dashboard.

The first difference that you will notice is that the emojivoto item in the list of namespaces now shows 4/4 under the "Meshed" column.

Click the emojivoto link to see the details of the namespace, including an "octopus" graph that shows how the services are related to each other through their network connections. Remember this image because we're going to take a look at the same information using the CLI in the terminal.

image "Octopus" Graph Showing the Connections Between the Emojivoto Services

You can also see the golden metrics that we discussed earlier: p50, p95, and p99 latencies, the success/error rates of the services, and the request volume as Requests Per Second (RPS). From the lowered success rate, it looks like one of the services has some errors, but don't worry about that for now; we'll use Linkerd to debug those in an upcoming chapter.

image Golden Metrics for the Emojivoto Services

This Deployment-level information is the aggregate of the metrics for all the pods handling requests for the application. As you scroll down the page, you will see that the same metrics are available for the individual pods.

Let's make the display more interesting by increasing the number of replicas for the web Deployment:

kubectl scale deploy/web -n emojivoto --replicas=2

As soon as this command is executed, the dashboard will update itself and the web deployment will show 2/2 under the "Meshed" column. In addition, you will see another web pod under the Pods section.

image Emojivoto web Pods After Updating the Replicas

This update may seem like magic, but the way it works is straightforward! The dashboard queries the Kubernetes API server for information about the resources (Deployments, Pods, etc.) in the cluster. The dashboard logic also queries the linkerd-prometheus component (covered in Chapter 5: "The Linkerd Control Plane") to get the metrics for the services in the mesh.
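If you're curious what those Prometheus queries look like, here's a rough sketch of asking the bundled Prometheus for the same kind of data directly. It assumes the control plane exposes Prometheus as svc/linkerd-prometheus on port 9090 and that the proxy's latency histogram is named response_latency_ms_bucket; check the proxy metrics documentation for your Linkerd version before relying on these names:

# Forward the control plane's Prometheus to localhost.
kubectl -n linkerd port-forward svc/linkerd-prometheus 9090:9090 &

# Ask for the p99 latency of the web deployment over the last minute.
curl -sG http://localhost:9090/api/v1/query --data-urlencode \
  'query=histogram_quantile(0.99, sum(irate(response_latency_ms_bucket{namespace="emojivoto", deployment="web"}[1m])) by (le))'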

The last thing we'll look at in this section is the TCP-level metrics that Linkerd provides. At the bottom of the page for the emojivoto namespace, the number of connections is displayed along with the number of bytes read and written for each of the pods.

image TCP (Layer 4) Metrics for the Emojivoto Pods

The metrics at the TCP layer (Layer 4) are more sparse than the metrics at Layer 7—as we discussed earlier, there is no notion of a request in an arbitrary TCP byte stream, for example. Still, these metrics can be useful when debugging connection level issues for an application.

In the next section, we'll continue to explore the dashboard and look at the tap functionality which allows us to see traffic in real time.

anitsh commented 3 years ago

Linkerd Dashboard Metrics: Real-time Traffic with tap

So far, we've used the dashboard to get aggregate performance metrics for the services in the Emojivoto application. Now, let's go a level lower and use the dashboard to watch traffic in real time with the tap functionality that Linkerd offers.

In the dashboard, we can see that the voting service has a success rate that is less than 100%. Let's use tap to look at the requests to the service to see if we can figure out what's going on.

image Success Rate as Displayed in the Dashboard for the voting Service

Click the voting link in the deployments list to drill down into the details. The first thing you will see is a diagram showing the relationships between the voting deployment and the other deployments in the application.

image Diagram of Connections and Traffic for the voting Service

Just below the diagram, you will see the Live Calls tab, which shows the real-time calls being made to the voting service! As each call comes in, the rows in the table are updated with high-level information about the request, including the HTTP status of the response.

Before we go into detail about those live calls, click on the Route Metrics tab to see a table of routes for the voting service and the metrics for each one. In this case, there is only one route, named "Default", which is created for every service. In the next chapter, we'll cover service profiles and how adding them to your application affects the display of this tab. For now, it's enough to know that this tab exists.

image Live Calls for the voting Service

Now that you know how to find the live calls in the dashboard, let's see if we can find one of the failing calls and use the tap functionality in the dashboard. Once you see a request for the path /emojivoto.v1.VotingService/VoteDoughnut, click the microscope icon to the right of it to go to the Tap page (did you find this bug when you were clicking around in the application?).

The Tap page contains a form with several fields that have been pre-populated based on the link for the particular request that you clicked. In this case, the path, namespace, and resource fields have been populated. There is also output that displays the current tap query that is being run.

image Tap Page with Pre-Populated Fields and Current Tap Query

Click the button labeled "Start" at the top of the page to begin to tap the requests for the /emojivoto.v1.VotingService/VoteDoughnut path of the Emojivoto voting service. After a few seconds, the table will begin to populate with the incoming requests for the VoteDoughnut path. Click the arrow on the left side to see a dialog with the request information.

image Requests Collected by tap for the voting Service

image Details of a Failed Request

So, that's how you use tap in the Linkerd dashboard! Go ahead and change the values in the form fields and use a different query to see different requests. For example, if you remove the /emojivoto.v1.VotingService/VoteDoughnut value from the "Path" field and set the "To Resource" field to "deployment", when you click the "Start" button, you will see all the traffic that is sent from the web service. Do you see any different services receiving traffic from the web service?

Now that you know how to use tap to see traffic metrics for a service, let's see how those metrics are used by looking at the Grafana dashboards that are included with Linkerd.

anitsh commented 3 years ago

Displaying the Metrics in the Dashboard and Grafana

Grafana is an open source tool for building and displaying dashboards from time-series data. Linkerd ships with a set of Grafana dashboards that add an additional level of observability for applications deployed to Kubernetes.

You may have already noticed the Grafana icon while navigating through the dashboard. Since we've already used the web and voting services, we'll use the emoji service as the example for this exploration of the Grafana charts.

In the emojivoto namespace of the Linkerd dashboard, click the Grafana icon in the far right column in the "emoji" row to open the Grafana dashboard for the emoji deployment. The graphs on these pages show the time series data for the metrics that are displayed in the Linkerd dashboard. In this case, you're looking at the performance, over time, of the emoji service.

image Grafana Dashboard for the emoji Service

The charts on the dashboard include our standard set of golden metrics:

        Success rate
        Request rate
        Latencies

The ability to see the graph of the golden metrics over time is a very powerful tool for understanding the performance of your application. Viewing these metrics as time series allows you to see, for example, how a service performs when traffic load increases, or how one version of a service compares to another version when an update is made to add features or fix bugs.

The great thing about the Grafana dashboards is that you don't have to do anything to create them. Linkerd uses dynamic templates to generate the dashboards and charts for every Kubernetes resource that is injected with the Linkerd proxy and part of the service mesh.

That last point is significant, so let's look at an example to illustrate what it means. In the top left corner of the Grafana dashboard, click the link that reads "Linkerd Deployment" to open the list of available dashboards.

image Linkerd Deployment Navigation Link

image Dialog Showing List of Available Dashboards

Click the "Linkerd Pod" dashboard to see the charts for a pod associated with the emoji deployment. The dashboard that is displayed shows the same golden metrics for an individual pod, and this is different from the Deployment dashboard, because the Deployment dashboard shows the aggregated metrics for all the pods associated with the Deployment.

image Charts for a Pod Associated with the emoji Deployment
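If you ever want to reach these Grafana dashboards without going through the Linkerd dashboard, a port-forward works too. A sketch, assuming the bundled Grafana is exposed as svc/linkerd-grafana on port 3000 (the default in installations of this era):

# Forward Grafana to localhost, then browse to http://localhost:3000
kubectl -n linkerd port-forward svc/linkerd-grafana 3000:3000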

anitsh commented 3 years ago

Revisiting the Linkerd CLI tap Command

The Linkerd dashboard is powerful because of the sheer amount of metric data it displays within the browser-based interface. But not everyone wants to use a browser, which is why the Linkerd CLI offers the same functionality in your terminal!

In the last chapter, we used the linkerd tap command to see the traffic between the components in the Linkerd control plane. In this chapter, we used the dashboard to tap the traffic from the web service to the other services. Let's revisit the tap command and run the same query that we ran in the dashboard to see the real-time traffic in your terminal.

Begin by tapping the traffic from deploy/web:

linkerd tap deployment/web --namespace emojivoto --to deployment/voting --path /emojivoto.v1.VotingService/VoteDoughnut

It may take a minute, but you will eventually see the output of a request that results in an error. This is a very specific query that shows the level of granularity you can achieve with the tap command. It translates to: show me all the traffic in the emojivoto namespace from the web deployment to the /emojivoto.v1.VotingService/VoteDoughnut path of the voting deployment.

We can get even more detail from tap by using the -o json flag to specify that the output should be in JSON format. If the command above is still running, press Ctrl+C to break it, and then run the same command, but this time, add the -o json flag to see more verbose output.

So, tap traffic from deploy/web and output as JSON:

linkerd tap deployment/web --namespace emojivoto --to deployment/voting --path /emojivoto.v1.VotingService/VoteDoughnut -o json

You can see that the JSON output is much more verbose, because tap prints several pieces of information about each request, including:

        The HTTP method
        The direction of the traffic, relative to the resource which is being tapped (deploy/web, in this case)
        The HTTP headers

Take a close look at the output of a request. Do you see anything else interesting in there?
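One way to explore that JSON without scrolling is to pipe it through jq. A sketch, assuming jq is installed (field names vary between Linkerd versions, so inspect your own output):

# Print the top-level field names of each tap event as it streams in.
linkerd tap deployment/web --namespace emojivoto -o json | jq 'keys'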

Let's run one more query with tap that is less focused, just like the one that we ran in the dashboard. Tap all the traffic from the web deployment:

linkerd tap deploy/web -n emojivoto

Can you spot the ways that this query is different from the query we ran before? The command is much shorter, because we have removed the --to and --path flags and their arguments. This "widens" the search parameters, so the output will show all traffic to and from the web deployment. The output should include traffic between the web and emoji services as well as the web and voting services.

The other, more subtle, way that the query is different is that we used the -n flag instead of --namespace to specify the emojivoto namespace. This shorthand syntax will save you from typing --namespace every time.

You can see the direction of the traffic based on the "src" and "dst" fields in each line of output. Try running the query again with the -o json flag to see the output in JSON format and see if you can discover the direction of the traffic for a given request.

anitsh commented 3 years ago

Sorting Real-time Traffic with the top Command

In the last section, you learned how to display traffic in real time, in your terminal with the tap command. The linkerd top command presents the same information, but in the same format as the Unix-based top command. In other words, linkerd top shows traffic routes sorted by the most popular paths. Let's go back to the emojivoto deployments to see an example.

Use linkerd top to view traffic sorted by the most popular paths:

linkerd top deploy/web -n emojivoto
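The top command accepts much the same scoping as tap, so you can point it at other workloads as well. For example, a sketch that watches the most-requested paths on the voting deployment:

# Watch the most popular paths hitting the voting deployment.
linkerd top deploy/voting -n emojivoto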

anitsh commented 3 years ago

Using the Linkerd stat Command to Display the Metrics

Let's say that you want to see the golden metrics (latencies, success/error rates, and requests per second) that you saw in the dashboard, but this time in the terminal. The linkerd stat command will do just that for you. Let's try it out by getting the metrics for all the deployments in the emojivoto namespace:

linkerd stat deploy -n emojivoto

Any time you want an up-to-date snapshot of the performance of the services in your application, you can use linkerd stat to get these metrics. If you want to go a bit deeper and get the number of bytes written and read, add the -o wide flag to get those TCP-level details. Either way, the number of TCP connections is always displayed. Give it a try now.
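For example, here is the wide form of the same query (a sketch):

# Include TCP read/write byte rates alongside the golden metrics.
linkerd stat deploy -n emojivoto -o wide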

In the last section, we started with a very specific query and then widened it to see the output of the tap command. The stat command we ran in this section starts off very wide by querying for the metrics of all the deployments in the emojivoto namespace. So, let's narrow the query to focus on the traffic from the web deployment to the emoji deployment:

linkerd stat -n emojivoto deploy/web --to deploy/emoji

The output is similar to the output from the command above, but there is only one row, and the success rate is now 100%. Why do you think that is? Let's take a look at the traffic between the web and voting deployments to investigate. Use the narrowed stat query to view the metrics:

linkerd stat -n emojivoto deploy/web --to deploy/voting

Again, there will be one row of output, and the metrics are different. Most notably, the success rate is less than 100%. From this output, we can infer that the success rate shown for the web deployment in the namespace-wide view is the aggregate of the responses from the voting and emoji services. To test this hypothesis, let's run one more query to see only the traffic from the web deployment to all the other deployments in the namespace.

Use the stat command to view the metrics for traffic to all deployments in the emojivoto namespace that comes from the web deployment:

linkerd stat -n emojivoto deploy --from deploy/web

NAME     MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   TCP_CONN
emoji       1/1   100.00%   1.6rps           1ms           3ms           3ms          1
voting      1/1    89.58%   0.8rps         350ms         395ms         399ms          1

Here we can see that there are, in fact, errors coming from the voting deployment, and none from the emoji deployment. In short, we have combined the last two commands to show us the output in one tidy display.

The stat command is a powerful tool with many configuration options. Run linkerd stat -h to see all the available ways that you can run stat and put together a few of your own commands to see the output from the services in the emojivoto and linkerd namespaces.
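A couple of starting points for that exploration, sketched below:

# Golden metrics for every meshed deployment in the linkerd namespace.
linkerd stat deploy -n linkerd

# Aggregate metrics for the emojivoto namespace as a whole.
linkerd stat ns/emojivoto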

As you may have noticed in the output above, the Linkerd control plane is not only "aware" of the services that are meshed, it is also aware of which services communicate with each other. We'll learn how that works in the next section.

anitsh commented 3 years ago

Service Topology with the Linkerd edges Command

Do you recall the "octopus" graph from the earlier section that shows the connections between the services?

image "Octopus" Graph Showing the Connections Between the Emojivoto Services

It's possible for Linkerd to generate this information because the proxy metrics contain information about the endpoints with which they communicate. Linkerd can compile this information into a live graph of connections between services.

In the last section when you ran the stat command, you saw the metrics displayed for traffic between web and emoji as well as web and voting. The Linkerd edges command is a simple way to see that information in the terminal, so let's try it out. Get the edges for all the deployments in the emojivoto namespace:

linkerd edges -n emojivoto deploy

The output may surprise you a bit, because you can also see the connections from the linkerd-prometheus deployment in the linkerd namespace. This shows that linkerd-prometheus is connecting to the proxies to scrape the metrics from them so that we can use them!

In each of these examples, we've changed the perspective of the commands with different parameters. The view of edges between the deployments is great for a high level view of which services communicate with each other, but sometimes it's nice to have a more granular view of the system. We can zoom in on the service graph by looking at the edges between pods rather than deployments, like this:

linkerd edges -n emojivoto po

If you still have the web deployment scaled to two replicas, then you will see output like this:

SRC                         DST                       SRC_NS      DST_NS      SECURED
vote-bot-7958f5bdbb-gw2kr   web-6468f9d579-8b4vg      emojivoto   emojivoto   √
web-6468f9d579-8b4vg        emoji-7fb6f4469-xf9f7     emojivoto   emojivoto   √
web-6468f9d579-8b4vg        voting-7fd748d5db-6jtln   emojivoto   emojivoto   √

In this output, there are four unique pods, two of which are web pods, one of which is an emoji pod, and one of which is a voting pod. This is consistent with the current replica settings for our deployments, and it shows much more detail about which pods have edges. Can you imagine what this would look like in an environment with many replicas of many deployments?

Just like the stat and tap commands, you can specify the -o json flag to write the output as JSON. This makes it very easy to integrate with an external system that can consume the JSON.
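For example, a sketch that pretty-prints the deployment-level edges (assumes jq is installed):

# Emit the edges as JSON and pretty-print them.
linkerd edges -n emojivoto deploy -o json | jq '.'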

anitsh commented 3 years ago

Summary

This chapter was your introduction to using Linkerd to view the golden metrics using the Emojivoto demo application. As you learned early on, the golden metrics collected by Linkerd are: latencies, success/error rates, and traffic as requests per second (RPS).

Once the application was deployed to the cluster, you added the "linkerd.io/inject: enabled" annotation to the emojivoto namespace, to configure the proxy-injector component to inject the Linkerd proxy into the pods. Then, you restarted the emojivoto deployments to ensure that the pods were injected with zero downtime to the Emojivoto application.

With that, the Emojivoto application was properly "meshed", and you learned how to use the Linkerd and Grafana dashboards to view the golden metrics in your browser. Then, you learned how to view the same data in your terminal using the Linkerd CLI with the tap, top, edges, and stat commands.

All of these commands are very powerful and you are encouraged to explore each of them on your own to learn more about the additional parameters they offer and different output that you can collect.

In the next section we're going to learn how to get per-route metrics using service profiles. By creating a ServiceProfile for a Kubernetes service, we can specify the routes available to the service and collect individual metrics for each route.