readersclub / linkerd-lf

Introduction to Service Mesh with Linkerd by Linux Foundation - MOOC, EdX

Ch1 Introduction #2

Open anitsh opened 3 years ago

anitsh commented 3 years ago

At the heart of the service mesh is the key functionality that Linkerd provides around observability, security, and reliability.

Linkerd's observability functionality involves automatically collecting metrics, or telemetry, about services, and using that telemetry to infer the state or health of those services.

Linkerd's security functionality involves using mutual TLS (mTLS) to create secure lines of communication between services, following the model of "zero-trust" security.

Linkerd's reliability functionality involves using retries, timeouts, load balancing, and traffic shifting to ensure that your application is stable and fault tolerant in the face of partial failures.

The value of the service mesh is that it provides these features at the platform layer rather than at the application layer, and that distinction is what the history of the service mesh is all about.

anitsh commented 3 years ago

Learning Objectives

By the end of this chapter, you should be able to:

anitsh commented 3 years ago

What Is a Service Mesh and When Is It Useful?

A service mesh is an infrastructure layer that adds security, reliability, and observability features to a cloud-native application. Crucially, it adds these features at the platform layer, independent of the application itself. This means that, ideally, the application doesn't even need to be aware that the service mesh is there! It also means that these features are provided uniformly across the application: regardless of the language or libraries each service is written in, the service mesh provides the same set of features.

In practice, a service mesh is typically implemented as a set of proxies that are deployed alongside the application, called the data plane, as well as a set of controlling logic deployed outside the application, called the control plane. The data plane observes and manages the communication between services, and the control plane provides the operator with the API and UI for managing those proxies as a whole.
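
To make the data plane concrete, here is a toy sketch of what a sidecar proxy does, written in Python purely for illustration (Linkerd's actual proxies are written in Rust, and the upstream address and metric names here are hypothetical): it sits in front of a service, forwards each request, and records telemetry along the way.

```python
# Toy "data plane" proxy: forward each request to the service it fronts,
# recording request counts, errors, and latencies as it goes. This is an
# illustration of the idea only, not how Linkerd is implemented.
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://127.0.0.1:9000"  # hypothetical address of the real service
metrics = {"requests": 0, "errors": 0, "latencies_ms": []}

class SidecarProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        metrics["requests"] += 1
        start = time.monotonic()
        try:
            # Forward the request to the application service.
            with urllib.request.urlopen(UPSTREAM + self.path, timeout=5) as resp:
                status, body = resp.getcode(), resp.read()
            self.send_response(status)
            self.end_headers()
            self.wfile.write(body)
        except OSError:  # connection failures and timeouts
            metrics["errors"] += 1
            self.send_response(502)  # report the upstream failure
            self.end_headers()
        finally:
            # Telemetry is collected without the application knowing.
            metrics["latencies_ms"].append((time.monotonic() - start) * 1000)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), SidecarProxy).serve_forever()
```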

The features that the service mesh provides are all accomplished by measuring and manipulating the traffic between services. For example, the service mesh can provide mutual TLS between services by transparently initiating and terminating TLS for calls between services. (We'll discuss this in detail later in this chapter.) Likewise, the service mesh can provide "golden signals" such as success rates and latencies for services by measuring the traffic to each service. (We'll also discuss this in detail later in this chapter.)

Because a service mesh works by operating and measuring the traffic between services, it is only really useful in a microservice application! Furthermore, many of the service mesh's features are designed to help in the case of synchronous communication like HTTP or gRPC calls between services. If your application is a monolith, or communicates purely via Kafka or another distributed queue, then the service mesh will not provide a lot of value.

The very first service mesh was Linkerd. In the next section, we'll look at the history of this project and how the service mesh idea came about.

[Image]

The above image illustrates a distributed application without a service mesh for observability and security. We can see that there are services communicating with each other, but we don't know anything about that communication.

anitsh commented 3 years ago

History of the Service Mesh

The origin of the service mesh is rooted in the technical challenges faced by early adopters of large microservices deployments. As "web-scale" companies such as Twitter, Netflix, and Google scaled out their software to handle huge volumes of traffic, they gravitated towards a microservices architecture: rather than one monolithic service, the application was "decomposed" into many different services, each of which could be scaled, managed, and developed independently.

This introduction of microservices had the effect of dramatically increasing internal network calls: each service needed to communicate with other services in the application over the network, often synchronously. (In networking terminology, this type of traffic is sometimes called "east-west" traffic, to differentiate it from the "north-south" traffic that flows into a cluster from outside and down through the application to the data store.)

To manage this communication, each company independently created a dedicated library: Netflix developed Hystrix, Google developed the Stubby libraries, and Twitter developed a library called Finagle. These libraries handled the communication between services, adding layers of instrumentation and control and allowing this communication to be monitored and managed as a first-class part of the operational environment.

While these libraries provided a critical solution to the challenge of managing microservice communication, they had some downsides. Each library was specific to a language or runtime, making it difficult to handle polyglot microservices. Additionally, because the libraries were linked into the application, upgrading their functionality required redeploying every service that used them. Finally, these libraries were often invasive: a service developer had to ensure that every call to other microservices was made through the library.

The service mesh started as a way to address these challenges, and the concept came from two key insights. First, if this core functionality were made available in proxy form rather than library form, it could be added in a way that was transparent to the application and independent of the language or framework the application was written in. Of course, this would require deploying and managing many proxies, which is no small feat!

The second key insight was that the rise of containers and container orchestrators made this operational task feasible: the proxy could be packaged in a container, and the orchestrator used to deploy it uniformly across the application. In short, the push to microservices made the service mesh desirable; the advent of container orchestration made the service mesh feasible.

These two insights gave rise to Linkerd. Linkerd 1.x was introduced to the world in 2016 and was built directly on Twitter's Finagle library. In 2017, the Linkerd project was donated to the Cloud Native Computing Foundation (CNCF) to become the fifth open source project hosted by the foundation, alongside Kubernetes, Prometheus, OpenTracing, and Fluentd.

A year later, in 2018, Linkerd shed its Finagle heritage and was completely rewritten in Rust and Go, to form the dramatically faster and simpler Linkerd 2.0. This modern version of Linkerd will be the topic of this course.

anitsh commented 3 years ago

What Does Linkerd Actually Do?

[A note on terminology: for the remainder of this course, we'll be using the terms "microservice" and "service" interchangeably. Our general assumption is that your application comprises multiple services, whether "micro" or not.]

Linkerd provides a lot of features, which fall into three basic categories: observability, security, and reliability. These three categories make up the key "value props" of a service mesh.

While these features are all different, the underlying mechanics are the same. The proxies in Linkerd's data plane implement the features by controlling the communication between application services. Linkerd's control plane, in turn, coordinates the behavior of these proxies, and allows the operator to control and monitor the mesh and application via CLI, API, and web UIs.


anitsh commented 3 years ago

Observability

The term "observability" has become a common part of any conversation that involves distributed systems and applications. But what exactly does it mean? Wikipedia sums it up quite nicely in its definition:

"... observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs."

With this definition in mind, we can derive that a service mesh uses the external outputs of the services in a distributed application to infer the state of those services. One common characterization of the operational state of a service is the concept of "The Four Golden Signals", as popularized by Google. Those signals include:

Latency: The length of time it takes to respond to a request. Since each request can have a different latency, the overall latency for a service is usually characterized by percentiles of the distribution of latencies across all requests: P50, P95, P99, etc. For example, the P50 of a service is the 50th-percentile, or median, latency of its response times, i.e. the latency at or below which 50% of responses in the time period completed. (A short percentile sketch follows this list.)

Traffic: The rate of requests being sent to the service, often displayed as requests per second (RPS).

Errors: The proportion of requests that fail, measured as a percentage of the overall number of requests (often reported as its complement, the success rate).

Saturation: The amount of resource capacity currently consumed by the application.

By virtue of where it sits in the stack, Linkerd is able to measure and report these signals without changes to the application code, simply by observing the traffic to and from a service.
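
To make the percentile idea concrete, here is a small Python sketch (illustrative only, with made-up sample latencies) that computes nearest-rank percentiles the way a dashboard might summarize a service's response times:

```python
# Nearest-rank percentiles over a window of observed request latencies.
def percentile(latencies_ms, p):
    """Latency at or below which p% of requests in the window completed."""
    ordered = sorted(latencies_ms)
    rank = -(-len(ordered) * p // 100)  # ceil(n * p / 100)
    return ordered[max(rank - 1, 0)]

# Hypothetical latencies (ms) for one service over a time window.
window = [12, 15, 11, 300, 14, 18, 13, 250, 16, 17]
print("P50:", percentile(window, 50), "ms")  # median: 15 ms
print("P95:", percentile(window, 95), "ms")  # tail: 300 ms
print("P99:", percentile(window, 99), "ms")  # tail: 300 ms
```

Note how the median looks healthy while the tail percentiles expose the handful of slow requests; this is why latency is reported as a distribution rather than an average.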

Later in this course, we devote an entire chapter to understanding Linkerd's observability features.


anitsh commented 3 years ago

Security

Security is a crucial consideration for any application. For cloud native applications especially, there are some particular concerns:

The application may run in a shared resource environment, like a cloud provider, where there is no direct control of the underlying hardware or network.

The application may transmit sensitive data, including Personally Identifiable Information (PII), between services.

The application may be subject to regulatory requirements around confidentiality of data at rest and in transit.

There may be other internal requirements around security best practices.

One increasingly common approach to communication security in a cloud environment is the "zero-trust" approach. While a full treatment of zero-trust security is outside the scope of this class, the core goal is to shrink the security boundary of the application to as small and granular a level as possible. For example, rather than having a firewall around a datacenter that enforces the security of incoming traffic, each application in the datacenter might enforce this itself. This zero-trust approach is a natural fit for cloud environments, where the underlying hardware and network infrastructure is not under your control.

The Linkerd security model follows the zero-trust approach by providing transparent mutual TLS communication between services. Mutual TLS (mTLS) is a form of transport security that provides both confidentiality and authentication of communication. In other words, not only is the communication encrypted, but the identity is validated on both sides of the connection. Linkerd implements this at the level of individual Kubernetes pods, allowing each pod to create its own security boundary.

There's a lot more to say about security, and we will cover it in greater detail later in the course. For now, it's enough to know that Linkerd will automatically encrypt and decrypt the communication between your services when it is responsible for handling the traffic, will authenticate the identity of both sides, and will do so in a way that requires no configuration on your part: it's on by default.

One final benefit of this feature is that the business logic in the services no longer has the burden of managing certificates and TLS connections. That means less code to write!

The figure below contrasts two services communicating with and without mutual TLS. Without mTLS, the traffic between the services is in plaintext and can be observed by anyone on the same network. With mTLS enabled, each service verifies the identity of the other using certificates generated from a common trust root. Once the services have verified each other, they encrypt the traffic, making it unreadable by anyone other than the two services.

[Image]
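
To illustrate what "mutual" adds to ordinary TLS, here is a Python sketch of the two sides' TLS configuration. It is conceptual only: Linkerd's proxies do this transparently, and the certificate file names below are hypothetical. The key point is that both sides load a certificate and both verify the peer against the same trust root.

```python
# Mutual TLS at the socket level: each side presents a certificate issued
# by a shared trust root, and each side verifies the other's certificate.
# File names are hypothetical; Linkerd manages these certificates for you.
import ssl

# Server side: present a certificate AND require one from the client.
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
server_ctx.load_cert_chain("server.crt", "server.key")
server_ctx.load_verify_locations("ca.crt")    # the common trust root
server_ctx.verify_mode = ssl.CERT_REQUIRED    # reject unauthenticated clients

# Client side: verify the server AND present a certificate of its own.
client_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
client_ctx.load_verify_locations("ca.crt")
client_ctx.load_cert_chain("client.crt", "client.key")

# A connection wrapped with these contexts succeeds only if both sides
# prove an identity signed by the shared root; all traffic is encrypted.
```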

anitsh commented 3 years ago

Reliability

Every distributed system must deal with the concept of failure: components can die, networks can "partition" (lose connectivity between nodes), and so on. Generally speaking, a reliable application is one that can successfully serve responses even in the presence of partial failures of some of its components or underlying platform.

Linkerd provides several mechanisms for automatically enhancing the reliability of applications:

Retries: Linkerd can be configured to automatically retry requests that have failed.

Timeouts: When a service takes too long to reply, Linkerd uses a configurable timeout to send an error back to the client instead of waiting indefinitely for the service to respond. (A sketch of both behaviors follows this list.)

Load balancing: Linkerd uses an exponentially weighted moving average (EWMA) algorithm to load balance requests across the instances of a service, based on latency; a simplified sketch follows this list. We'll explore this in detail later. The main idea to take away is that traffic is distributed across all instances of a service according to the latency of each instance's recent responses, thereby ensuring that no one instance receives more requests than it can handle.

Traffic shifting: Linkerd provides tools that allow operators to employ sophisticated deployment strategies such as canary releases and blue-green deploys. These techniques can help reduce the risk of introducing new code into a production environment.
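
To see what retries and timeouts mean in practice, here is a client-side Python sketch of the behavior. It is purely illustrative: Linkerd applies this logic inside the proxy, with no application changes, and the timeout and retry values below are made up.

```python
# Retry a failed call a bounded number of times, failing fast on timeout.
import time
import urllib.request

def call_with_retries(url, timeout_s=2.0, max_retries=3):
    """Return the response body, retrying transient failures."""
    for attempt in range(max_retries + 1):
        try:
            # Timeout: give up on a slow response instead of waiting forever.
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                return resp.read()
        except OSError:  # covers connection errors and timeouts
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the failure
            time.sleep(0.1 * 2 ** attempt)  # back off before retrying
```

And here is a minimal sketch of latency-aware load balancing using an EWMA combined with "power of two choices" sampling, the general technique Linkerd's balancer is based on. The smoothing factor and selection details are simplified assumptions, not Linkerd's actual implementation.

```python
# Keep a smoothed latency estimate per endpoint; prefer the faster of two
# randomly sampled endpoints, so slow instances naturally get less traffic.
import random

ALPHA = 0.3  # smoothing factor: higher weights recent samples more heavily

class Endpoint:
    def __init__(self, name):
        self.name = name
        self.ewma_ms = 0.0  # smoothed latency estimate

    def record(self, latency_ms):
        # EWMA update: blend the newest sample with the running estimate.
        self.ewma_ms = ALPHA * latency_ms + (1 - ALPHA) * self.ewma_ms

def pick(endpoints):
    # "Power of two choices": sample two endpoints, take the faster one.
    a, b = random.sample(endpoints, 2)
    return a if a.ewma_ms <= b.ewma_ms else b

pool = [Endpoint("pod-a"), Endpoint("pod-b"), Endpoint("pod-c")]
pool[0].record(20)
pool[1].record(200)  # a slow instance
pool[2].record(35)
# Faster pods win the comparison more often, so they receive more traffic.
print(pick(pool).name)
```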

In our upcoming chapter on reliability, we'll see exactly how to make use of each of these features.