linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0
10.59k stars 1.27k forks

Kafka metrics support #2214

Open grampelberg opened 5 years ago

grampelberg commented 5 years ago

What problem are you trying to solve?

HTTP-based traffic is only one kind of traffic in modern applications. Many applications use message queues such as Kafka. Metrics for consumers, producers, and messages are just as critical to application health as HTTP requests and responses.

How should the problem be solved?

Implement a Kafka codec that allows the Linkerd proxy to introspect Kafka's protocol and provide metrics for the communication between consumers and producers. This should surface as a CLI command, a dashboard, and a visualization of the topology between message consumers and HTTP actors.

Related Issues

abhishekchauhn commented 5 years ago

Hi, I am interested in working on this issue and contributing to it as a GSoC project. Although I am new to the Go and Rust programming languages, I do have prior experience working with Kafka and a good understanding of its architecture and operation.

It would be great if I could get some help getting started.

grampelberg commented 5 years ago

@cAbhi15 you bet! The first step would be to start getting to know the codebase. This one is primarily on the rust side of things, so you'll want to check out how the codecs work with tokio. You'll also want to dig into the metrics and how that all works! @hawkw have any good pointers for good first steps?

Before you end up implementing anything or throwing up a PR, it'd be great to see a design doc. I can work with you on that and the structure of it.

hawkw commented 5 years ago

It's probably worth looking into previous attempts at writing a futures-based async Kafka codec in Rust; it looks like there's currently a prototype implementation but it doesn't seem complete or production-ready.

The codec implementation is probably going to comprise the majority of the work on this project --- integrating it with the proxy and adding metrics is fairly simple in comparison. To avoid having to do the whole thing from scratch, we might want to determine whether that prototype is something we can use, perhaps involving a conversation with the maintainers about what's necessary before the library is ready for production use?

A familiarity with async programming in Rust is going to be vital, so learning Rust & getting to know the proxy codebase are probably good first steps if this isn't something you're already familiar with. Any proxy issues with the good first issue tag might be good starting points.

zaharidichev commented 5 years ago

@hawkw Is the HTTP codec living mostly in https://github.com/linkerd/linkerd2-proxy/tree/master/src/proxy/http ?

grampelberg commented 5 years ago

The H2 codec is at https://github.com/carllerche/h2

hawkw commented 5 years ago

@zaharidichev The linkerd2_proxy::proxy::http module mostly consists of HTTP-specific proxy behaviours; the actual protocol implementation and codecs are provided by libraries.

Linkerd 2 uses the hyper library (maintained by @seanmonstar), which implements clients and servers for HTTP/1 and HTTP/2. As @grampelberg mentioned, the HTTP/2 protocol implementation is h2 (maintained by @carllerche), which hyper uses to support HTTP/2.

zaharidichev commented 5 years ago

@hawkw Thanks for the information. This is indeed quite an interesting issue.

Nethminiromina commented 5 years ago

Hi, I am also interested in this project for GSoC 2019. Can anyone share a project proposal template? Thank you

carlosb1 commented 5 years ago

Same, for me. I am interested in the project!

grampelberg commented 5 years ago

@Nethminiromina we don't have a specific template, a writeup on what you're planning on doing and implementation details would help a ton.

carlosb1 commented 5 years ago

I was checking the code and would like to share some impressions to help. I understand that linkerd-proxy implements all the business logic, while linkerd-proxy-api exists to work with gRPC (tower-grpc) and protobuf. Furthermore, I found a useful example for tower-grpc that I refactored to set up a template, my example.

ruig2 commented 5 years ago

Hi, I'm a PhD student at the University of California, Irvine and I'm also interested in this project. I briefly went through the code and documentation of Linkerd2 and Kafka, and my idea about this project is as follows:

Target: Enable Linkerd2 to proxy messages that follow the Kafka protocol

Current condition

Plan to solve the issue: Since 1) Linkerd2 can proxy TCP and 2) Kafka is based on TCP, we only need to add a Kafka decoder/codec to decode the bytes so that we can analyze them and provide metrics accordingly. Our first choice would be to use a sophisticated third-party decoder.

Tools/libraries that can help

Code-level plans

@grampelberg and @hawkw Please correct me if anything is wrong. Any ideas or comments?

olix0r commented 5 years ago

Hi @ruig2! Thanks for getting involved.

For the sake of prototyping, it might be possible to pull in an existing kafka client, but for anything that we'd want to accept into master, we'd want to use something that fits more tightly into our runtime.

Specifically, we need a Rust-based Kafka codec (i.e. no unsafe code) that exposes Kafka's messaging primitives over tokio's async i/o & execution model. (If you want to see an especially complex protocol implementation, check out h2).

I'm not too familiar with Kafka's protocol details, but hopefully we should not need to implement any of the "complicated" kafka features in the proxy--those can be provided by the app's kafka clients--but we need to be able to receive decoded kafka messages from a tokio server and be able to send them on a tokio client.

I expect that the existing HTTP proxying (and metrics, routing, etc.) logic will not be generally reusable for Kafka proxying... though we'll also want to fit into tower layers, which is (effectively) how we compose Linkerd's internals as a set of layers... See, for example, how we construct clients to the control plane.

Then, we can implement a metrics middleware that sits in a kafka-stream-forwarding stack that looks something like the above.

Once we know what the codec library looks like, it should be a lot easier to describe how a metrics layer should be implemented.
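To illustrate the layering idea in plain Rust (a minimal sketch only, not the actual tower `Service` API --- the `Handle` trait and all type names here are hypothetical), a metrics middleware wrapped around a forwarding handler might look like:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Simplified stand-in for a decoded Kafka request.
#[allow(dead_code)]
struct KafkaRequest {
    api_key: i16,
}

/// Simplified, synchronous stand-in for tower's `Service` trait.
trait Handle {
    fn handle(&self, req: KafkaRequest);
}

/// Innermost layer: would forward the request to the upstream broker.
struct Forwarder;

impl Handle for Forwarder {
    fn handle(&self, _req: KafkaRequest) {
        // Forwarding to the upstream Kafka broker is omitted here.
    }
}

/// Metrics middleware: bumps a counter, then delegates to the inner layer.
struct Metrics<S> {
    requests_total: Arc<AtomicU64>,
    inner: S,
}

impl<S: Handle> Handle for Metrics<S> {
    fn handle(&self, req: KafkaRequest) {
        self.requests_total.fetch_add(1, Ordering::Relaxed);
        self.inner.handle(req);
    }
}

fn main() {
    let requests_total = Arc::new(AtomicU64::new(0));
    let stack = Metrics {
        requests_total: requests_total.clone(),
        inner: Forwarder,
    };
    stack.handle(KafkaRequest { api_key: 18 }); // ApiVersions
    stack.handle(KafkaRequest { api_key: 0 }); // Produce
    assert_eq!(requests_total.load(Ordering::Relaxed), 2);
}
```

The point of the sketch is only the composition: the metrics layer is generic over whatever sits beneath it, which is how tower-style stacks stay reusable across protocols.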

Hope this helps.

ruig2 commented 5 years ago

@olix0r Thanks for your detailed response. Two things are now at the top of my mind:

  1. Read up on the Kafka protocol, and evaluate the Rust-based Kafka clients to see which one we can adopt;
  2. Check the Linkerd2 code to learn how to call the Kafka decoder in Linkerd2 and how to collect metrics while proxying.

Maybe my next step would be to build a prototype (as you mentioned earlier, perhaps with a not-so-safe decoder) and document it in my proposal to increase my chance of being admitted into the GSoC program. Any ideas on how I can do better?

olix0r commented 5 years ago

@ruig2 When you start going down the path of figuring out how to prototype with tokio, the tokio gitter is a great resource.

ruig2 commented 5 years ago

@olix0r @grampelberg @hawkw Good news: I've implemented a prototype that recognizes the Kafka request header via Tokio.

When you connect a Kafka client to the prototype, you can see something similar to the following:

Received a connection

bytes: [0, 0, 0, 20, 0, 18, 0, 2, 0, 0, 0, 6]
Request Size 20
Request api_key 18
Request api_version 2
Request correlation_id 0

where api_key 18 means ApiVersions Request (https://kafka.apache.org/protocol#The_Messages_ApiVersions).

The related Rust code of the prototype is at https://github.com/ruig2/kafka-codec/blob/master/tokio/examples/detect_kafka.rs#L41-L45 .

I'll polish and document how to run the prototype and the Kafka client later. Currently, the prototype does nothing but read the first 12 bytes of the stream and decode them according to the Kafka protocol. There is lots of work to do (e.g. forwarding the TCP traffic to the Kafka server, handling the case when the traffic is not Kafka, ...), but this prototype is a very good start for me.
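The decode step the prototype performs can be sketched in stdlib-only Rust: the first 12 bytes of a Kafka request are a big-endian i32 size, an i16 api_key, an i16 api_version, and an i32 correlation_id. (A sketch, not the prototype's actual code; note that the trailing bytes [0, 0, 0, 6] in the log above decode to a correlation_id of 6.)

```rust
/// Parse the fixed 12-byte prefix of a Kafka request:
/// big-endian i32 size, i16 api_key, i16 api_version, i32 correlation_id.
fn parse_request_header(buf: &[u8; 12]) -> (i32, i16, i16, i32) {
    let size = i32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]);
    let api_key = i16::from_be_bytes([buf[4], buf[5]]);
    let api_version = i16::from_be_bytes([buf[6], buf[7]]);
    let correlation_id = i32::from_be_bytes([buf[8], buf[9], buf[10], buf[11]]);
    (size, api_key, api_version, correlation_id)
}

fn main() {
    // The bytes logged above.
    let bytes = [0, 0, 0, 20, 0, 18, 0, 2, 0, 0, 0, 6];
    let (size, api_key, api_version, correlation_id) = parse_request_header(&bytes);
    assert_eq!((size, api_key, api_version, correlation_id), (20, 18, 2, 6));
    println!(
        "size={} api_key={} api_version={} correlation_id={}",
        size, api_key, api_version, correlation_id
    );
}
```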

ruig2 commented 5 years ago

A draft of my proposal is created at https://docs.google.com/document/d/1cQeQ5GMyTFAuytygGE281OJJjsLunUImZNm_uppKg-Q/edit . Looking forward to your comments, and feel free to tell me if more information is needed.

halcyondude commented 4 years ago

Is there any update or movement on this issue? We are migrating workloads involving meshed Kafka clients and have just started investigating possibilities.

ruig2 commented 4 years ago

@halcyondude The report of the GSoC project is at https://github.com/ruig2/KafkaRustCodec/blob/master/gsoc.md . I did some investigation into meshed Kafka clients this summer; however, it is not yet ready for a production environment.

Please let me know if I can help ;)

loxp commented 4 years ago

Hi, I am interested in this issue for Community Bridge 2020. Is there any preliminary task for it?

grampelberg commented 4 years ago

@loxp you'll want to jump into slack and get an RFC together. We'd love to get to know you a little better too, if you could work through a getting started issue that'd be awesome!

JTarball commented 3 years ago

Hi, are there any updates on this topic?

ruig2 commented 3 years ago

@JTarball Hi, this project was part of the Google Summer of Code 2019. The related codebase is at https://github.com/ruig2/KafkaRustCodec and hope this is helpful to you.

rewt commented 3 years ago

@JTarball Hi, this project was part of the Google Summer of Code 2019. The related codebase is at https://github.com/ruig2/KafkaRustCodec and hope this is helpful to you.

Do you have any docs on how to implement?

tychedelia commented 2 years ago

Hi all. I've started maintaining a full implementation of the Kafka protocol and am going to try to tackle this as a side project to level up my Rust.

I've just started exploring the proxy codebase, so don't have a ton of thoughts right now, but here are some outstanding questions I'm pondering:

I will continue to post here as I dig in and discover more. This is mostly for my own personal edification rather than needing this to be accepted upstream, but I will hopefully open some WIP pull requests to get feedback if people have time to review / evaluate the viability of making this "a thing." Way too early to say now.

olix0r commented 2 years ago

@tychedelia If I were to tackle this, I'd probably take the following approach:

  1. Ignoring the implementation, what are the metrics/labels that would be worth tracking? I'd probably start by mocking out the Prometheus metrics as they would be exported once this is all done.
  2. How does a proxy know whether it's transporting Kafka vs another protocol? We'll need to update the Server CRD to include a proxyProtocol variant for Kafka. And then we'll need to update the proxy api to include these hints for both inbound policy as well as the destination service's GetProfile protocol hinting.
  3. Finally, we'd update the inbound and outbound proxies to take advantage of this hinting. I'd probably start on the inbound (server) side.
  4. I don't think load balancing will apply, as I assume Kafka clients will never initiate connections to a service IP. Instead, I'd expect the client to only ever initiate connections to individual pods.
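As a sketch of step 1, here is what such a mock might look like when rendered in the Prometheus text exposition format. The metric and label names are purely illustrative, not actual Linkerd metrics:

```rust
/// Render a hypothetical per-topic Kafka request counter in the
/// Prometheus text exposition format. Metric/label names are invented
/// for illustration only.
fn mock_kafka_metrics(topic: &str, api: &str, count: u64) -> String {
    format!(
        "# TYPE kafka_requests_total counter\n\
         kafka_requests_total{{topic=\"{topic}\",api=\"{api}\"}} {count}\n"
    )
}

fn main() {
    print!("{}", mock_kafka_metrics("orders", "Produce", 42));
}
```

Sketching the exported series first (which labels exist, what cardinality they imply) makes it much easier to decide what the codec actually needs to surface to the metrics layer.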