sourcegraph / gophercon-2018-liveblog

Documents how to write a great liveblog post and how to submit your post for the GopherCon 2018 liveblog hosted by Sourcegraph at https://sourcegraph.com/gophercon

From Prototype to Production: Lessons from Building and Scaling Reddit's Ad Serving Platform with Go #24


attfarhan commented 6 years ago

Presenter: Deval Shah

Liveblogger: Farhan Attamimi

How Reddit built its ad-serving system using Go, and the lessons learned from the process.

Summary

The Reddit engineering team recently introduced Go into their stack to write a new ad-serving system to replace a third party system. Deval Shah talks us through the architecture of the new service, the Reddit team's experience using Go for the first time, and all the lessons they've learned from using Go to build this ad server.


Introduction to Reddit

Reddit is the front page of the internet. It's a social network with tens of thousands of interest communities, where people go to discuss the things that matter to them.

Reddit by the numbers:

Any system that Reddit builds must scale to handle this level of traffic.

Ads Architecture Overview

The ads server handles the entire ads flow. Everything from selecting which advertisements to show, to any post-processing after an ad is shown to the user, is handled by the ads server.

Ad Serving @ Reddit

There are several requirements for the Reddit ad server:

Ad Serving @ Reddit before:

Before, whenever a user went to reddit.com, the Reddit monolith backend would send a request to a third-party ad server. The third-party server would respond with one or more ads that it selected, and those ads were returned to the user.

[diagram: request flow through the third-party ad server]

After a while, they realized that continuing to use the third-party ad server wouldn't work for them moving forward because it was:

So they decided to build their own ad server with a team of three people. They started with the infrastructure, then wrote the services, and then rolled the system out to production, where it now runs.

Ad Serving Infrastructure:

- Apache Thrift for all RPCs: around since 2007, one of the first RPC protocols. Reddit has been using it since the very beginning, so every system they build must be able to speak Thrift.
- RocksDB for data storage: an open source key-value store built by Facebook. It's an embeddable data store, which avoids a network hop, and it's optimized for high read and write throughput.
- Go as the main backend language: quite obvious in hindsight, but this was the first time Reddit used Go in production. Reddit is mostly Python and Java, so the team wanted to make sure Go would be a first-class citizen among Reddit's languages and could support everything Reddit needed.

[diagram: ad serving infrastructure]

Ad Server Architecture:

This is the architecture of the new ad server: [architecture diagram]

A brief overview of how it works:

In this architecture, the Go services are:

Some other Go tools and services at Reddit that won't be covered in-depth:

Our experience with Go

This is the first experience Reddit has had with Go. Deval says the experience has been great so far. The effort started with two to three engineers using Go, and it has now grown to around a dozen engineers working on Go.

The main advantages they've seen with Go are:

Lessons learned

This is a set of five problems the team faced, how Reddit dealt with them, and the lessons learned from these challenges.

Problem 1: How to build production-ready microservices?

Reddit had prior experience doing this for Python, but not in Go.

The initial prototype, built by way of lots of Stack Overflow reading and Googling, worked, but it was clearly not going to scale as more developers worked on it.

Some issues they saw were:

They realized the Go community had already solved these problems, so they looked at existing frameworks. Some options they encountered:

[comparison of Go microservice frameworks]

They decided that Go-Kit made the most sense. The main reasons Reddit picked Go-Kit were that it:

Go-Kit @ Reddit. This is a diagram of the enrichment service using Go-Kit:

[diagram: the enrichment service built with Go-Kit]

There are some things to note about this architecture. The core service has two implementations: an in-memory implementation, which was good enough for the prototype, and a RocksDB implementation, which is what production uses. The in-memory implementation still exists for local development.

There are several middleware layers: tracing, logging, and metrics. Finally, the Thrift transport is at the top level. This structure makes it easy to make changes. For example, if they wanted to change the transport layer from Thrift to gRPC, they'd only need to change the top layer.
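To make that layering concrete, here is a minimal sketch of the pattern Go-Kit encourages: a core service interface, swappable implementations, and middleware that wraps the service without touching business logic. The names are illustrative, not Reddit's actual code:

```go
package enrichment

import (
	"context"
	"log"
	"time"
)

// Ad is a placeholder for whatever the enrichment service returns.
type Ad struct {
	ID       string
	Metadata map[string]string
}

// Service is the core business interface; both the in-memory and the
// RocksDB-backed implementations satisfy it.
type Service interface {
	Enrich(ctx context.Context, adID string) (Ad, error)
}

// inMemoryService is the map-backed implementation, still used for
// local development.
type inMemoryService struct {
	ads map[string]Ad
}

func (s *inMemoryService) Enrich(ctx context.Context, adID string) (Ad, error) {
	return s.ads[adID], nil
}

// Middleware decorates a Service, Go-Kit style.
type Middleware func(Service) Service

// LoggingMiddleware logs each call and its duration without the core
// service knowing anything about logging.
func LoggingMiddleware(logger *log.Logger) Middleware {
	return func(next Service) Service {
		return &loggingService{logger: logger, next: next}
	}
}

type loggingService struct {
	logger *log.Logger
	next   Service
}

func (s *loggingService) Enrich(ctx context.Context, adID string) (Ad, error) {
	defer func(begin time.Time) {
		s.logger.Printf("method=Enrich ad=%s took=%s", adID, time.Since(begin))
	}(time.Now())
	return s.next.Enrich(ctx, adID)
}
```

Because each concern lives in its own layer, the implementations are interchangeable behind the Service interface, and a transport change only replaces the outermost wrapper.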

Using Go-Kit was beneficial because it gave the team a good example of how to structure Go code. They didn't have experience with this before, so using Go-Kit was helpful for understanding the typical structure of Go services.

Lesson 1: Use a framework/toolkit. Not necessarily for everything you use Go for, but for production services that require metrics, logging, and so on, use libraries that have solved the problem rather than trying to do it yourself.

Problem 2: How to roll out the new system safely and quickly?

The ultimate goal was to roll out the new ad server with minimal impact on Reddit users, paying advertisers, and other internal teams reliant on the ads team. The third-party ad server was a black box, and Reddit needed a way to iterate rapidly, learn, and get better.

It was like changing airplanes mid-flight: they slowly added the new infrastructure around their third-party service, and when the new system was ready, they ripped the third-party service out:

[diagram: incremental rollout around the third-party ad server]

How did Go help with this? Go allowed them to make the move to the new ad server safely and easily, aided by these Go characteristics:

Lesson 2: Go makes rapid iteration easy & safe.

Problem 3: How to debug latency issues?

Once the new ad server was deployed, they did see some slowness: network glitches, bad deploys, and so on.

pprof is great if you know exactly which service is having issues. Distributed tracing, on the other hand, gives you visibility across services. They didn't have support for distributed tracing on the ads side, but they did have support for it elsewhere on Reddit's stack.

Why is tracing useful?

[example distributed trace]

Tracing is usually easy. You have a client and a server. On the client side, you extract trace identifiers from the context and inject them into the request you're sending to the server. On the server side, when you get a request with identifiers, you put them into a context object and pass them around. This is very straightforward using HTTP and gRPC, and there's no reason not to do this; a sketch of the pattern follows the diagram below.

[diagram: injecting and extracting trace identifiers between client and server]
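As a sketch of that inject/extract pattern over plain HTTP (the X-Trace-ID header name and ctxKey type are illustrative, not a particular tracing library's API):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

type ctxKey string

const traceIDKey ctxKey = "traceID"

// Client side: pull the trace ID out of the context and inject it
// into an outgoing request header.
func callServer(ctx context.Context, url string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	if id, ok := ctx.Value(traceIDKey).(string); ok {
		req.Header.Set("X-Trace-ID", id)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// Server side: extract the trace ID from the headers and put it into
// the request context so downstream calls can propagate it further.
func handler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	if id := r.Header.Get("X-Trace-ID"); id != "" {
		ctx = context.WithValue(ctx, traceIDKey, id)
	}
	fmt.Fprintf(w, "handled with trace %v\n", ctx.Value(traceIDKey))
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```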

But Reddit was dealing with Thrift, so they ran into some problems.

They took a look at two Thrift alternatives, Facebook Thrift and Apache Thrift. The two key features they were looking for were support for headers and context objects: [table: Facebook Thrift vs. Apache Thrift feature support]

They tried using Facebook Thrift, but there were some issues, mainly that the lack of a context object required messy workarounds, leading to messy code and complications. In Apache Thrift, the context object was supported, but it doesn't have support for headers. So, the solution: add headers to Apache Thrift. This had been done for other languages, but not for Go. So, they added THeader support to Apache Thrift for Go. This means context objects are now supported, and headers can store trace identifiers.

If you want to see these changes, you can check out https://github.com/devalshah88/thrift. Deval hopes to get the changes through the contribution process and merged upstream.

Here's a look at the tracing code. The client wrapper just extracts trace information from the context object and adds it to the headers: [code: client-side tracing wrapper]

The server wrapper takes information from the headers and injects it into the context object so it can be passed around: [code: server-side tracing wrapper]

This code is from https://github.com/devalshah88/thrift-tracing.
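The slide code isn't reproduced here, but the shape of the two wrappers is roughly the following. The header name and the setHeader/getHeader stand-ins are hypothetical; the real wrappers in the repo above work against Thrift's actual protocol types:

```go
package tracing

import "context"

type ctxKey string

const traceIDKey ctxKey = "traceID"

// Client wrapper: copy the trace ID from the context into the outgoing
// request headers. setHeader is a hypothetical stand-in for writing a
// THeader key/value pair.
func injectTrace(ctx context.Context, setHeader func(key, value string)) {
	if id, ok := ctx.Value(traceIDKey).(string); ok {
		setHeader("Trace-ID", id)
	}
}

// Server wrapper: copy the trace ID from the incoming headers into the
// context so it can be passed around. getHeader is likewise hypothetical.
func extractTrace(ctx context.Context, getHeader func(key string) string) context.Context {
	if id := getHeader("Trace-ID"); id != "" {
		ctx = context.WithValue(ctx, traceIDKey, id)
	}
	return ctx
}
```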

Having done all this work, distributed tracing proved to be very useful in debugging latency issues. The takeaway, however, is lesson 3: Distributed tracing with Thrift and Go is hard.

Problem 4: How to handle slowness/timeouts?

At Reddit, they want systems to handle slowness gracefully. They never want users to suffer, so if there is slowness, Reddit would rather not show ads than degrade the user experience.

The two goals they had were:

Use the context object to enforce timeouts within a service. This is the code from the enrichment service that adds a deadline to the context object, passes it through, and exits early if the deadline expires: [code: enforcing a deadline via the context object]
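The slide isn't reproduced here; a minimal sketch of the idea, assuming a hypothetical lookup helper in place of the real RocksDB read, looks like this:

```go
package enrichment

import (
	"context"
	"time"
)

type Ad struct{ ID string }

// lookup stands in for the real RocksDB read (hypothetical helper).
func lookup(adID string) Ad {
	time.Sleep(10 * time.Millisecond) // simulated work
	return Ad{ID: adID}
}

// Enrich gives itself at most 25ms and exits early if the deadline
// expires, instead of making the caller wait.
func Enrich(ctx context.Context, adID string) (Ad, error) {
	ctx, cancel := context.WithTimeout(ctx, 25*time.Millisecond)
	defer cancel()

	result := make(chan Ad, 1)
	go func() { result <- lookup(adID) }()

	select {
	case ad := <-result:
		return ad, nil
	case <-ctx.Done():
		// Deadline expired: return early; the buffered channel lets
		// the worker goroutine finish without leaking.
		return Ad{}, ctx.Err()
	}
}
```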

The result of this is good, but not enough: [graphs: client-side vs. server-side enrichment latency] The first graph shows how long it took to get responses from the enrichment service. This particular time frame had some slowness, but users never waited longer than 25ms.

The second graph shows that on the server side, the enrichment service was processing the request for up to 70ms, so the server was wasting resources doing work after the client had already timed out and didn't need a response anymore.

What would typically be done is to propagate deadlines over HTTP. This code adds a timeout, which is passed to the server through the context object: [code: HTTP deadline propagation]
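A sketch of how this looks with the standard library (the /ads endpoint and timings are illustrative): the client's context deadline travels with the request, and net/http cancels the server-side context when the client gives up:

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// Client side: attach a 25ms deadline to the request via its context.
func fetchAds() (*http.Response, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Millisecond)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://localhost:8080/ads", nil) // hypothetical endpoint
	if err != nil {
		return nil, err
	}
	return http.DefaultClient.Do(req)
}

// Server side: net/http cancels r.Context() once the client gives up,
// so the handler can abandon slow work instead of wasting resources.
func selectAds(w http.ResponseWriter, r *http.Request) {
	select {
	case <-time.After(70 * time.Millisecond): // stand-in for slow ad selection
		w.Write([]byte("ad payload"))
	case <-r.Context().Done():
		// The client timed out and disconnected; exit early.
	}
}

func main() {
	http.HandleFunc("/ads", selectAds)
	http.ListenAndServe(":8080", nil)
}
```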

Thrift makes this hard. There is no context object used here, so if the client times out, the handler goroutine doesn't know that and doesn't exit: [code: Thrift handler with no context object]

This is not great, but there are ways of fixing this:

One option is to add a deadline to the request payload. The client includes the deadline in the request; the server injects the deadline into the context object and uses it. This wasn't great, because the change would have to be made in every endpoint.

Instead, they passed the deadline as a Thrift header, similar to how they pass trace identifiers. After this change, the latencies they saw on the enrichment server side were similar to those on the client side: [graph: server-side latencies after the change]
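The slide code isn't shown here, but the shape of the fix is roughly this sketch, with a hypothetical header name and setHeader/getHeader stand-ins for the THeader plumbing:

```go
package deadline

import (
	"context"
	"strconv"
	"time"
)

// Client wrapper: serialize the context's deadline into a header.
func inject(ctx context.Context, setHeader func(key, value string)) {
	if d, ok := ctx.Deadline(); ok {
		setHeader("Deadline-Unix-Nanos", strconv.FormatInt(d.UnixNano(), 10))
	}
}

// Server wrapper: restore the client's deadline onto the incoming
// context so handlers and downstream calls stop when it expires.
func extract(ctx context.Context, getHeader func(key string) string) (context.Context, context.CancelFunc) {
	if v := getHeader("Deadline-Unix-Nanos"); v != "" {
		if ns, err := strconv.ParseInt(v, 10, 64); err == nil {
			return context.WithDeadline(ctx, time.Unix(0, ns))
		}
	}
	return context.WithCancel(ctx)
}
```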

Lesson 4: Use deadlines within and across services.

Problem 5: How to ensure new features don't degrade performance?

Rapid iteration and complex business logic can lead to performance issues. The ad service team needed processes and tools to ensure they could move fast without violating the latency SLA. To do this, they made use of load testing and benchmarking.

Load testing using Bender: [example Bender load test code]

This is the output you'd get from Bender: [sample Bender output]

Load testing is really useful for testing changes under heavy load, and lets developers optimize new features for high load before pushing to production.
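Bender's actual API differs from what's shown here; the snippet below is just a bare-bones plain-Go load generator to show the flavor of firing requests at a fixed rate and recording latencies (the target URL and rate are made up):

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

func main() {
	const (
		qps      = 100
		duration = 5 * time.Second
		target   = "http://localhost:8080/ads" // hypothetical endpoint
	)

	ticker := time.NewTicker(time.Second / qps)
	defer ticker.Stop()
	deadline := time.After(duration)

	var wg sync.WaitGroup
	for {
		select {
		case <-deadline:
			wg.Wait() // let in-flight requests finish
			return
		case <-ticker.C:
			wg.Add(1)
			go func() {
				defer wg.Done()
				start := time.Now()
				resp, err := http.Get(target)
				if err != nil {
					fmt.Println("error:", err)
					return
				}
				resp.Body.Close()
				fmt.Printf("status=%d latency=%s\n", resp.StatusCode, time.Since(start))
			}()
		}
	}
}
```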

They also make use of benchmarking for all critical systems. This benchmarking code: [benchmark code] gets you this output: [benchmark output]
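The slide isn't reproduced here, but Go benchmarks use the standard testing package; a minimal example against a hypothetical enrichAd function:

```go
package enrichment

import "testing"

// enrichAd is a hypothetical function under test.
func enrichAd(adID string) string { return "enriched:" + adID }

// Benchmarks live next to the code and run with `go test -bench=.`.
// The output reports iterations and time per operation, e.g.:
//   BenchmarkEnrichAd-8   5000000   243 ns/op   (illustrative numbers)
func BenchmarkEnrichAd(b *testing.B) {
	for i := 0; i < b.N; i++ {
		enrichAd("t3_example")
	}
}
```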

Benchmarking helps by:

Lesson 5: Benchmarking and load testing are easy. Do it!

Recap:

  1. Use a framework/toolkit
  2. Go makes rapid iteration easy and safe
  3. Distributed tracing with Thrift and Go is hard
  4. Use deadlines within and across services
  5. Use load testing and benchmarking

Conclusion:

ryan-blunden commented 6 years ago

This is now live:

Tweet: https://twitter.com/srcgraph/status/1034913979758403584

Post: https://about.sourcegraph.com/go/gophercon-2018-from-prototype-to-production-lessons-from-building-and/