Book: SE@Google Ch 11: Testing Overview #2

FergusMok commented 6 months ago

Book: Software Engineering at Google Chapter: 11 (Testing Overview)

Main summary:

  1. Tests serve two main objectives: catching bugs, and building enough confidence in the code that developers can make product changes quickly.
  2. There are many benefits to testing, including fewer bugs, time saved, better documentation, and cleaner code.
  3. Good tests take both test size and test scope into account. Small tests run fast, are more deterministic, and pinpoint errors more easily than bigger tests.
  4. A good test suite should have a large number of unit tests, an intermediate number of integration tests, and a small number of E2E tests. This allows unit and integration tests to run easily throughout the day, and E2E tests to run during build or deployment.
  5. When introducing tests to an organization, introduce the benefits slowly. Engineers must see the value and embrace testing themselves. Enforced rules, such as minimum code coverage, are a useful guide but not fully effective, and may backfire if engineers do not believe in the tests.

Full Summary

Motivations

There are 2 main motivations for testing:

  1. It is exponentially more expensive to catch bugs later in the development cycle.
  2. Regression testing gives developers the confidence to make changes and create new features.

Writing and running tests should also be efficient. If it is not, engineers will find workarounds, and a bad test suite can be worse than no test suite at all.

Google Web Server

In the early days of Google, testing was a low priority, and the worst culprit was the Google Web Server (GWS), the service responsible for serving Google's search queries.

In 2005, as the project scaled, it faced slower and buggier releases, and developers lost confidence in the service. Many bugs surfaced only in production, which led to 80% of releases being rolled back.

The technical lead instituted automated testing, and within a year the number of emergency pushes dropped by half, despite a record number of new changes. Today, GWS has tens of thousands of tests and continues to release every day relatively bug-free.

The GWS team sees tests as contributing to the collective wisdom of the team: each new test spares other members from having to dig around with a debugger.

Modern Testing

With the increasing complexity and size of software systems, human testers can no longer keep up by manually validating every behavior. This has driven the automation of testing.

However, a good testing process is difficult to achieve. We should:

  1. Write tests as code, allowing them to be run easily on machines (see the sketch after this list)
  2. Write tests for different environments, such as different browsers or languages.
  3. Have a process that ensures failing tests are rectified immediately.
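
To make point 1 concrete, here is a minimal sketch (not from the book) of tests written as code in Python, runnable on any machine, e.g. with pytest; the `apply_discount` function is a hypothetical example.

```python
# Hypothetical function under test -- not an example from the book.
def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    return round(price * (1 - percent / 100), 2)


# Tests written as code: any machine can run them, e.g. `python -m pytest`.
def test_apply_discount():
    assert apply_discount(100.0, 20) == 80.0


def test_zero_discount_leaves_price_unchanged():
    assert apply_discount(59.99, 0) == 59.99
```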

Benefits of a good testing process:

  1. Less time spent by engineers debugging
  2. Less time spent by reviewers to verify the correctness of the behavior
  3. Better documentation as the behavior of the system is enforced and shown by tests
  4. More confidence in making changes to the behavior of the system
  5. Thoughtful design: Tests encourage refactoring, since it is easy to see whether behavior has changed. They also encourage modular code, which is easier to test. Both result in simpler, cleaner code that needs less reworking later.
  6. Quality, fast releases: more confidence leads to faster releases.

Designing a Test Suite

Two important considerations shape a test suite: test size and test scope.

Test Size

Instead of the traditional "unit" or "integration" labels, Google encourages engineers to write the smallest possible test. Size is determined not by lines of code, but by how the test runs, what it is allowed to do, and how many resources it consumes. For example,

  1. Small tests run in a single process
  2. Medium tests run on a single machine
  3. Large tests run wherever they want

Google focuses on the speed and determinism of tests. Smaller tests run faster and more deterministically precisely because more restrictions are placed on them; as tests get bigger, the restrictions are relaxed.

Small tests: Important restrictions include not being able to sleep, perform I/O operations, make blocking calls, or access the network or the disk. When a test needs one of these, it must use a test double (via dependency injection), replacing the actual dependency with a lightweight fake. This keeps tests fast and deterministic, allowing Google to run hundreds of these tests throughout the day.
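
As a rough illustration of a test double injected in place of a real dependency, consider the following Python sketch; the `UserService`, `Database`, and `FakeDatabase` names are hypothetical, not from the book.

```python
class Database:
    """Real implementation would access the network or disk."""
    def get_user(self, user_id: int) -> str:
        raise NotImplementedError


class FakeDatabase(Database):
    """Lightweight in-memory double: fast and deterministic."""
    def __init__(self, users):
        self._users = users

    def get_user(self, user_id: int) -> str:
        return self._users[user_id]


class UserService:
    def __init__(self, db: Database):
        # The dependency is injected, so a small test can swap in a double.
        self._db = db

    def greeting(self, user_id: int) -> str:
        return f"Hello, {self._db.get_user(user_id)}!"


def test_greeting_uses_stored_name():
    service = UserService(FakeDatabase({1: "Ada"}))
    assert service.greeting(1) == "Hello, Ada!"
```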

Medium tests: These tests are not allowed to make external network calls; only the local machine's localhost may be contacted. This makes integration with other dependencies more reliable. However, behavior involving other processes is harder to guarantee, especially across operating systems, so engineers need to be more careful with medium tests.
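
A hedged sketch of what a medium test might look like under these rules: it talks only to localhost, spinning up a throwaway HTTP server in-process. The `/ping` endpoint and handler are hypothetical.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"pong")

    def log_message(self, *args):
        pass  # keep test output clean


def test_ping_endpoint_on_localhost():
    # Port 0 asks the OS for a free port; no external network is touched.
    server = HTTPServer(("127.0.0.1", 0), PingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        body = urlopen(f"http://127.0.0.1:{port}/ping").read()
        assert body == b"pong"
    finally:
        server.shutdown()
```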

Large tests: Run on a remote cluster, usually as full-system end-to-end tests. Google runs large tests only during the build and release processes so as not to impact developer workflow.

Non-deterministic flaky tests are expensive. Flaky tests:

  1. Require additional computational resources to re-run
  2. Cost engineers time to investigate
  3. Erode confidence in the test suite. In the author's experience, once engineers lose confidence because of flaky tests, they stop reacting to test failures.

To keep tests easy to read and write, Google also strongly discourages control flow statements, such as conditionals and loops, in a test.
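
The following sketch (a hypothetical `is_leap_year` function, not from the book) shows why: a loop hides which case failed and adds logic that can itself be wrong, while straight-line assertions pinpoint the failing input.

```python
def is_leap_year(year: int) -> bool:
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Discouraged: the loop obscures which case failed.
def test_leap_years_with_loop():
    for year, expected in [(2000, True), (1900, False), (2024, True)]:
        assert is_leap_year(year) == expected


# Preferred: straight-line assertions, one behavior per test.
def test_2000_is_a_leap_year():
    assert is_leap_year(2000) is True


def test_1900_is_not_a_leap_year():
    assert is_leap_year(1900) is False
```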

Test Scope

Test scope refers to how much code is being validated by a test. Note that this is not the amount of code executed, but the amount of code whose behavior is enforced.

Narrow-scoped tests (unit tests): validate the logic of a small, focused part of the codebase. They make up about 80% of tests.

Medium-scoped tests (integration tests): verify interactions between a small number of components. They make up about 15% of tests.

Large-scoped tests (functional, end-to-end, or system tests): validate interactions between distinct parts of the system. They make up about 5% of tests.

The 80% / 15% / 5% model creates a good "pyramid" shape for the distribution of tests. It allows you to narrow down the cause of a failing end-to-end test quickly using fast-running unit tests.

Antipatterns for test distribution shapes

  1. Ice cream cone: Many end-to-end tests and few unit tests. Such suites are slow and difficult to run, usually non-deterministic, and make it hard to trace the source of a bug. This shape is usually found in rushed projects.
  2. Hour-glass: Few integration tests, with many unit and E2E tests. This usually results in E2E test failures that are hard to diagnose; medium-scoped tests would run faster and pinpoint the exact problems. The pattern occurs when tight coupling makes it difficult to use test doubles (e.g. mocks) for individual dependencies.

Beyoncé Rule: What to test?

“If you liked it, then you shoulda put a test on it.”

The rule encourages you to test everything you don't want to break, including performance, behavioral correctness, accessibility, security, and error handling.
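
Applied to error handling, for instance, the rule means that if you rely on an error being raised, you should put a test on it. A minimal sketch using pytest (the `parse_port` function is hypothetical):

```python
import pytest


def parse_port(value: str) -> int:
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def test_valid_port_is_parsed():
    assert parse_port("8080") == 8080


def test_out_of_range_port_is_rejected():
    # The error path is a behavior we rely on, so it gets a test too.
    with pytest.raises(ValueError):
        parse_port("70000")
```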

Code Coverage as a Metric

Code coverage is usually measured as the number of lines executed during tests. However, setting a minimum code coverage level usually leads engineers to treat it as a ceiling instead of a floor. A better approach is to think about the behaviors that are being enforced.
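
A small sketch of how line coverage can mislead (the `absolute` function is hypothetical): the test below executes every line, yet enforces no behavior, so the bug survives.

```python
def absolute(x: int) -> int:
    if x < 0:
        return x  # BUG: should be -x, yet the test below still "covers" it
    return x


def test_absolute_has_full_line_coverage_but_misses_the_bug():
    absolute(5)   # executes the positive branch
    absolute(-5)  # executes the negative branch, but asserts nothing
    # 100% line coverage, zero behaviors enforced.
```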

Testing at Google's Scale

Google uses a monolithic repository, where all code is stored in a single repository. Furthermore, no repository branches are used, and changes are committed immediately to head.

Pitfalls of a large test suite

  1. Poorly written tests, such as brittle tests that over-specify outcomes: a small change can cause dozens of unrelated tests to fail. A common source of brittleness is the misuse of mock objects.
  2. Larger suites take more time to run, which may mean tests are run less frequently.
  3. Tests may have unnecessary speed limits, such as calls to sleep() (see the sketch after this list).
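
As a sketch of removing such a speed limit (the `wait_until` helper is hypothetical, not a Google utility): instead of sleeping for a fixed interval, poll for the condition with a short timeout, so the test finishes as soon as the work completes.

```python
import threading
import time


def wait_until(condition, timeout=2.0, interval=0.01):
    """Poll `condition` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False


def test_background_flag_is_set():
    done = threading.Event()
    threading.Timer(0.05, done.set).start()  # simulated background work
    # Instead of `time.sleep(1)` followed by an assert, poll:
    assert wait_until(done.is_set)
```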

History of Testing at Google

Three key initiatives helped introduce automated testing to Google in 2005. The Testing Grouplet had considered asking senior executives for a testing mandate, but quickly decided against it, believing instead in slowly demonstrating testing's success and spreading awareness of it.

Orientation Classes

As Google was rapidly expanding, the company targeted new hires, who would soon outnumber the existing staff. As part of the orientation program, the Grouplet added an hour-long discussion of the benefits of automated testing. Crucially, testing was presented as a standard practice of the company; the new hires did not realize they would be pioneering it in their new teams. As a result, new projects adopted testing from the start, while pre-existing projects gained a growing number of engineers who supported the initiative.

Test Certified

The Testing Grouplet devised a certification program for projects called 'Test Certified'. Organized into five levels, it gave projects concrete actions to improve their testing practices within the current review cycle, conveniently fitting into projects' internal planning schedules. Furthermore, an internal dashboard applied social pressure, leading teams to compete with one another.

Testing on the Toilet

The Testing Grouplet wanted to raise awareness of testing and considered doing so through email. However, a joke spawned the idea of putting posters in restroom stalls instead:

  1. The bathroom is a place everyone must visit at least once a day
  2. It could be implemented cheaply

Today's Testing Culture

In 2015, Google launched 'Project Health' (pH) as a replacement for 'Test Certified'.

It automatically gathers dozens of metrics on the health of a project, including test coverage and test latency, and makes them available. Like 'Test Certified', pH uses a five-level rating, and every project with a continuous build automatically receives a pH score.

Limits of Automated Testing

Some metrics and behaviors of a system are difficult to test automatically and are left to human testers. For example, Google uses human 'Search Quality Raters' to execute real queries and record their impressions. Nuances of audio and video quality are also hard to measure in automated tests.

Humans also excel at 'exploratory testing': a creative process that treats the system under test as a puzzle to be broken, using unexpected data or sequences of steps. For example, complex security vulnerabilities are better discovered by humans, then added to automated security testing systems once found.