tuskyapp / Tusky

An Android client for the microblogging server Mastodon
https://tusky.app
GNU General Public License v3.0

Screenshot tests #3869

nikclayton opened this issue 1 year ago

nikclayton commented 1 year ago

RFC: Introduce screenshot testing

tl;dr

Use PNG for the screenshots (currently the only format GitHub can preview and diff), store the images with Git LFS rather than directly in the repository, and generate them on CI so contributors are not forced to run the tests locally.

The rest of this document explains how I came to these conclusions.

Why do I want to do this?

GitHub has support for viewing (some) image diffs in the browser, and seeing the visual diffs between the before and after versions of the image. This is described at https://docs.github.com/en/repositories/working-with-files/using-files/working-with-non-code-files#rendering-and-diffing-images, and can make reviewing PRs that contain UI changes very pleasant.

Note: https://github.com/sergio-sastre/Android-screenshot-testing-playground has a lot of excellent material on this topic.

Choosing a screenshot file format

Screenshots have to be saved in some format. The choice of format will affect the size of the repository and the ease of working with the diffs.

A usable format must be lossless, reasonably compact, and viewable and diffable on GitHub.

Some initial research suggests that candidate formats are PNG, WEBP, JPEG-XL, and AVIF.

Spoiler: We can only use PNG. But the investigation was still useful.

File sizes

I looked at the size of all of them, with PNG as the baseline.

To get a sense of likely image sizes I took a screenshot of my home timeline from the emulator. The size of this file was 498,499 bytes.

For each additional format (WEBP, JPEG-XL, AVIF) I first converted the source PNG to the target format. I then compared the two files using

magick compare original.png new-file -compose Src diff.png

and verified that diff.png showed no differences.

In addition, a naively produced PNG can include unnecessary metadata; tools like pngcrush can remove it and shrink the file size further. So I also compared the original PNG file with one processed by pngcrush.

Comparing the different filesizes gave the following results:

| Format  | Filesize (bytes) | Notes           | Ratio |
|---------|------------------|-----------------|-------|
| PNG     | 498,499          | Original image  | 1.00  |
| PNG     | 460,314          | pngcrush        | 0.92  |
| WEBP    | 288,272          | cwebp           | 0.57  |
| WEBP    | 281,750          | ImageMagick (1) | 0.56  |
| AVIF    | 715,629          | avifenc         | 1.43  |
| JPEG XL | 220,589          | cjxl            | 0.44  |
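The Ratio column can be reproduced from the byte counts with a quick sketch (the label names are mine; note the table appears to truncate, rather than round, to two decimal places):

```python
# Measured file sizes (bytes) from the experiment above.
sizes = {
    "PNG (original)": 498_499,
    "PNG (pngcrush)": 460_314,
    "WEBP (cwebp)": 288_272,
    "WEBP (ImageMagick)": 281_750,
    "AVIF (avifenc)": 715_629,
    "JPEG XL (cjxl)": 220_589,
}

baseline = sizes["PNG (original)"]

# Truncate to two decimal places, matching the table.
ratios = {name: int(size / baseline * 100) / 100 for name, size in sizes.items()}

for name, ratio in ratios.items():
    print(f"{name:>20}: {ratio:.2f}")
```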

The commands used to do the conversion:

pngcrush:

pngcrush -brute -rem tEXt -rem tIME -rem iTXt -rem zTXt -rem gAMA \
  -rem cHRM -rem iCCP -rem sRGB -rem alla -rem text \
  original-screenshot.png output-pngcrush.png

Note: This was a very slow process, taking ~ 60 seconds

WEBP:

cwebp -lossless -o output.webp original-screenshot.png

AVIF:

avifenc --lossless --speed 0 original-screenshot.png output.avif

JPEG XL:

cjxl original-screenshot.png original-screenshot.jxl --quality 100 \
   --effort 9 --brotli_effort=11

GitHub support

Unfortunately, although both WEBP and JPEG XL are considerably smaller than PNG, GitHub supports neither format at this time. Viewing a WEBP or JPEG XL file on GitHub offers no preview, only an option to download the file, and the image diff view does not work either.

https://github.com/orgs/community/discussions/5470 is the open request for GitHub to support WEBP.

https://github.com/orgs/community/discussions/18139 is the open request for GitHub to support JPEG XL.

File format conclusion

Until GitHub supports WEBP or JPEG XL, PNG is the only usable format. As soon as GitHub supports one of those formats the saved screenshots can be converted with no loss of quality and a ~44-56% space saving.

Repository impact

Size estimations

With PNG as the file format it's possible to do some back-of-the-envelope calculations on the impact this would have on the repository size.

The app consists of approximately 60 distinct screens (~ 30 activities, ~ 30 fragments).

Each of those can be displayed in two different orientations, and three different colour themes.

That's 60 x 2 x 3 = 360 distinct states to test, and that doesn't include additional states we will want to check (e.g., displaying a particular dialog or menu, or displaying a status in its bookmarked and un-bookmarked states). So 360 is a conservative lower bound on the number of screenshots.

That also assumes screenshots are taken on a single device. We will probably want to test against devices representing several different screen sizes in the future, so this number could easily triple or quadruple.

360 screenshots at ~500 KiB each = ~180,000 KiB, or ~175 MiB.
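The back-of-the-envelope arithmetic can be checked with a quick sketch (the screen, orientation, theme, and size figures are the estimates from above):

```python
screens = 60          # ~30 activities + ~30 fragments
orientations = 2      # portrait and landscape
themes = 3            # three colour themes
avg_size_kib = 500    # the sample screenshot was 498,499 bytes

screenshots = screens * orientations * themes
total_kib = screenshots * avg_size_kib
total_mib = total_kib / 1024

print(screenshots)     # 360
print(total_kib)       # 180,000 KiB
print(int(total_mib))  # ~175 MiB
```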

And this doesn't consider the repository growth as images are modified over time. Or the impact if we are able to migrate from PNG to JPEG XL or WEBP at some future point in time.

Options for storing these images are:

Directly stored in the main repository

GitHub recommends repositories are ideally less than 1GB, and definitely less than 5GB (https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github#repository-size-limits).

Assuming the file sizes are in the right ballpark this is within their limits, although adding additional devices to the tests could prove problematic in the future.

Directly storing images in the main repository is also a decision that's hard to walk back from. Moving the images to a different repository or storage format is possible, but the history of the images prior to deletion remains in the repository. If that history later has to be removed for performance reasons it requires moderately invasive surgery on the repository, with an impact on every open PR (https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository).

Submodules

It is possible to store the images in a separate repository and use Git submodules to make them available. For example, the JPEG XL repository does this to store their reference test data in a different repository (see the testdata/ directory at https://github.com/libjxl/libjxl).

However, this considerably complicates the workflow for a developer working on a single PR that affects files in both the main repository and a submodule, and the opportunity for mistakes is high.

LFS

Git Large File Storage is the third option, and I believe it's the most appropriate: the images live in LFS, the main repository stays small, and GitHub's preview and image-diff features still work for LFS-tracked files.

To confirm that last point I created a test repository configured to track PNG files on LFS, https://github.com/nikclayton/lfs-test.

https://github.com/nikclayton/lfs-test/blob/main/test.png is the test PNG file, stored on LFS.

https://github.com/nikclayton/lfs-test/commit/d4ad7e52a85282a0ade5f7a09d72363b7bde517e is a commit that changed that file, showing GitHub's image diffing support works for that file.

How to perform screenshot tests?

This is summarised from https://github.com/sergio-sastre/Android-screenshot-testing-playground

I'm aware of five well-supported, actively developed libraries that could be used to implement screenshot tests.

  1. Cashapp Paparazzi
  2. Dropbox Dropshots
  3. Shot from pedrovgs
  4. Facebook screenshot-tests-for-android
  5. Ndtp android-testify

Cashapp Paparazzi

Paparazzi is probably the fastest at running tests: it doesn't test whole activities or fragments, it tests individual views, on the JVM, locally. This is much faster than running tests on emulators or real devices.

This is great for verifying that a change to a view hasn't resulted in an unexpected change to that view. But it doesn't verify the view's appearance in-situ.

Paparazzi can also only test views that are part of library projects, not application projects. This would require reorganising the Tusky code to move most of it into a library (or several libraries), with a much smaller application project that imports from them.

This might be worth doing anyway, as it would have a positive effect on incremental build times. But that's a separate discussion.

The Paparazzi issue tracking support for application modules is https://github.com/cashapp/paparazzi/issues/107.

Dropbox Dropshots

Relatively new, although the last update was 4 months ago (at the time of writing).

Possibly does too much? It claims to do the image verification and the generation of comparison images for review, which might not be necessary in our case (e.g., image comparisons can be seen in Android Studio with a plugin like https://plugins.jetbrains.com/plugin/12691-image-diff).

Claims to run the tests on the device, for better integration with Android Studio. Other libraries run the code to generate the screenshots, then download the screenshots from the device to compare.

Facebook's screenshot-tests-for-android

Doesn't work on Windows hosts. Since that's what I develop on, it's a non-starter.

Shot

I believe this is the oldest of the libraries. The last significant release was May 2022.

Ndtp

Of all of them, this appears to be the one under the most active development.

Has an Android Studio plugin for easy integration.

Possible concerns

Does this make it harder to submit PRs?

No.

The typical workflow stays the same. You write code, test, etc, and then submit a PR.

If you have made UI changes that affect the screenshots this PR may include those changes, but it does not have to.

Note: I typically expect the PR author will include the screenshot changes; if you've been working on the UI and have an easy way to verify that your changes affect only what you expect, why wouldn't you use it, and run the screenshot tests locally before submitting the PR? Or at least a subset of them covering one device / orientation / theme combination. But this does not have to be enforced before the PR is submitted.

Irrespective of whether the PR includes new screenshots, when the PR is opened (and on every subsequent commit to it) a GitHub Actions workflow runs the screenshot tests and, if any screenshots have changed, commits the updated images to a companion branch.

The PR author can then use GitHub's built-in image-diff capabilities to review the changes between their branch and the screenshot branch.

If the changes are anticipated they merge that branch into the PR, and everything is good.

If their changes are not anticipated then the tests have discovered a regression, and they need to fix it.

A similar CI-driven approach is described at https://proandroiddev.com/no-emulator-needed-for-screenshot-tests-of-compose-previews-on-ci-d292fa0eb26e

If there are no changes the workflow produces no screenshot diff, and there is nothing extra for the author to do.

How long will it take for the tests to complete?

Don't know yet. That's why an experiment is worthwhile.

What about false positives / flaky tests?

In my experience screenshot tests do introduce a potential for false positive / flaky test failures.

This is where a test passes (or fails) on one run, and then has the opposite result on the next run despite no code changing.

Common fixes for screenshot test flakiness include disabling animations, pinning things like the clock, locale, and font scale, and tolerating small per-pixel differences when comparing images.

All the libraries mentioned attempt to deal with this.
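As an illustration of the tolerance idea, a simplified comparison (a sketch, not any particular library's algorithm) might only fail a test when enough pixels differ by more than a small per-channel threshold:

```python
def images_match(expected, actual, channel_tolerance=3, max_differing_fraction=0.001):
    """Compare two images given as equal-length lists of (r, g, b) tuples.

    A pixel "differs" if any channel differs by more than channel_tolerance;
    the images match if the fraction of differing pixels is small enough.
    """
    if len(expected) != len(actual):
        return False
    differing = sum(
        1
        for p, q in zip(expected, actual)
        if any(abs(a - b) > channel_tolerance for a, b in zip(p, q))
    )
    return differing / len(expected) <= max_differing_fraction

# A one-unit anti-aliasing wobble on a single pixel is ignored...
golden = [(255, 255, 255)] * 1000
wobbly = [(254, 255, 255)] + [(255, 255, 255)] * 999
print(images_match(golden, wobbly))   # True

# ...but a run of genuinely different pixels is caught.
broken = [(0, 0, 0)] * 10 + [(255, 255, 255)] * 990
print(images_match(golden, broken))   # False
```

Real libraries vary in how they define the threshold (per-channel, per-pixel distance, or a perceptual metric), but the trade-off is the same: too strict and anti-aliasing noise causes flakes, too loose and real regressions slip through.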

There are also reports of the Android emulator generating different screenshots when run on different operating systems (i.e., you can get different screenshots on Linux / macOS / Windows).

That can be mitigated by treating the screenshots generated on CI as "golden". See the "Rendering problems" section of https://github.com/sergio-sastre/Android-screenshot-testing-playground for more on this.

Tak commented 1 year ago

I have extensive experience with screenshot testing, and I believe that it's extreme overkill for this kind of app.

It introduces extra dependencies and adds complexity to contribution, project management, and CI, to (in the ideal case) catch a type of bug that occurs relatively rarely in Tusky.

IMO this energy would be better spent doing things like adding standard test coverage to areas where it's lacking (currently a large part of the codebase) or hardening Tusky against nonstandard API responses, a frequent source of real bugs that are easily fixable but often only discovered via crash analytics.

connyduck commented 1 year ago

I am using screenshot tests professionally and I would also advise against using them in Tusky.

Paparazzi + Jetpack Compose is ok-ish, everything else more of a hassle than it is helpful.

nikclayton commented 1 year ago

It introduces extra dependencies and adds complexity to contribution, project management, and CI, to (in the ideal case) catch a type of bug that occurs relatively rarely in Tusky.

I looked at the most recently merged 100 PRs (going back to Jun 11). Of those, the following all fixed user-visible UI bugs:

That's 11%.

7 of those were regressions from previously working behaviour that testing and PR review did not find.

Of those last 100 PRs, 38 were translation or release-process commits (i.e., bumping the versionCode). If those are excluded then 18% of the remaining PRs corrected issues that screenshot tests would have caught.
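The percentages work out as follows (a quick check of the arithmetic; the counts are from the PR survey above):

```python
total_prs = 100
ui_bug_fixes = 11
translation_or_release = 38

overall = ui_bug_fixes / total_prs * 100
substantive = total_prs - translation_or_release   # 62 PRs with real code changes
adjusted = ui_bug_fixes / substantive * 100

print(f"{overall:.0f}%")    # 11%
print(f"{adjusted:.0f}%")   # ~18%
```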

Based on that I don't think the assertion that these are bugs that "occur relatively rarely in Tusky" is accurate.

It introduces extra dependencies and adds complexity to contribution, project management, and CI,

As outlined above, it does not need to add significant complexity to contribution, as screenshots can be generated on CI and attached to PRs so contributors do not face an extra burden.

IMO this energy would be better spent doing things like adding standard test coverage to areas where it's lacking (currently a large part of the codebase) or hardening Tusky against nonstandard API responses, a frequent source of real bugs that are easily fixable but often only discovered via crash analytics.

This is not an either-or proposition. There is nothing preventing you from submitting PRs that address those or other areas.

Lakoja commented 1 year ago

I don't see many regressions* in the list of 11 issues? The apng problem (#3631).

nikclayton commented 1 year ago

I don't see many regressions* in the list of 11 issues? The apng problem (#3631).

* Something that worked before.

The 7 regressions (and/or issues that should have been caught when the initial feature was merged) are:

nikclayton commented 1 year ago

https://github.com/tuskyapp/Tusky/issues/3973 is an example of an issue where screenshot tests would make verifying the fix much easier, as the fix involves replacing how HTML is converted to spanned text.

https://github.com/tuskyapp/Tusky/pull/3921 is an open PR that would also be a lot easier to review with screenshot tests, as it would be trivial to verify that the color resource changes did not cause any unexpected UI changes.