Open nikclayton opened 1 year ago
I have extensive experience with screenshot testing, and I believe that it's extreme overkill for this kind of app.
It introduces extra dependencies and adds complexity to contribution, project management, and CI, to (in the ideal case) catch a type of bug that occurs relatively rarely in Tusky.
IMO this energy would be better spent doing things like adding standard test coverage to areas where it's lacking (currently a large part of the codebase) or hardening Tusky against nonstandard api responses (a frequent source of (easily fixable, but often only discovered via crash analytics) real bugs).
I am using screenshot tests professionally and I would also advise against using them in Tusky.
Paparazzi + Jetpack Compose is ok-ish, everything else more of a hassle than it is helpful.
It introduces extra dependencies and adds complexity to contribution, project management, and CI, to (in the ideal case) catch a type of bug that occurs relatively rarely in Tusky.
I looked at the most recently merged 100 PRs (going back to Jun 11). Of those, the following all fixed user-visible UI bugs:
That's 11%
7 of those were regressions from previously working behaviour that testing and PR review did not find.
Of those last 100 PRs, 38 were translations or release process commits (i.e., bumping the versionCode). If those are excluded then it's 18% of recent PRs that have corrected issues that screenshot tests would have caught.
Based on that I don't think the assertion that these are bugs that "occur relatively rarely in Tusky" is accurate.
It introduces extra dependencies and adds complexity to contribution, project management, and CI,
As outlined above, it does not need to add significant complexity to contribution, as screenshots can be generated on CI and attached to PRs so contributors do not face an extra burden.
IMO this energy would be better spent doing things like adding standard test coverage to areas where it's lacking (currently a large part of the codebase) or hardening Tusky against nonstandard api responses (a frequent source of (easily fixable, but often only discovered via crash analytics) real bugs).
This is not an either-or proposition. There is nothing preventing you from submitting PRs that address those or other areas.
I don't see much regressions* in the list of 11 issues? The apng problem (#3631).
I don't see much regressions* in the list of 11 issues? The apng problem (#3631).
* Something that worked before.
The 7 regressions (and/or issues that should have been caught when the initial feature was merged) are:
https://github.com/tuskyapp/Tusky/issues/3973 is an example of issue where screenshot tests would make verifying the fix much easier, as the fix involves replacing how HTML is converted to spanned text.
https://github.com/tuskyapp/Tusky/pull/3921 is an open PR that would also be a lot easier to review with screenshot tests, as it would be trivial to verify that the color resource changes did not cause any unexpected UI changes.
RFC: Introduce screenshot testing
tl;dr
The rest of this document explains how I came to these conclusions.
Why do I want to do this?
GitHub has support for viewing (some) image diffs in the browser, and seeing the visual diffs between the before and after versions of the image. This is described at https://docs.github.com/en/repositories/working-with-files/using-files/working-with-non-code-files#rendering-and-diffing-images, and can make reviewing PRs that contain UI changes very pleasant.
Choosing a screenshot file format
Screenshots have to be saved in some format. The choice of format will affect the size of the repository and the ease of working with the diffs.
A usable format must be:
Some initial research suggests that candidate formats are PNG, WEBP, JPEG-XL, and AVIF.
Spoiler: We can only use PNG. But the investigation was still useful.
File sizes
I looked at the size of all of them, with PNG as the baseline.
To get a sense of likely image sizes I took a screenshot of my home timeline from the emulator. The size of this file was 498,499 bytes.
For each additional format (WEBP, JPEG-XL, AVIF) I first converted the source PNG to the target format. I then compared the two files using
and verified that
diff.png
showed no differences.In addition, naive PNG images can include additional unnecessary information, tools like
pngcrush
can be used to remove that and shrink the file size further. So I compared the original PNG file with one processed bypngcrush
as well.Comparing the different filesizes gave the following results:
The commands used to do the conversion:
pngcrush:
WEBP:
AVIF:
JPEG XL:
Github support
Unfortunately, although both WEBP and JPEG XL are considerably smaller than PNG, Github does not support WEBP or JPEG XL at this time. Viewing a WEBP or JPEG XL file in the browser at GitHub does not offer a preview, only an option to download the file. The image diff view functionality does not work either.
https://github.com/orgs/community/discussions/5470 is the open request for GitHub to support WEBP.
https://github.com/orgs/community/discussions/18139 is the open request for GitHub to support JPEG XL.
File format conclusion
Until GitHub supports WEBP or JPEG XL, PNG is the only usable format. As soon as GitHub does support one of those other formats the saved screenshots can be converted with no loss of quality and a ~ 45-55% space saving.
Repository impact
Size estimations
With PNG as the file format it's possible to do some back-of-the-envelope calculations on the impact this would have on the repository size.
The app consists of approximately 60 distinct screens (~ 30 activities, ~ 30 fragments).
Each of those can be displayed in two different orientations, and three different colour themes.
That's 60 x 2 x 3 = 360 distinct states to test, and doesn't include additional states we will want to check (e.g., displaying a particular dialog, or menu, or displaying a status in bookmarked and un-bookmarked start). So that's a conservative lower bound on the number of screenshots.
That also assumes screenshot tests are taken against a single device. We will probably want to test against devices that represent several different screensizes in the future, so this number could easily triple or quadruple.
360 screenshots at ~ 500KiB = ~ 180,000 KiB or ~ 175MiB.
And this doesn't consider the repository growth as images are modified over time. Or the impact if we are able to migrate from PNG to JPEG XL or WEBP at some future point in time.
Options for storing these images are:
Directly stored in the main repository
GitHub recommends repositories are ideally less than 1GB, and definitely less than 5GB (https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github#repository-size-limits).
Assuming the file sizes are in the right ballpark this is within their limits, although adding additional devices to the tests could prove problematic in the future.
Directly storing images in the main repository is also a decision that's hard to walk back from. While moving images to a different repository or storage format is possible, the history of the images prior to deletion will remain in the repository, and if we need to remove that history for performance reasons it requires some moderately invasive surgery on the repository, with an impact on every open PR (https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository).
Submodules
It is possible to store the images in a separate repository and use Git submodules to make them available. For example, the JPEG XL repository does this to store their reference test data in a different repository (see the
testdata/
directory at https://github.com/libjxl/libjxl).However, this considerably complicates the developer workflow if they are working on a single PR that affects files in the main repository and a submodule, and the opportunity for mistakes is high.
LFS
Git Large File Storage is the third option, and I believe it's the most appropriate.
git-lfs
if they want to work with any of the screenshots.On that last point, I created a test repository configured to track PNG files on LFS, https://github.com/nikclayton/lfs-test.
https://github.com/nikclayton/lfs-test/blob/main/test.png is the test PNG file, stored on LFS.
https://github.com/nikclayton/lfs-test/commit/d4ad7e52a85282a0ade5f7a09d72363b7bde517e is a commit that changed that file, showing GitHub's image diffing support works for that file.
How to perform screenshot tests?
I'm aware of five well-supported and actively developed different libraries that could be used to implement screenshot tests.
Cashapp Paparazzi
Paparazzi is probably the fastest at running tests; it doesn't test whole activities or fragments, it tests individual views, on the JVM, locally. This is much faster than launching tests on emulators or real devices.
This is great for verifying that a change to a view hasn't resulted in an unexpected change to that view. But it doesn't verify the view's appearance in-situ.
Paparazzi can also only test views that are part of library projects, not application projects. This would require a reorganisation of the Tusky code to move most of it in to a library (or several libraries) with a much smaller application project that imports from the library.
This might be worth doing anyway, it would have a positive effect on incremental build times. But that's a separate discussion.
The Paparazzi issue tracking support for application modules is https://github.com/cashapp/paparazzi/issues/107.
Dropbox Dropshots
Relatively new, although last update was 4 months ago (at the time of writing).
Possibly does too much? It claims to do the image verification and the generation of comparison images for review, which might not be necessary in our case (e.g., image comparisons can be seen in Android Studio with a plugin like https://plugins.jetbrains.com/plugin/12691-image-diff).
Claims to run the tests on the device, for better integration with Android Studio. Other libraries run the code to generate the screenshots, then download the screenshots from the device to compare.
Facebook's screenshot-tests-for-android
Doesn't work on Windows hosts. Since that's what I develop on it's a non-starter.
Shot
The oldest of the libraries I believe. Last significant release was May 2022.
Ndtp
Of all of them appears to the one under most active development.
Has an Android Studio plugin for easy integration.
Possible concerns
Does this make it harder to submit PRs?
No.
The typical workflow stays the same. You write code, test, etc, and then submit a PR.
If you have made UI changes that affect the screenshots this PR may include those changes, but it does not have to.
Irrespective of whether the PR includes new screenshots or not, when the PR is made (and on every subsequent commit to the PR) a GitHub action starts a workflow that:
The PR author can then use GitHub's built in image-diff capabilities to review the changes between their branch and the screenshot branch.
If their changes are anticipated they merge that branch in to the PR, and everything is good.
If their changes are not anticipated then the tests have discovered a regression, and they need to fix it.
https://proandroiddev.com/no-emulator-needed-for-screenshot-tests-of-compose-previews-on-ci-d292fa0eb26e
If there are no changes
How long will it take for the tests to complete?
Don't know yet. That's why an experiment is worthwhile.
What about false positives / flaky tests?
In my experience screenshot tests do introduce a potential for false positive / flaky test failures.
Common fixes for screenshot test flakiness include:
All the libraries mentioned attempt to deal with this.
There are also reports of the Android emulator generating different screenshots when run on different operating systems (i.e., you can get different screenshots on Linux / MacOS / Windows).
That can be mitigated by treating the screenshots generated on CI as "golden". See https://github.com/sergio-sastre/Android-screenshot-testing-playground section "Rending problems" for more on this.