rstudio / shinytest2

https://rstudio.github.io/shinytest2/

`shinytest2` approach to saving/comparing images #4

Open schloerke opened 3 years ago

schloerke commented 3 years ago

This issue proposes a new approach to how images are stored and compared, in an attempt to dramatically reduce the number of snapshot files and the overall confusion.



@wch What are your thoughts?

I believe this is a win/win: we get fewer images to maintain, and we get fuzzy matching when testing on differing OS/R versions. (Exact matching on the master OS/R version; fuzzy matching on all other OS/R versions.) This would also allow Linux machines whose text rendering differs (e.g. bold vs. not bold) to avoid producing false-positive results in shinycoreci-apps.

cc @MadhulikaTanuboddi

schloerke commented 3 years ago

Talking with @jcheng5 , it is good to be reminded of the questions we are trying to address.

Questions we are trying to address


Notes for future fuzzy matching (if implemented): fingerprinting is good for comparing image X against many other images. In our situation, however, we have exactly two images in hand. A better heuristic might be to look at the diff matrix of the two images and draw conclusions from there. Ex: > 2% of the image differing is a failure; a differing 20px square is a failure; etc.
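A minimal sketch of that diff-matrix heuristic (a hypothetical helper, not part of the shinytest2 API), assuming both screenshots have been read in as equal-sized RGBA arrays, e.g. with `png::readPNG()`:

```r
# Hypothetical fuzzy-match heuristic: compare two screenshots pixel by pixel.
# `old` and `new` are numeric arrays of shape height x width x 4 (RGBA in [0, 1]),
# e.g. as returned by png::readPNG(). Not shinytest2 code.
images_roughly_equal <- function(old, new, max_diff_frac = 0.02) {
  stopifnot(identical(dim(old), dim(new)))
  # Per-pixel diff matrix: TRUE wherever any RGBA channel differs
  diff_matrix <- apply(abs(old - new) > 0, c(1, 2), any)
  # Fail when more than (say) 2% of pixels differ
  mean(diff_matrix) <= max_diff_frac
}
```

The "20px square" rule mentioned above would be a second pass over `diff_matrix`, looking for a contiguous differing region of that size.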

schloerke commented 3 years ago

Talking through these situations over dinner, we need to consider the full set of steps of resolving an error.

Approach: Store all variants as truth

Approach: Store single variant as truth and fuzzy match others

Using the new approach...


Known testing problems:


Proposal

Hybrid approach: allow both a single variant and all variants, and let each approach accept tolerances.

I believe the above approach will address most situations and give freedom to users / developers.

schloerke commented 3 years ago

Because the OS/RVersion is so important, I believe there should be two different methods, not tied to variant.

(Poorly named) Example:

rpodcast commented 2 years ago

This is arguably one of the most complex issues in the general shiny testing workflow. One unfortunate problem in my day-job environment is that our internal RStudio Workbench and RStudio Connect servers run on RHEL 7, while CI/CD runs through GH Actions. It does not look like GH Actions will support RHEL or CentOS anytime soon on the Linux side, so I won't be able to do any robust evaluation of plot or app screenshots when the tests run on GH.

(Poorly named) Example:

* `st2_expect_platform_snapshot(app)`; `st2_expect_snapshot(app, os_platform = TRUE)`

* `st2_expect_original_snapshot(app)`; `st2_expect_snapshot(app, os_platform = FALSE)`

I like this approach, and would be interested in collaborating on a solution to this. I know it is likely too complex for the initial release, but it is an issue I'm motivated to solve.

schloerke commented 2 years ago

> I like this approach, and would be interested in collaborating on a solution to this. I know it is likely too complex for the initial release, but it is an issue I'm motivated to solve.

@rpodcast Awesome, I'll take you up on that for another release!


Yes, I'd like this feature to live in {testthat}, as {shinytest2} snapshots are only a vehicle to produce {testthat} snapshots.

Current thought process:

Current questions:

schloerke commented 2 years ago

Update to:

> Because the OS/RVersion is so important, I believe there should be two different methods and not tied to variant.
>
> (Poorly named) Example:
>
> * `st2_expect_platform_snapshot(app)`; `st2_expect_snapshot(app, os_platform = TRUE)`
> * `st2_expect_original_snapshot(app)`; `st2_expect_snapshot(app, os_platform = FALSE)`

app$expect_values() saves a JSON file of value content and also saves a debug screenshot. This screenshot never causes an assertion failure, but it should be checked into Git so that it is visible when viewing diffs. Only the accepted debug screenshot should be kept in Git; new debug screenshots should be .gitignore'd.
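For example, a test built on `app$expect_values()` asserts only on the JSON values; the accompanying screenshot is purely informational. A sketch (the app path and the output name `plot` are assumptions for illustration):

```r
library(shinytest2)

test_that("app values are stable", {
  # "myapp" and the input/output names are placeholders
  app <- AppDriver$new(app_dir = "myapp")
  app$set_inputs(n = 50)
  # Asserts on the saved values JSON; also writes a debug screenshot
  # that is recorded for diff viewing but never fails the test
  app$expect_values(output = "plot")
})
```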

app$expect_screenshot() will fail if the captured screenshot differs at all. This method should only be used as a last resort, as it is very brittle. A variant in AppDriver$new(variant=) is required when calling app$expect_screenshot().

Both methods will listen to the app's variant value provided at initialization of app <- AppDriver$new(variant=).
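A sketch of how the variant flows from initialization to the screenshot expectation (the app path is a placeholder):

```r
library(shinytest2)

test_that("screenshot matches the per-platform snapshot", {
  # platform_variant() produces a value like "linux-4.2", so each
  # OS/R version combination keeps its own snapshot directory as truth
  app <- AppDriver$new(app_dir = "myapp", variant = platform_variant())
  # Pixel-exact comparison against this variant's stored screenshot;
  # errors if no variant was supplied at initialization
  app$expect_screenshot()
})
```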


I am expecting users to not use variant and to not use app$expect_screenshot().

I am expecting the Shiny team or other UI packages to use app$expect_screenshot() with many variants.

rpodcast commented 2 years ago

> I am expecting users to not use variant and to not use app$expect_screenshot().
>
> I am expecting the Shiny team or other UI packages to use app$expect_screenshot() with many variants.

I will want to discuss that assumption in more detail. My previous projects around shinytest involved ensuring the app at least rendered "something" to the user, not necessarily the exact contents (which may be handled more directly by comparing the values comprising the output, such as a table). Hence my ideal is an assertion that something appeared. Once I begin to migrate a very large-scale shinytest effort from a few years ago, I'll be able to add more context.

danielinteractive commented 2 years ago

Very naive question: I guess there is no way to do vector graphics comparisons a la vdiffr in this case, right?

schloerke commented 2 years ago

@danielinteractive Unfortunately no.

vdiffr transforms ggplot2 objects into `<svg />` representations, as there is a 1:1 conversion between the underlying {grid} layout and what an `<svg />` can produce. In addition, all of the CSS information is embedded into the `<svg />`.

While {shinytest2} does have access to the page `<html />`, we would have to decide which CSS properties to retrieve for every node. It could work, and with most CSS property names requested it would probably be fairly useful for anything that isn't rendered as binary content across platforms. We expose `$expect_html()` to perform the `<html />` comparisons; I thought the computed CSS information would be too overwhelming and brittle, so I did not include it in that method.
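For example, the HTML of a single output can be snapshotted without touching computed CSS (the selector `#summary` is a placeholder for illustration):

```r
library(shinytest2)

# "myapp" and the "#summary" selector are placeholders
app <- AppDriver$new(app_dir = "myapp")
# Snapshots the serialized HTML of the matched node(s); structural
# changes fail the test, but computed CSS is not part of the comparison
app$expect_html("#summary")
```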

However, this approach fails when R uses a different font for its plots, as the text is rendered into the image itself. So we would not be able to compare two plot images byte for byte across platforms. (We currently have this issue, which is why we suggest using variant = platform_variant() to maintain multiple images.)

schloerke commented 2 years ago

@rpodcast Good news. I've been green-lit to work on {testthat} snapshot tooling. This includes snapshot comparisons for GitHub Actions. It'll be a bit before work starts, but it will come eventually! The goal will be to automate retrieving any {testthat} snapshots from GHA (given the latest GHA run) so they can be compared locally. 🤞

Until that is implemented, the approach to comparing images is not complete, so I will keep this issue open.


shinytest2 approach to saving images has been completed and is explained here.

In addition, fuzzy image comparisons are implemented in this PR.

Alik-V commented 1 year ago

Hi @schloerke, thank you for this package! I've been playing with it and have started running into the problems described in this issue, where tests created on a Windows machine do not line up with Linux runners via GHA. I've seen that there is a merged PR with fuzzy comparison: is it implemented in the latest dev version of the package? Is there documentation I could peek into describing how to use the new functionality?

schloerke commented 1 year ago

@Alik-V Glad you like the package!

Check out the docs for AppDriver$expect_screenshot(). Note there is a bug in v0.2.0 on CRAN that throws an error if the images are different and threshold is set. The difference value is more of a distance calculation than a TRUE/FALSE calculation. Ex: a transparent white pixel compared to a solid black pixel has a total difference of 4, as all four of the RGBA channels are 100% different. Whereas a pixel that is slightly off due to rounding errors in Chrome will have a small distance, ~= 3 * 1 / 256 ~= 1%.
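The distance arithmetic from the two examples above can be reproduced directly (plain R, not shinytest2 internals):

```r
# Per-pixel distance: sum of absolute per-channel differences,
# with each RGBA channel scaled to [0, 1]
pixel_distance <- function(p1, p2) sum(abs(p1 - p2))

# Transparent white vs. solid black: all four channels fully different
pixel_distance(c(1, 1, 1, 0), c(0, 0, 0, 1))
# 4

# A slight RGB rounding difference of 1/256 per channel, alpha unchanged
pixel_distance(c(0.5, 0.5, 0.5, 1), c(0.5 + 1/256, 0.5 + 1/256, 0.5 + 1/256, 1))
# 3/256, roughly 1%
```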

However, I do not expect the fuzzy threshold to help enough that images taken on Windows will match images taken on Linux. If you have custom fonts and no base plots it might work, but I'm not hopeful.

Example usage:
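A sketch of what that might look like (the app path and threshold value are illustrative, assuming a shinytest2 version where `expect_screenshot()` accepts `threshold`):

```r
library(shinytest2)

# "myapp" is a placeholder; variant keeps per-platform snapshots
app <- AppDriver$new(app_dir = "myapp", variant = platform_variant())
# threshold = 0.05 tolerates small per-pixel distances (e.g. Chrome
# rounding noise) while still failing on visible differences
app$expect_screenshot(threshold = 0.05)
```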

Alik-V commented 1 year ago

@schloerke Thank you for the explanations and for pointing me towards the documentation!

At the moment, I have opted to use variant = platform_variant() and to record the Windows and Linux test results separately (Linux recorded through Workbench), but it would certainly be easier if I were able to use a single results folder via variant = NULL. I will play with thresholds; I like this approach to snapshot testing.