microsoft / playwright

Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.
https://playwright.dev
Apache License 2.0

[Question] Visual testing in docker on different CPU architecture #13873

Open nickofthyme opened 2 years ago

nickofthyme commented 2 years ago

Hey team 👋🏼

I am working to migrate my puppeteer visual regression testing to playwright.

My team has people working on Macs with either arm64 (M1 SoC) or amd64 (Intel) CPU architectures.

I'd like a way to run and update playwright tests/screenshots locally from either architecture and have the local screenshots match the screenshots running from the CI (linux/amd64).

Currently we use the mcr.microsoft.com/playwright:vx.x.x-focal docker image to run tests both locally and in the CI. However, running on these different architectures produces screenshots that differ ever so slightly between the two; the differences are virtually imperceptible.

Screenshot from M1 mac - arm64

[image]

Screenshot from Intel mac - amd64

[image]

Diff screenshot

[image]

Diff gif

[Screen Recording 2022-05-03 at 07 27 31 AM]


So my question is: does anyone have a good strategy to avoid the above failures across these two architectures without reducing the threshold?

I've tried running docker with --platform=linux/amd64 on my M1 mac, but I ran into https://github.com/microsoft/playwright/issues/13724#issuecomment-1112358113 when running the tests, even on the latest docker version (v20.10.8) with Rosetta 2 installed. Sounds like this could just be a known issue with docker.

dgozman commented 2 years ago

@nickofthyme No good news, unfortunately. Overall, this sounds like a docker issue. It is expected that the arm docker image and the intel docker image produce different screenshots - after all, they contain different libraries/executables. Ideally, you would force the intel image everywhere, as you tried, but I guess that does not work, per the issue you linked above.

As for mitigation, we have maxDiffPixels and maxDiffPixelRatio in addition to threshold, but the diff image suggests that you'd need pretty big values, so the tests would not be that useful.
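
For illustration, a minimal sketch of where these three knobs can be set project-wide; the values below are placeholders, not recommendations:

// playwright.config.ts - illustrative values only
import type { PlaywrightTestConfig } from '@playwright/test';

const config: PlaywrightTestConfig = {
  expect: {
    toMatchSnapshot: {
      threshold: 0.2,          // per-pixel color tolerance (passed through to pixelmatch)
      maxDiffPixels: 100,      // absolute number of differing pixels tolerated
      maxDiffPixelRatio: 0.01, // fraction of differing pixels tolerated
    },
  },
};
export default config;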

@aslushnikov Any more ideas?

aslushnikov commented 2 years ago

@nickofthyme any chance you can supply me with a short repro that illustrates these image differences? I'd love to play with it!

nickofthyme commented 2 years ago

Yeah totally! I put up a branch at elastic/elastic-charts#buildkite-arch-test. It's a little clunky to run right now but this should work...

git clone https://github.com/elastic/elastic-charts.git
cd ./elastic-charts
git checkout buildkite-arch-test

Start web server on localhost:9002

yarn install
cd ./e2e && yarn install && cd ../

# generate web server files
yarn test:e2e:generate

# starts local web server
yarn test:e2e:server

Run playwright against the local web server on http://localhost:9002 (http://host.docker.internal:9002 from inside docker)

yarn test:e2e stylings_stories.test.ts

The key part is the --platform=linux/arm64 option in e2e/scripts/start.sh, which calls docker run with the necessary playwright options. So to generate the diff above you'd need to run this command under both linux/arm64 and linux/amd64, changing that --platform flag accordingly. The current screenshots are based on linux/arm64.

Let me know if that gives you enough to play with! 👍🏼

aslushnikov commented 2 years ago

Start web server on localhost:9002

@nickofthyme I got stuck here; first, Puppeteer failed to download its binary:

error /Users/andreylushnikov/tmp/node_modules/puppeteer: Command failed.
Exit code: 1
Command: node install.js
Arguments:
Directory: /Users/andreylushnikov/tmp/node_modules/puppeteer
Output:
The chromium binary is not available for arm64:
If you are on Ubuntu, you can install with:

  apt-get install chromium-browser

/Users/andreylushnikov/tmp/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserFetcher.js:112
            throw new Error();

I ran with PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1, but this yielded another error:

[5/5] 🔨 Building fresh packages...
[7/10] canvas
error /Users/andreylushnikov/tmp/node_modules/canvas: Command failed.
Exit code: 1
Command: node-pre-gyp install --fallback-to-build
Arguments:
Directory: /Users/andreylushnikov/tmp/node_modules/canvas

Followed by a few dozen error lines.

nickofthyme commented 2 years ago

You shouldn't need any puppeteer* deps, so if you just remove those from the top-level package.json that should fix things. What device/os/architecture are you trying to run it on?

aslushnikov commented 2 years ago

@nickofthyme I'm on M1 Mac 12.3, running with Node.js 16.13

nickofthyme commented 2 years ago

Interesting that's pretty similar to mine. Let me know if it still gives you trouble and I can try running it on mine from scratch.

aslushnikov commented 2 years ago

Yep, still the same issue with node-pre-gyp

nickofthyme commented 2 years ago

Ok I'm afk but I'll try it tomorrow and let you know.

nickofthyme commented 2 years ago

Ok, I removed all references to puppeteer (used in the old visual testing) and canvas (used in jest tests); neither should affect the playwright tests. I pushed the changes to elastic/elastic-charts#buildkite-arch-test.

Since you are on an M1 mac I also pushed a commit (https://github.com/elastic/elastic-charts/commit/dbfd2ed50fa932d5c32b5396a63f34b3761b2f34) updating all screenshots on my intel Mac so you can see the diff from your M1.

I was able to install and run the above commands on my M1 Mac 12.3.1, running with Node.js 16.13.2.

I show running the stylings_stories.test.ts file because it includes the most problematic screenshot diffs, but you can swap in any other file under ./e2e/tests/. There are some 1100 tests/screenshots in total, so I limit the run 😅

yarn test:e2e stylings_stories.test.ts

Let me know if you still have trouble with these changes.

uudens commented 2 years ago

Hey @nickofthyme 🙂 I'm wondering what you ended up doing with this? Did you manage to solve it or find some workaround? Thanks!

nickofthyme commented 2 years ago

No, not really 😞

We tried everything in the book to get to a point where we could update screenshots locally on either arm64 or amd64 CPU architecture, interchangeably. Regardless of what we did, there was always about 10% of our screenshots that failed with diffs ranging from 0.01% to over 5%. Relaxing the threshold to 5% would have defeated the utility of the screenshots, so that was never an option for us.

The only way we could make it consistent with a 0% threshold was to update the screenshots on CI machines, triggered via a GitHub PR comment (see https://github.com/elastic/elastic-charts/pull/1819#issuecomment-1245349329), and have CI commit them directly to the PR. Not the most ideal solution, but it does the trick for us.

I don't see this changing anytime soon, since the two docker images use completely different binaries and the --platform flag appears to be meant for interim "best effort" usage, per the docs below.

Some images do not support the ARM64 architecture. You can add --platform linux/amd64 to run (or build) an Intel image using emulation.

However, attempts to run Intel-based containers on Apple silicon machines under emulation can crash as qemu sometimes fails to run the container. In addition, filesystem change notification APIs (inotify) do not work under qemu emulation. Even when the containers do run correctly under emulation, they will be slower and use more memory than the native equivalent.

In summary, running Intel-based containers on Arm-based machines should be regarded as β€œbest effort” only. We recommend running arm64 containers on Apple silicon machines whenever possible, and encouraging container authors to produce arm64, or multi-arch, versions of their containers. We expect this issue to become less common over time, as more and more images are rebuilt supporting multiple architectures.

source: https://docs.docker.com/desktop/mac/apple-silicon/#known-issues

uudens commented 2 years ago

I see... It is what it is 🙏 Anyway, thank you very much for the summary! :bow:

aslushnikov commented 2 years ago

Hey @nickofthyme, thank you again for the repro. I came back to this issue and can reproduce it reliably.

I don't see this changing anytime soon, since the two docker images use completely different binaries and the --platform flag appears to be meant for interim "best effort" usage, per the docs below.

Yeah, QEMU emulation is very slow and not advanced enough to run browser binaries, so things are likely to crash / be very slow there. Running docker on M1 was the main motivation for us to provide aarch64 linux browser binaries.

Now, back to your repro. I researched this topic again, and it turns out that chromium rendering is not necessarily deterministic even within a single platform (see https://crbug.com/919955). The differences tend to show up in the anti-aliasing, so while the human eye won't be able to see any difference between the images, they are not pixel-for-pixel identical. In the case of different CPU architectures, the anti-aliasing differences are even more striking and easier to reproduce.

So with this, I draw the following conclusions:

  • Docker environment DOES guarantee structural identity, e.g. layout & fonts will be the same
  • Docker environment DOES NOT guarantee pixel-perfect screenshots

Realistically, though, the anti-aliasing differences are not important for the visual regression testing use cases. So with this, we need a pragmatic approach to discard anti-aliasing differences that might happen sporadically.

So far we have some tools to deal with anti-aliasing noise: we have maxDiffPixels and maxDiffPixelRatio that let you tolerate some pixels, in addition to the threshold option.

I see you chose to run threshold: 0 for your VRT tests. Do you consider it to be too big of a hammer to fight the anti-aliasing artifacts on the screenshots? Why? Have you been bitten by it before?

nickofthyme commented 2 years ago

@aslushnikov Thank you for the incredibly detailed review of this!


Realistically, though, the anti-aliasing differences are not important for the visual regression testing use cases. So with this, we need a pragmatic approach to discard anti-aliasing differences that might happen sporadically.

Agreed, this would be amazing! I looked at the differences by running pixelmatch directly against a subset of screenshots and found consistent results, though I haven't tested this against screenshots from different CPU architectures.

That being said, it would be great if playwright allowed passing all the options defined by pixelmatch, not only threshold as it currently stands 👇🏼. This could be useful for fine-tuning the results.

https://github.com/microsoft/playwright/blob/5f03bd9477465d9a6846a31a3cedc58046975989/packages/playwright-core/src/utils/comparators.ts#L61-L63
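
For context, a sketch of what running pixelmatch directly with its wider option set looks like (file names are placeholders; this assumes the pixelmatch and pngjs packages are installed):

// compare.ts - standalone pixelmatch comparison, illustrative only
import * as fs from 'fs';
import { PNG } from 'pngjs';
import pixelmatch from 'pixelmatch';

const expected = PNG.sync.read(fs.readFileSync('expected.png'));
const actual = PNG.sync.read(fs.readFileSync('actual.png'));
const diff = new PNG({ width: expected.width, height: expected.height });

const diffPixels = pixelmatch(expected.data, actual.data, diff.data, expected.width, expected.height, {
  threshold: 0,     // the per-pixel tolerance Playwright forwards today
  includeAA: false, // do not count detected anti-aliased pixels as differences - not exposed by Playwright
  alpha: 0.1,       // opacity of unchanged pixels in the diff output
});

fs.writeFileSync('diff.png', PNG.sync.write(diff));
console.log(`${diffPixels} pixels differ`);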


we have maxDiffPixels and maxDiffPixelRatio that let you tolerate some pixels.

We have some screenshots with animations that are flaky and produce sporadic results on identical test runs, even when disabling animations or waiting for them to complete. I'm guessing this is due to anti-aliasing, because the diff is imperceptible to the eye...

[image]

So we used maxDiffPixels to resolve this issue, see https://github.com/elastic/elastic-charts/pull/1804.

A side note, now that you mention it... It was strange to me that setting the maxDiffPixels option when calling toMatchSnapshot does not clear threshold and maxDiffPixelRatio from playwright.config.ts, which still govern whether the screenshot fails (i.e. 0.1 > 0). This forces me to manually reset them, see the snippet below.

// playwright.config.ts
const config: PlaywrightTestConfig = {
  expect: {
    toMatchSnapshot: {
      threshold: 0,
      maxDiffPixelRatio: 0,
      maxDiffPixels: 0,
    },
  },
}

// some.test.ts
function getSnapshotOptions(options?: ScreenshotDOMElementOptions) {
  if (options?.maxDiffPixels !== undefined) {
    // need to clear default options for maxDiffPixels to be respected, else could still fail on threshold or maxDiffPixelRatio
    return {
      threshold: 1,
      maxDiffPixelRatio: 1,
      maxDiffPixels: 1000000, // some large value
      ...options,
    };
  }
  return options;
}

expect(await page.screenshot()).toMatchSnapshot('landing-page.png', getSnapshotOptions({ maxDiffPixels: 5 }));

The current approach would be sensible to me in most other config override scenarios, but in this case all three options are tightly interdependent. Ideally, if I set any of the three directly in the options of toMatchSnapshot, the others should be reset to their defaults or their least restrictive values.

This may just be me, I could see an explanation for both ways, but thought I'd put it out there. 🤷🏼‍♂️


I see you chose to run threshold: 0 for your VRT tests. Do you consider this to be a too-big-of-a-hammer to fight the anti-aliasing artifacts on the screenshots? Why? Have you been bitten by it before?

I'd say we could probably get away with a looser threshold of, say, 0.1 with no noticeable impact. The 0 threshold has become somewhat of a point of pride for the stability of our VRT setup, starting with puppeteer using 0, which served us very well. That being said, we are testing very detailed canvas elements where even the smallest pixel difference could indicate a bug. This would not be on the order of tens of pixels, but we have caught issues where a 0.1 threshold would not have been sufficient to signal a bug/failure. Granted, I think our case is exceptionally rare; for the general DOM use case of toMatchSnapshot, a 0.5 threshold would be adequate to signal a failure.

All that being said, we do sometimes have PRs with large diffs that require going through hundreds of virtually identical screenshots, which is a pain when reviewing files in GitHub!

gselsidi commented 2 years ago

Reviving this a bit, as I'm noticing similar issues. A few weeks ago tests were passing on different machines with pretty tight tolerances.

But now I'm realizing things are failing. The goal is to create baseline images locally and run them on CI, but from the above that seems like an issue.

Are rendering issues chrome specific? or any browser based on chromium including firefox? Or maybe we can switch to Safari Webkit? or do they all have rendering issues that don't guarantee visual accuracy across different browsers???

Or is the only solution to run docker on one CPU architecture locally, and then it should be OK on CI?

The only way we're able to pass semi-accurately on two different OS X machines, both M1, is setting maxDiffPixelRatio to 0.05 (5%), which seems fairly high.

markov00 commented 2 years ago

Are rendering issues chrome specific? or any browser based on chromium including firefox? Or maybe we can switch to Safari Webkit? or do they all have rendering issues that don't guarantee visual accuracy across different browsers???

I'm pretty sure there is no way to guarantee visual accuracy across different browsers, because the rendering engine differs across browsers (with some exceptions) and can render with slight differences. Font rendering above all: I never managed to render the exact same text path on two different browsers.

Visual accuracy should instead be guaranteed by rendering on the same CPU architecture and the same browser, even if you are running the tests on different machines (i.e. CI and local). We used to run an x86 Docker image on both macOS (x86 arch) and a different x86 CI with no visual differences.

Instead, here we are highlighting that the difference happens when rendering on the same browser but under two different CPU architectures (e.g. comparing a screenshot generated with an x86 chrome-based image against one from an arm64 chrome-based image).

@gselsidi you mentioned:

The only way we're able to pass semi-accurately on two different OS X machines, both M1, is setting maxDiffPixelRatio to 0.05 (5%), which seems fairly high.

Does this mean that you also get different screenshots when using the same browser and the same CPU arch?

nickofthyme commented 2 years ago

What @markov00 said...

Also, a 5% maxDiffPixelRatio does seem a little high, but it really depends on your use case. For ours, even a few pixels of diff could signal a bug. But for testing the UI of a website, 5% may be just fine to catch the errors you care about. I'd suggest you run with the lower accuracy for a bit and see if it misses key errors, then make the call on if/how to increase the accuracy, based on the suggestions above.
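
For example, a sketch of loosening the comparison for a single screenshot while leaving strict project-wide defaults in place (the test name, URL, and value are illustrative):

// some.test.ts - per-screenshot override, illustrative only
import { test, expect } from '@playwright/test';

test('landing page', async ({ page }) => {
  await page.goto('https://example.com'); // placeholder URL
  expect(await page.screenshot()).toMatchSnapshot('landing-page.png', {
    maxDiffPixelRatio: 0.05, // tolerate up to 5% differing pixels for this screenshot only
  });
});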

gselsidi commented 2 years ago

I don't know anymore, screenshots behave weirdly. I just created baseline images on a Mac M1 and had a coworker run them on his Intel-based Mac; they all passed with a super strict 0.0001 ratio and 0 threshold. Yesterday a coworker with the same M1 would get constant failures.

nickofthyme commented 2 years ago

I don't recall ever seeing inconsistencies in failures (flakiness) for the same screenshots across differing CPU architectures. But I have seen flakiness related to the nature of the screenshot itself specifically when using animations.

aslushnikov commented 1 year ago

@nickofthyme Thank you again for your repro. I was able to get 247 pairs of actual/expected screenshots that all fail for me due to anti-aliasing. Hopefully we'll come up with something to fix this.

Everybody: if you have examples of PNG screenshots that are taken on the same browser and same OS yet are different due to anti-aliasing artifacts, could you please attach the "expected", "actual" and "diff" images here?

This information will help with our experiments with fighting browser rendering non-determinism.

gselsidi commented 1 year ago

This is on docker on an M1 mac vs docker on Ubuntu (assuming Intel). The weird artifacts seem to happen 99% of the time when taking screenshots of individual elements, as opposed to the whole page.

Full-page screenshots are usually able to pass with an insanely strict 0.0001 maxDiffPixelRatio.
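
For clarity, a sketch of the two kinds of screenshots being compared here (selector, names, URL, and values are illustrative):

// element vs. full-page screenshot assertions, illustrative only
import { test, expect } from '@playwright/test';

test('element vs full page', async ({ page }) => {
  await page.goto('http://localhost:9002'); // placeholder URL
  // Element screenshot - where the artifacts are reported most often.
  await expect(page.locator('#chart')).toHaveScreenshot('chart.png', {
    maxDiffPixelRatio: 0.0001,
  });
  // Full-page screenshot - reported to pass even at this very strict ratio.
  await expect(page).toHaveScreenshot('full-page.png', {
    fullPage: true,
    maxDiffPixelRatio: 0.0001,
  });
});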

aslushnikov commented 1 year ago

@gselsidi thanks, added both of these to the testing dataset.


TL;DR: please try out the new experimental "ssim-cie94" image comparator.

The new experimental image comparator is available in the alpha version of Playwright. It was designed to combat browser rendering non-determinism. I'd appreciate it if you could give it a shot and see if it works for you.

To try it out, first install the alpha version of playwright that has the new comparator:

npm i @playwright/test@1.29.0-alpha-dec-9-2022

The new image comparator is called "ssim-cie94". You can enable it in playwright config:

// playwright.config.ts
import type { PlaywrightTestConfig } from '@playwright/test';
const config: PlaywrightTestConfig = {
  expect: {
    toMatchSnapshot: { _comparator: 'ssim-cie94', },
    toHaveScreenshot: { _comparator: 'ssim-cie94', },
  }, 
};
export default config;

or use it on a case-by-case basis:

test('should work', async ({ page }) => {
  await expect(page).toHaveScreenshot({
    _comparator: 'ssim-cie94',
  });
  expect(await page.screenshot()).toMatchSnapshot({
    _comparator: 'ssim-cie94',
  });
});

I'd appreciate both positive and negative feedback. If things don't work for you with this comparator, attaching image samples will drastically help to iterate & improve the comparator.

❀️ thanks

gselsidi commented 1 year ago

@aslushnikov Confirmed working! Went back down to a 0.0001 maxDiffPixelRatio and it's passing.

aslushnikov commented 1 year ago

@gselsidi I hope you can set maxDiffPixelRatio and maxDiffPixels to 0 now. Would you still get any mismatching snapshots? If yes, could you share them with us?

gselsidi commented 1 year ago

Didn’t try 0 haha, will try tomorrow and report back.

gselsidi commented 1 year ago

0 worked! Hey, I also have a feature request, if you can point me to a ticket; I remember seeing one for future improvements to screenshots.

Allow globally masking elements in the config and inside a test via test.use. Sometimes, while changes are being made to the site, it's useful to mask new elements etc. and update them in the future, instead of masking elements at each individual step.
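
For context, a sketch of the per-call masking that exists today and that this request would lift into the config (the selector and URL are hypothetical):

// current behaviour: the mask must be repeated at every screenshot assertion
import { test, expect } from '@playwright/test';

test('home page', async ({ page }) => {
  await page.goto('https://example.com'); // placeholder URL
  await expect(page).toHaveScreenshot('home.png', {
    mask: [page.locator('.work-in-progress')], // hypothetical locator to hide
  });
});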

gselsidi commented 1 year ago

@aslushnikov Just had 12 tests fail, not sure if the page was moving around or not. How can I send them to you? I would rather not upload them publicly.

gselsidi commented 1 year ago

So they fail at 0% with a pixel or two off; all pass back at 0.0001.

aslushnikov commented 1 year ago

@gselsidi could you please give me the actual/expected failed samples? I'd love to use them to tune the comparison function.

gselsidi commented 1 year ago

@aslushnikov

Yeah, how can I give them to you without uploading them here publicly? Or should I just upload them and delete them after you've saved them?

aslushnikov commented 1 year ago

@gselsidi feel free to send them over to anlushni[at]microsoft[dot]com

aslushnikov commented 1 year ago

Allow globally masking elements in the config and inside a test via test.use. Sometimes, while changes are being made to the site, it's useful to mask new elements etc. and update them in the future, instead of masking elements at each individual step.

@gselsidi Please file a separate feature request for this!

gselsidi commented 1 year ago

did you receive my email with all the files?

aslushnikov commented 1 year ago

@gselsidi just checked - yes I did! The latest version of the comparator passes all your samples though. Could you please check with npm i @playwright/test@1.29.0-alpha-dec-13-2022?

gselsidi commented 1 year ago

I was on the dec-09 build; this new dec-13 version passes. Also, what else can we achieve with this?

Would headed vs headless pass?

Non docker images vs docker images pass?

I haven't gotten around to testing that in depth yet - or would those scenarios still fail?

aslushnikov commented 1 year ago

@gselsidi thank you for checking!

Also what else can we achieve with this?

The new comparator is designed to handle browser rendering non-determinism, so as long as the page layout is exactly the same, the screenshots should pass now.

Do note, though, that for the layout to be the same, font metrics must be the same as well. Otherwise, line wrapping might happen in different places, boxes will have different sizes, and the images will in fact be different even to the human eye.

Would headed vs headless pass?

According to some exploration I did a while back, headless vs headed rendering differences come down to the following nuances:

  1. different font stacks: this should be fixed as of Dec 2022
  2. different anti-aliasing: this should be fixed by the new comparator
  3. showing vs hiding scrollbars: there's an open bug for this

We didn't aim to fix this, though, and we haven't experimented with it yet.

Non docker images vs docker images pass?

This won't work since browsers inside and outside docker will use different font stacks, resulting in different font metrics, resulting in different layout, and finally, yielding visually different screenshots.

pastelsky commented 1 year ago

I ran playwright on all CSS tests from the Web Platform Test Suite, with a threshold of 0.05 and maxDiffPixels of 0. The images were first generated locally on an M1 and the tests ran on GitHub Actions (Intel), and the results were quite good. In the ones I scanned, I didn't notice any specific failures due to rendering differences between CPU architectures (most of the failures were due to race conditions during capture).
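
A minimal sketch of the comparison settings described above, assuming they were set as project-wide toMatchSnapshot defaults:

// playwright.config.ts - assumed settings for the run described above
import type { PlaywrightTestConfig } from '@playwright/test';

const config: PlaywrightTestConfig = {
  expect: {
    toMatchSnapshot: {
      threshold: 0.05,  // per-pixel color tolerance
      maxDiffPixels: 0, // no pixel may differ beyond that tolerance
    },
  },
};
export default config;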

The run results are here (look at the Print URL step for the HTML reports):

Pixelmatch WPLAT SSIM WPLAT

With SSIM I did notice a few cases where the actual and expected looked exactly the same, but there was a large diff: https://pastelsky.github.io/wplat-playwright-tests/ssim/css-grid/index.html#?testId=2189cda207cfc3edfe7f-82682e3c967e85b9522c

Not sure what's happening here, but looks like the images being displayed aren't the ones actually used for comparison (some race condition?)

The number of failures with a threshold of 0 was a lot higher, but they were usually anti-aliasing issues on text and around the edges of shapes; I doubt setting it that low is realistic.

Caveat: Web platform tests test breadth, and render cases aren't always complex.

pastelsky commented 1 year ago

On a related note, I did notice that the font stack differs quite a bit between the official docker images offered by Playwright and Ubuntu, e.g.

(Non-latin) https://pastelsky.github.io/wplat-playwright-tests/ubuntu-change/css-content/index.html#?testId=8e873656b9d1ccb2286a-5929271583149bc0602e

(Monospace) https://pastelsky.github.io/wplat-playwright-tests/ubuntu-change/css-backgrounds/index.html#?testId=afc163fe806768d40fba-6f0e37f2580394646066

It'd be convenient if the official docker image set fonts to something that is available on most Linux machines. Currently, it seems to use an obscure font (WenQuanYi Zen Hei) for rendering monospace fonts.

github-daniel-mcdonald commented 1 year ago

Reposting what I saw here

Was this released in 1.30.0? I'm still not seeing the comparator option after pulling in the latest stable release

Has ssim-cie94 been removed or not yet released?

aslushnikov commented 1 year ago

@github-daniel-mcdonald the comparator is still under experiment / development.

JPtenBerge commented 1 year ago

TL;DR: please try out the new experimental "ssim-cie94" image comparator.

... The new image comparator is called "ssim-cie94". You can enable it in playwright config:

// playwright.config.ts
import type { PlaywrightTestConfig } from '@playwright/test';
const config: PlaywrightTestConfig = {
  expect: {
    toMatchSnapshot: { comparator: 'ssim-cie94', },
    toHaveScreenshot: { comparator: 'ssim-cie94', },
  }, 
};
export default config;

@aslushnikov Thanks for this tip. I've npm install-ed the current latest alpha (1.33.0-alpha-apr-10-2023), but am I correct in that TypeScript doesn't recognize this new comparator property in the config yet? This is of course an easy workaround:

expect: {
    toMatchSnapshot: { comparator: 'ssim-cie94' } as any,
    toHaveScreenshot: { comparator: 'ssim-cie94' } as any,
},

But as the property is not recognized, it does raise the question: is the new comparator being used at all, or is this just a property that isn't read by the Playwright runner? Is there any way I can confirm the new comparator is being used - anything in the logs or in the HTML report that mentions ssim-cie94?

aslushnikov commented 1 year ago

@JPtenBerge it is currently available as _comparator with underscore and is not officially released yet since we don't have enough feedback to release it.

JPtenBerge commented 1 year ago

@JPtenBerge it is currently available as _comparator with underscore and is not officially released yet since we don't have enough feedback to release it.

Ah, great, that's doing something!

To other readers: _comparator is also not recognized as a property in my TypeScript file, so I'm still using as any here, but when I enter a non-existent comparator like this:

expect: {
    toMatchSnapshot: { _comparator: 'loremipsum' } as any,
    toHaveScreenshot: { _comparator: 'loremipsum' } as any,
},

It clearly gives an error: Error: Configuration specifies unknown comparator "loremipsum", so it does appear to be used.

@aslushnikov: Thanks!

aslushnikov commented 1 year ago

@JPtenBerge I'd appreciate if you could share feedback on how it goes for you; things that worked and things that didn't work!

JPtenBerge commented 1 year ago

@aslushnikov Will do. Have some other work to focus on first, but I hope to experiment a bit further somewhere later on this week.

michalstrzelecki commented 1 year ago

Hi! I am also using the ssim-cie94 comparator and struggled with it for a while. I was updating Playwright from a beta build released in December to 1.36. It was very hard to discover that comparator had been renamed to _comparator.

aslushnikov commented 1 year ago

@michalstrzelecki I'm sorry for the inconvenience. This is an experimental feature that has never been released or documented.

I'd also encourage everybody to share their experience with the ssim-cie94 comparator. Your feedback is crucial when it comes to releasing / sunsetting this experiment.

nickofthyme commented 1 year ago

Finally had a chance to test this and found a pretty large reduction in failures (~83%), but it's not yet good enough for us to move away from single-arch testing/updating. See https://github.com/elastic/elastic-charts/pull/2110#issuecomment-1644416593.

From a cursory review, most failures with ssim-cie94 are small (<0.001) diffs related to fonts, lines, or SVGs.

@aslushnikov make sure to download the html reports before they expire.

aslushnikov commented 1 year ago

@nickofthyme thank you so much for trying out the comparator!

I'm going over the HTML report, and so far the majority of the failures actually look legitimate to me.

Two tests caught my eye: goal_stories.test.ts and stylings_stories.test.ts. These are failing, but could be fixed if we made the anti-aliasing detection algorithm slightly more lenient. Let us consider exposing a setting to control the anti-aliasing aggressiveness.