feat: better support for visual regression testing

aslushnikov commented 3 years ago

Playwright Test has a built-in toMatchSnapshot() method to power Visual Regression Testing (VRT).

However, VRT is still challenging due to variance between host environments. There are a number of measures we can take right away to drastically improve the experience in @playwright/test:

[ ] support for docker test fixture to run browsers inside docker image.
[ ] support for blur in matching snapshot to counteract antialiasing
[x] better UI for reviewing snapshot diffs

Interesting context:

migration from backstopjs to @playwright/test

florianbepunkt commented 3 years ago

I think https://github.com/americanexpress/jest-image-snapshot provides a nice suite of options for various VRT scenarios. Test scenarios vary widely, depending on the context (testing components, whole pages, text-heavy or not, etc).

Besides blurring, which helps a lot with antialiasing, it would be nice if multiple image comparison algorithms (e.g. SSM) were possible. Alternative image comparison algorithms could be left to userland, if they can be plugged into toMatchSnapshot via a common interface.
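To make the "common interface" idea concrete, here is a minimal TypeScript sketch. Every name in it (`RawImage`, `ComparisonResult`, `ImageComparator`, `exactComparator`, `compareWith`) is hypothetical and not part of Playwright's API; it only illustrates how a userland algorithm could be plugged in.

```typescript
// Hypothetical types: none of these names exist in @playwright/test.
interface RawImage {
  width: number;
  height: number;
  data: Uint8Array; // RGBA, 4 bytes per pixel
}

interface ComparisonResult {
  pass: boolean;
  message?: string;
}

// A comparator is just a function, so any algorithm can be plugged in.
type ImageComparator = (actual: RawImage, expected: RawImage) => ComparisonResult;

// Simplest possible comparator: byte-for-byte equality.
const exactComparator: ImageComparator = (actual, expected) => {
  if (actual.width !== expected.width || actual.height !== expected.height) {
    return { pass: false, message: "Image sizes differ" };
  }
  for (let i = 0; i < actual.data.length; i++) {
    if (actual.data[i] !== expected.data[i]) {
      return { pass: false, message: `First difference at byte ${i}` };
    }
  }
  return { pass: true };
};

// A matcher would receive the comparator as an option and delegate to it.
function compareWith(
  comparator: ImageComparator,
  actual: RawImage,
  expected: RawImage,
): ComparisonResult {
  return comparator(actual, expected);
}
```

A blur-based or SSIM-based comparator would then be another function of the same shape, passed in place of `exactComparator`.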

aslushnikov commented 3 years ago

Besides blurring, which helps a lot with antialiasing, it would be nice if multiple image comparison algorithms (e.g. SSM) were possible.

@florianbepunkt What's SSM? Is it structural similarity measurement (SSIM)?

florianbepunkt commented 3 years ago

@aslushnikov Yes, typo.

kevinmpowell commented 3 years ago

Solid integration with Storybook would be beneficial for the work I do. Chromatic and Percy do this really well.

Also a UI for reviewing the diffs would be great.

aslushnikov commented 3 years ago

Also a UI for reviewing the diffs would be great.

@kevinmpowell What's the one that you find most handy? Is it a "slider" diff like here:

kevinmpowell commented 3 years ago

I actually prefer the pixel highlighting (like Playwright already does), but with all the failing tests organized in a UI so I can see what failed without having to poke around three different images.

Also being able to A/B toggle the baseline and the test image is nice in some cases.

kevinmpowell commented 3 years ago

Slider is rarely useful for me. An onion-skin (transparency overlay) would be more useful.

AlexNetman commented 3 years ago

@aslushnikov Why is toMatchSnapshot() not available in the documentation? It cannot be found in the API list. And the article that was in 1.13 (https://playwright.dev/python/docs/1.13.0/test-snapshots) is no longer available for 1.14.

Thanks for thinking about Visual Regression Testing. That's important!

florianbepunkt commented 3 years ago

On a related note: it would be great if tests could be run cross-platform. Currently the OS platform name is baked into the snapshot filename, so our CI tests sometimes fail due to a name mismatch. https://github.com/microsoft/playwright/issues/7575
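A rough sketch of why the platform suffix causes the mismatch. It assumes a naming scheme of the form `name-project-platform.ext`; the helper `snapshotFileName` is hypothetical and is not Playwright's own code.

```typescript
// Hypothetical helper: illustrates the naming scheme, not Playwright's code.
function snapshotFileName(
  name: string, // e.g. "landing.png"
  projectName: string, // e.g. "chromium"
  platform: string | null, // e.g. "linux", "darwin", "win32"; null to omit
): string {
  const dot = name.lastIndexOf(".");
  const base = dot === -1 ? name : name.slice(0, dot);
  const ext = dot === -1 ? "" : name.slice(dot);
  const parts = [base, projectName];
  if (platform !== null) parts.push(platform);
  return parts.join("-") + ext;
}

// A baseline recorded on macOS is not the file a Linux CI run looks for.
const recordedOnMac = snapshotFileName("landing.png", "chromium", "darwin");
const expectedOnCi = snapshotFileName("landing.png", "chromium", "linux");
console.log(recordedOnMac === expectedOnCi);
```

Dropping the platform part (passing `null` here) gives one shared name, at the cost of comparing renderings from different operating systems against the same baseline.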

lo1tuma commented 3 years ago

support for blur in matching snapshot to counteract antialiasing

It would be nice if we could choose whether such image filters are applied before the snapshot is saved or only when doing the comparison. I would prefer the first option, as it keeps the diff small when creating new snapshots, even for images that change randomly / are flaky.
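For reference, a blur filter of this kind can be very small. The sketch below is a plain 3x3 box blur over a grayscale buffer, written only for illustration; `boxBlur` is a hypothetical helper, not an existing Playwright option. The same function could be applied either before saving the baseline or to both images at comparison time.

```typescript
// 3x3 box blur over a grayscale image stored row by row.
// Edge pixels average only the neighbours that lie inside the image.
function boxBlur(pixels: Uint8Array, width: number, height: number): Uint8Array {
  const out = new Uint8Array(pixels.length);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      let sum = 0;
      let count = 0;
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++) {
          const nx = x + dx;
          const ny = y + dy;
          if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
          sum += pixels[ny * width + nx];
          count++;
        }
      }
      out[y * width + x] = Math.round(sum / count);
    }
  }
  return out;
}
```

Blurring spreads a one-pixel antialiasing difference over its neighbours, which lowers the per-pixel difference but also softens real regressions.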

ts-23 commented 3 years ago

Please allow an auto-generated filename when toMatchSnapshot has no name input, similar to how toMatchSnapshot works in Jest.

[ ] Auto-gen filename when name not specified for toMatchSnapshot
[ ] Set default toMatchSnapshot file extension in playwright.config.ts

E.g.

// foo.spec.ts toMatchSnapshot() => foo.spec.ts.snap (default extension customizable in playwright.config.ts)

When you have a lot of screenshot assertions in one file, we can avoid writing a lot of filename inputs:

sergioariveros commented 3 years ago

Thanks for thinking about this. The blur feature would help us: we had something similar with Puppeteer that helped us do comparisons on animated pages. In addition, it would be really useful to be able to ignore specific parts of the screen, especially those parts with more dynamic data (videos/images).

Doug-Bowen commented 3 years ago

Blur would help us greatly. Also, the slider view would be incredible as well.

z0n commented 3 years ago

We're also really interested in these improvements. We had to disable visual tests for now because they are randomly failing because a few pixels are off, even when increasing the threshold. Blur should help here hopefully.

damaon commented 2 years ago

I suggest solving the biggest pain point, which is how to store this stuff in a git repo so it doesn't blow up in size (i.e. to store only the last snapshot). Git LFS kinda works but it's painful. Maybe something else would work better? For reference: https://github.com/americanexpress/jest-image-snapshot/issues/92

It would be great if these snapshot dirs were automatically marked in git to only store the last revision.

z0n commented 2 years ago

We're using Git LFS, what's your issue with it? Once we had it set up for everyone (we're using Mac, Windows and Linux), it worked fine. We're storing all images in the repo using Git LFS (*.png) so there's no work involved when adding snapshots to new tests either.

The only issue I have is comparing the image diffs in VS Code when committing new images as the old image is not shown in the diff view. The diff is working fine in the GitLab merge request view though so that's not a big issue.

z0n commented 2 years ago

Hi @aslushnikov! This was pushed to the next version a few times now, could you please add this to the roadmap (if there is one?) so we can have a rough estimate on when this is coming?

I need to implement some visual tests soon™️ and it would be great if I wouldn't need another tool for that. I need to know if there will be improvements to this in 2 months or 2 years though.

aslushnikov commented 2 years ago

Hey @z0n, there's no roadmap. My guesstimate is that we'll have all the pieces together by summer 2022; the priority of VRT keeps rising.

damaon commented 2 years ago

We're using Git LFS, what's your issue with it?

It works for me, but I wanted to use it at one company that had poor infra, and it didn't work well with Jenkins, for example, so I couldn't easily bypass it.

Also, Git LFS behaved oddly with rebases, and people had a lot of trouble with it when jumping between branches, if I remember correctly.

It works, but the experience is suboptimal.

aslushnikov commented 2 years ago

Hey folks! Here's an update on screenshots and blurring.

I see lots of you requested a "blur" option to pre-blur images before comparison. While I imagine it can help with certain issues, it's a very big hammer, so I wonder if we can do a more delicate job.

I'd appreciate if you could share screenshots (actual / expected) that fail for you with regular diff, but pass with preblur. This way we'll have some real-world data to play with!

aslushnikov commented 2 years ago

Many folks mentioned that they want pre-blur to avoid snapshot failures due to a few pixel differences.

New options have landed on tip-of-tree: pixelCount and pixelRatio. These are supposed to help in these cases. Please give them a try and let me know if you still need preblur!

$ npm i @playwright/test@next
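A rough sketch of the kind of decision such limits imply, assuming the semantics "fail if any supplied limit is exceeded". The function `withinLimits` and the `DiffLimits` type are illustrative only and are not Playwright's implementation.

```typescript
interface DiffLimits {
  pixelCount?: number; // absolute number of differing pixels allowed
  pixelRatio?: number; // allowed fraction of differing pixels, between 0 and 1
}

function withinLimits(
  diffPixels: number,
  width: number,
  height: number,
  limits: DiffLimits,
): boolean {
  const total = width * height;
  if (limits.pixelCount !== undefined && diffPixels > limits.pixelCount) return false;
  if (limits.pixelRatio !== undefined && diffPixels / total > limits.pixelRatio) return false;
  // With no limits supplied, any differing pixel is a failure.
  if (limits.pixelCount === undefined && limits.pixelRatio === undefined) {
    return diffPixels === 0;
  }
  return true;
}
```

For a 100x100 image with 50 differing pixels, a `pixelCount` of 10 would fail while a `pixelRatio` of 0.01 would pass, since 50 / 10000 = 0.005.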

shamrin commented 2 years ago

Thank you for improving visual regression features, @aslushnikov!

You may find the implementation experience of gemini-testing project useful. Some pointers:

Several years ago we had great success using Gemini for visual regression testing. We used Gemini's built-in web UI (either Gemini GUI or html-reporter, I don't remember which) to choose changed images worth committing to Git. And during PR review we used the built-in GitHub image diff. We had very few false positives in image diffing. Unfortunately, the false-positive rate was not zero, mostly due to subtle browser timing/random fluctuations.

Gemini is deprecated now, replaced by Hermione, from the same authors. I haven't used it, but it seems to use the same approach for image diffing. The core is in looks-same and gemini-core libraries.

aslushnikov commented 2 years ago

Thanks @shamrin for the pointers! I'll read your links in more detail later to get a better understanding, but so far we already do all of these:

instead of using CIEDE2000, pixelmatch uses color difference in YIQ color space
pixelmatch uses the same algorithm based on the same whitepaper to ignore anti-aliasing
we hide text input caret on the browser level before making a screenshot
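To illustrate the first point, here is a sketch of a YIQ-based colour difference. The conversion coefficients and the 35215 constant are written from memory of pixelmatch's source, so treat the exact numbers as assumptions; the names `colorDelta` and `pixelsDiffer` are used here for illustration only.

```typescript
type RGB = [number, number, number];

// RGB -> YIQ conversion (coefficients recalled from pixelmatch; an assumption).
const toY = ([r, g, b]: RGB): number => r * 0.29889531 + g * 0.58662247 + b * 0.11421304;
const toI = ([r, g, b]: RGB): number => r * 0.59597799 - g * 0.2741761 - b * 0.32180189;
const toQ = ([r, g, b]: RGB): number => r * 0.21147017 - g * 0.52261711 + b * 0.31114694;

// Weighted squared distance between two colours in YIQ space.
function colorDelta(a: RGB, b: RGB): number {
  const y = toY(a) - toY(b);
  const i = toI(a) - toI(b);
  const q = toQ(a) - toQ(b);
  return 0.5053 * y * y + 0.299 * i * i + 0.1957 * q * q;
}

// threshold is in the range 0..1; 35215 is taken to be the largest possible delta.
function pixelsDiffer(a: RGB, b: RGB, threshold: number): boolean {
  return colorDelta(a, b) > 35215 * threshold * threshold;
}
```

Because the luma (Y) term carries the largest weight, brightness changes count for more than small hue shifts, which is closer to how differences are perceived than a plain RGB distance.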

ayroblu commented 2 years ago

Hey @aslushnikov! I updated @florianbepunkt's original port of jest-image-snapshot to the Playwright test runner here: https://github.com/ayroblu/playwright-image-snapshot. Basically it looks VERY similar to Playwright's existing golden.ts compare API, as you can see in matcher.ts.

The main benefit is that it uses SSIM. I also updated how the diff is done so it's similar to pixelmatch's greyscale background, which is super useful.

    expect(await page.screenshot()).toMatchImageSnapshot(test.info(), [
      name,
      "1-initial-load.png",
    ]);

Would love to have this SSIM option ported to playwright test as TestInfo is not exposed implicitly which makes the api usage a bit ugly. Made a PR #12258. I'm also hoping not to need to supply a file name by default, seems unnecessary.
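For readers unfamiliar with SSIM, here is a simplified sketch of the measure. It is not the implementation used in the linked repository: it computes a single global SSIM value over two grayscale buffers, whereas practical implementations average the measure over small sliding windows. The function name `globalSsim` is hypothetical.

```typescript
// Global SSIM over two equally sized grayscale images (values 0..255).
// One window covering the whole image keeps the sketch short.
function globalSsim(x: Uint8Array, y: Uint8Array): number {
  if (x.length !== y.length || x.length === 0) {
    throw new Error("Images must be non-empty and the same size");
  }
  const n = x.length;
  let meanX = 0;
  let meanY = 0;
  for (let i = 0; i < n; i++) {
    meanX += x[i];
    meanY += y[i];
  }
  meanX /= n;
  meanY /= n;

  let varX = 0;
  let varY = 0;
  let cov = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    varX += dx * dx;
    varY += dy * dy;
    cov += dx * dy;
  }
  varX /= n;
  varY /= n;
  cov /= n;

  // Stabilising constants from the SSIM definition, for 8-bit images.
  const c1 = (0.01 * 255) ** 2;
  const c2 = (0.03 * 255) ** 2;
  return (
    ((2 * meanX * meanY + c1) * (2 * cov + c2)) /
    ((meanX * meanX + meanY * meanY + c1) * (varX + varY + c2))
  );
}
```

Identical images score 1, and the score drops as structure diverges, so a test would pass when the score stays above a chosen cutoff rather than when a pixel count stays below one.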

aslushnikov commented 2 years ago

For the record: docker integration depends on global fixtures, so moving them forward.

bezyakina commented 2 years ago

Hi, @aslushnikov! Is it possible that in the next releases you will implement "slider" diff in the html report? There are cases where the slider is more convenient than the pixel highlighting method, especially when the length of the expected and actual screenshots differs.

Would it be possible to implement one more tab in the report, by analogy with Diff/Actual/Expected?

Or you could display all 3 states on one tab in the report (as it looks in the attachments of this comment).

aslushnikov commented 2 years ago

@bezyakina not sure for 1.21 (we're about to finalize this version), but still possible! It all depends on how much our users need it.

So could you please file this separately to our bug tracker as a feature request? The more likes / upvotes it will collect, the higher priority will be for us, and the faster we'll implement it!

bezyakina commented 2 years ago

@bezyakina not sure for 1.21 (we're about to finalize this version), but still possible! It all depends on how much our users need it.

So could you please file this separately to our bug tracker as a feature request? The more likes / upvotes it will collect, the higher priority will be for us, and the faster we'll implement it!

Thanks for your reply, I created a new feature request: https://github.com/microsoft/playwright/issues/13176

AllanMedeiros commented 2 years ago

Hey there! Not sure if it would be better to open another feature request, but https://github.com/jz-jess/RobotEyes has an interesting feature to ignore an array of UI elements in the image comparison: these elements are blurred, which helps to achieve a higher-fidelity (95%+) comparison. RobotEyes uses ImageMagick in the background, which is a really powerful tool for image comparison. The idea is to ignore data elements on the screen before the comparison is done. Taking that into account would require setting a different tolerance for each web page in the application, as each one can have a different number of UI elements with data. I've seen comments about blur, but it doesn't seem to be related to this... Thank you.

aslushnikov commented 2 years ago

@AllanMedeiros you can use the mask api to mask elements on the screenshot. This should help!
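In Playwright the option is passed as `mask: [locator]` to `page.screenshot()`. The underlying idea can be sketched without a browser: paint the same solid colour over the dynamic regions so they can never differ between runs. `applyMask` and `Rect` below are hypothetical helpers, and the pink default colour and integer rectangle coordinates are assumptions of this sketch.

```typescript
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Paints each rectangle with a solid colour in an RGBA buffer, in place,
// so that dynamic regions are identical in every screenshot.
function applyMask(
  data: Uint8Array,
  imageWidth: number,
  imageHeight: number,
  rects: Rect[],
  color: [number, number, number] = [255, 0, 255],
): void {
  for (const rect of rects) {
    // Clamp the rectangle to the image bounds.
    const x0 = Math.max(0, rect.x);
    const y0 = Math.max(0, rect.y);
    const x1 = Math.min(imageWidth, rect.x + rect.width);
    const y1 = Math.min(imageHeight, rect.y + rect.height);
    for (let y = y0; y < y1; y++) {
      for (let x = x0; x < x1; x++) {
        const offset = (y * imageWidth + x) * 4;
        data[offset] = color[0];
        data[offset + 1] = color[1];
        data[offset + 2] = color[2];
        data[offset + 3] = 255;
      }
    }
  }
}
```

Unlike blurring the ignored elements, masking removes them from the comparison entirely, so no per-page tolerance is needed for the masked areas.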

leotg130 commented 2 years ago

Would it be possible to do visual diffs even when the snapshot sizes differ (Sizes differ; expected image...)? Right now it seems that Playwright VRT refuses to do visual diffs for such snapshots.

Storybook addon storyshots at least has this feature, though it might come from pixelmatch, not sure. It's very convenient, as often there's white space changes and you can easily see that some padding has appeared somewhere.
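One possible way to diff images of different sizes, sketched under the assumption that any pixel present in only one image counts as different. This is not how Playwright or storyshots behave; `diffAllowingSizeMismatch` and `Gray` are hypothetical names.

```typescript
interface Gray {
  width: number;
  height: number;
  data: Uint8Array; // one byte per pixel, row by row
}

// Counts differing pixels over the union of both image areas.
function diffAllowingSizeMismatch(a: Gray, b: Gray): number {
  const width = Math.max(a.width, b.width);
  const height = Math.max(a.height, b.height);
  let diff = 0;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const inA = x < a.width && y < a.height;
      const inB = x < b.width && y < b.height;
      if (!inA || !inB) {
        diff++; // the pixel exists in only one of the images
        continue;
      }
      if (a.data[y * a.width + x] !== b.data[y * b.width + x]) diff++;
    }
  }
  return diff;
}
```

A diff image built the same way would show added padding as a solid band along the edge where the sizes diverge.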

tovab commented 2 years ago

Hi @aslushnikov, can you add an option for ignoring diffs where there is just a slight shift in the location of pixels? I don't want to use threshold or maxDiffPixels for this because those options would cause false positives, i.e. they would cause the tests to ignore actual regressions. Here is an example of a diff that I would like to ignore:

Actual image: reactions-selector-actual
Expected image: reactions-selector-expected
The diff: reactions-selector-diff

Thank you very much.

KirProkopchik commented 2 years ago

Hi @aslushnikov, currently Playwright cannot compare the reference screenshot with the actual one if their resolutions differ. That is not a big problem in itself. But when I try to update such a screenshot, I also get the resolution mismatch error: Error: Image sizes do not match. I have to find the screenshot, delete it, and then run the update again. This is not very convenient, for example after updating the browser version, when many screenshots may render differently.

emilio-martinez commented 2 years ago

@KirProkopchik different browsers and machines will produce different results. Have you tried running e2e tests within a docker image? We've been doing this for a few months and the results have been quite stable

sean-perkins commented 2 years ago

I agree that adding support for blurring would be extremely helpful. We are making use of maxDiffPixelRatio, but it is very much a guessing game as to what value is appropriate to counter antialiasing/rendering differences vs. actual issues.

For example, I am currently having to set the value to 0.0005 to catch issues like this one (a real issue):

While wanting to skip/ignore false positives like this:

I worry that with just an arbitrary ratio of pixel differences, we will be adjusting that value over time, as certain screenshots will be more affected by antialiasing and rendering imperfections.
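Some arithmetic shows why a single ratio is hard to tune. The sketch assumes a 1280x720 page screenshot and a 200x50 element screenshot; both sizes are invented for illustration, and `pixelBudget` is a hypothetical helper.

```typescript
// Number of differing pixels tolerated for a given image size and ratio.
function pixelBudget(width: number, height: number, ratio: number): number {
  return Math.round(width * height * ratio);
}

const fullPage = pixelBudget(1280, 720, 0.0005); // 921,600 pixels in total
const smallElement = pixelBudget(200, 50, 0.0005); // 10,000 pixels in total
console.log({ fullPage, smallElement }); // { fullPage: 461, smallElement: 5 }
```

The same 0.0005 ratio tolerates hundreds of pixels on a full page but only a handful on a small element, so an antialiasing difference along a single edge can exhaust the budget for an element screenshot.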

We use Playwright for Ionic Framework, and test against Chrome/Firefox/Safari to verify correct Android and iOS design implementations for the web components.

kenkku commented 2 years ago

We have an interesting problem that's probably a common case when doing visual regression testing: we're taking a screenshot of an element (selected with a locator) that has a non-integer height. This results in an interesting problem where (at least when the device pixel ratio is 1) depending on what is on the rest of the page, sometimes the screenshot has a different height, or includes one extra row of the background color outside the element. I think this could even happen for elements that have an integer height, but that are positioned around elements that don't.

I think the most useful case for visual regression testing is for individual elements, and this does make it very hard to test those if you have any sub-pixel heights (or widths) on the page.

tovab commented 2 years ago

We have an interesting problem that's probably a common case when doing visual regression testing: we're taking a screenshot of an element (selected with a locator) .... I think the most useful case for visual regression testing is for individual elements, and this does make it very hard to test those if you have any sub-pixel heights (or widths) on the page.

I experienced this issue and my workaround was to take an image of the entire page and then crop the image to select the desired element before comparing.

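    const dimensions = await element.boundingBox();
    // boundingBox() returns null when the element is not visible.
    if (!dimensions) throw new Error('Element has no bounding box');
    expect(await page.screenshot({
      type: 'jpeg',
      clip: dimensions,
    })).toMatchSnapshot(`${name}.jpeg`);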
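A possible refinement of this workaround for the sub-pixel case described above: expand the bounding box outwards to whole pixels before using it as the clip. `snapToPixels` is a hypothetical helper, and whether it removes the flakiness in practice is untested.

```typescript
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Expands a fractional bounding box outwards to whole-pixel edges, so the
// clip size does not depend on how sub-pixel edges happen to be rounded.
function snapToPixels(box: Box): Box {
  const left = Math.floor(box.x);
  const top = Math.floor(box.y);
  const right = Math.ceil(box.x + box.width);
  const bottom = Math.ceil(box.y + box.height);
  return { x: left, y: top, width: right - left, height: bottom - top };
}

// A box of 100.2 x 30.5 px at (10.4, 20.6) becomes 101 x 32 px at (10, 20).
console.log(snapToPixels({ x: 10.4, y: 20.6, width: 100.2, height: 30.5 }));
```

The snapped box always contains the whole element, at the cost of including up to one extra row or column of background on each side.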

nmiddendorff commented 2 years ago

With the release of 1.21, Playwright now has the "Slider Diff View" which is great for comparing visual changes on the .toMatchSnapshot() assertion.

I'm curious how others plan to incorporate this into their software development workflow! It seems like the biggest piece missing from Playwright now is the ability to approve changes outside of running the app locally. (This is where backstopjs is still useful). Have others come up with a way to create some type of workflow on the PR that allows teams to easily review and approve changes?

To be clear, I'm not necessarily saying that this should be part of Playwright.

anduingaiden commented 2 years ago

Is there any way to take one screenshot of the entire page? We have many cases of "long pages" and for now, we are instructing our test script to, multiple times, scroll down the page and take a snapshot until we reach the end of the page. Then, we compare all the screenshots.

Is there an easier way to do it? If not, do you intend to add this kind of option?

Thanks in advance!

ayroblu commented 2 years ago

@anduingaiden Please see: https://playwright.dev/docs/screenshots#full-page-screenshots

anduingaiden commented 2 years ago

@anduingaiden Please see: https://playwright.dev/docs/screenshots#full-page-screenshots

It's clear that I did not explore this part of the documentation yet. Sorry about that and thank you very much.

pastelsky commented 2 years ago

Another example of where text rendering causes flake and blurring could've helped —

mrmckeb commented 2 years ago

Hi @pastelsky, are those screenshots taken on different operating systems? In a like-for-like environment, you shouldn't see flake from text rendering differences.

Different operating systems can have different default fonts, and (which appears to be the case here) different text rendering approaches.

aslushnikov commented 2 years ago

Hey folks! if you have examples of PNG screenshots that are taken on the same browser and same OS yet are different due to anti-aliasing issues, could you please attach the "expected", "actual" and "diff" images here?

This information will help with our experiments with fighting browser rendering non-determinism.

EchelonFour commented 2 years ago

I'm curious how others plan to incorporate this into their software development workflow! It seems like the biggest piece missing from Playwright now is the ability to approve changes outside of running the app locally. (This is where backstopjs is still useful). Have others come up with a way to create some type of workflow on the PR that allows teams to easily review and approve changes?

I have come up with a system where the devs can comment on the PR to run a CI task that reruns the tests with an --update-snapshots flag and pushes any changes to the PR branch. But this requires rerunning the entire test suite, which is pretty slow considering the test report from the last run already has the accepted new snapshots in it.

It would be nice to have some kind of "accept snapshots" command we could run that takes an output from a test run where the snapshots comparison failed and it updates them from that. Even if it needs some kind of special report format, that would speed up this part of the workflow considerably.

aslushnikov commented 1 year ago

@gselsidi thank you for the sample!

gselsidi commented 1 year ago

I'll try to get some more as they come along, but I noticed the above occurs when taking snapshots of individual elements as opposed to the whole page. For the whole page I'm able to use a .0001 pixel ratio.

gselsidi commented 1 year ago

Linking this here in case it applies:

https://github.com/microsoft/playwright/issues/19417

nikicat commented 1 year ago

I had different screenshots with antialiased fonts between my ArchLinux laptop and Ubuntu 20.04 in Docker (it's used by default by GitHub Actions). The following Chromium flags helped me to get identical screenshots:

--font-render-hinting=none
--disable-skia-runtime-opts
--disable-font-subpixel-positioning
--disable-lcd-text
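One way these flags could be wired into a Playwright Test setup is through `use.launchOptions.args`. The sketch below uses a structural stand-in type (`ConfigSketch`, hypothetical) so that it compiles on its own; in a real project the object would be the default export of playwright.config.ts, and the flags would probably belong under the Chromium project only.

```typescript
// Structural stand-in for the relevant part of a Playwright config.
interface ConfigSketch {
  use: { launchOptions: { args: string[] } };
}

const config: ConfigSketch = {
  use: {
    launchOptions: {
      // Flags taken verbatim from the comment above.
      args: [
        "--font-render-hinting=none",
        "--disable-skia-runtime-opts",
        "--disable-font-subpixel-positioning",
        "--disable-lcd-text",
      ],
    },
  },
};
```

Baselines recorded before adding the flags would need to be regenerated, since the flags change how text is rendered.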

phungleson commented 1 year ago

We have similar issues with WebKit on Mac around emojis. Is there any further information we can provide to make debugging/fixing the issue easier?

It looks like mask is not available to configure at PlaywrightTestConfig level?

microsoft / playwright

feat: better support for visual regression testing #8161
