Which browser / AT combinations to test

zcorpan commented 4 years ago

Currently, the plan is to run tests in:

JAWS/Chrome
JAWS/Firefox
NVDA/Chrome
NVDA/Firefox
macOS VoiceOver/Chrome
macOS VoiceOver/Safari

From #111, there's a suggestion from Vispero to test JAWS / Chromium-based Edge instead of JAWS / Firefox.

@JAWS-test wrote

If I understood the Vispero meeting notes correctly, it was suggested that instead of testing with Firefox, it would be better to test with Edge. I don't think that's a good idea, because in my opinion the AT support of Chrome and Edge are hardly different. I.e. the test results for Chrome will mostly apply to Edge as well. Firefox, however, is a significantly different browser. In my tests with JAWS, I have not seen any differences between Chrome and Edge, but I have seen many differences between Chrome and Firefox. Therefore I think that a test with Firefox makes sense.

Otherwise we should use this table https://webaim.org/projects/screenreadersurvey8/#browsercombos as a guide

Originally posted by @JAWS-test in https://github.com/w3c/aria-at/issues/111#issuecomment-597430634

zcorpan commented 4 years ago

Feedback from Google's web accessibility team (#111) on this

Aaron: why support Edge when it’s based on chromium? Is the same exact code base. You will be testing the same thing twice if you do that. Aaron: I think the only differences you’ll find are based on versions.

cc @aleventhal

FWIW, I agree with this. It seems more useful to test two separate browser engines, than testing two Chromium-based browsers.

mfairchild365 commented 4 years ago

I'd also like to get confirmation from Apple that we should be testing VoiceOver+Chrome. According to the WebAIM survey, that combination is at 3% usage, and it is fairly widely known that combination does not currently work well.

zcorpan commented 4 years ago

I'd also like to get confirmation from Apple that we should be testing VoiceOver+Chrome.

@cookiecrook, can you comment on this?

robfentress commented 4 years ago

Which version of each screen reader will we be using? Also, I just want to confirm that we will be testing with the Chromium rather than the EdgeHTML build of the Edge browser. I think, generally, that we should also specify the version of each browser to be used. I apologize if this is listed elsewhere, but I'm just starting to engage with the group and am trying to get up to speed.

zcorpan commented 4 years ago

I think we want to test whatever is the latest "stable" version at the time of testing. We should make a decision to that effect and document it, though.

We also need to resolve this issue soon.

cookiecrook commented 4 years ago

I'd also like to get confirmation from Apple that we should be testing VoiceOver+Chrome. @cookiecrook, can you comment on this?

Apple doesn't have an official stance on which combinations get tested for the ARIA-AT project.

But personally, I would agree that testing WebKit/Safari+VoiceOver is higher priority. Whether you should test more depends on your time availability and testing capacity.

Sorry if that's not a very satisfying answer.

mcking65 commented 4 years ago

We discussed this in today's meeting.

We focused on browser version first asking what happens if the tester is using a managed browser that auto-updates to latest stable version. There was a fair amount of discussion of VMs and other ways of ensuring the precise browser build being used does not change.

Thinking a bit more, and going back to first principles and goals, I wonder if controlling the browser version so precisely is essential. It certainly adds substantial complexity, especially for testers.

An alternative to precise requirements for browser version could be requiring a minimum browser level and build type (e.g., stable, beta, nightly) within a given test cycle. For example, a test cycle could require Chrome Stable, version 80+. That way, if Chrome 81 becomes the stable version during the test cycle, it would be considered acceptable.

I think we should compare the ramifications of requiring a precise browser version to requiring a minimum browser version within a given test cycle. I think this is easiest to do by talking about some specific scenarios.

Scenario Conditions

A test cycle includes Chrome 80, JAWS 2020, NVDA 2019.3, and 4 test plans (checkbox, combobox, grid, and menubar).
There are 2 testers, T1 and T2. Each tester will perform all test runs -- 8 per tester (1 browser 2 screen readers 4 plans).

Scenario 1 - browser is updated mid-cycle without effecting test results

Day 1: T1 and T2 execute checkbox and combobox testing using Chrome 80 with JAWS and NVDA. All assertions pass.
Day 2 morning: Chrome updates to version 81.
Day 2: T1 and T2 execute grid and menubar testing using Chrome 81 with JAWS and NVDA. All assertions pass.

Are there any problems with this scenario? As we answer this, we need to keep in mind that the purpose of the project is to improve assistive technology support for ARIA. Identifying and resolving problems with browser dependencies is serendipitous. When there are meaningful differences in support among browsers, the resolution falls to the ARIA working group, not the ARIA-AT project. Whereas, resolving deleterious differences among assistive technologies falls within the scope of ARIA-at.

The end result of scenario 1 is that readers of the APG will see that the latest testing of checkbox with Chrom is with version 80 and the latest results of testing grid with Chrome are with version 81. I don't think this is a problem. Think down the road just a bit when there are 75 test plans run with 10 assistive technologies across 6 browsers. Until is possible to automate all regression testing of assistive technology support, it is likely such testing will be spread across several months. It may be split into several different test cycles. It would be impractical to promise that every ARIA pattern will be tested with the same combination of technologies.

Also consider that the support for ARIA in leading browsers is relatively mature and robust. Obviously, the internals are complex and regressions do occur. Generally, though, browser issues should be uncommon. The question is whether browser bugs or regressions have to interfere with ARIA-AT. To answer that, let's consider scenario 2.

Scenario 2 - Browser is updated mid-cycle and test results conflict

Day 1:
- T1 executes checkbox and combobox tests using Chrome 80 with JAWS and NVDA. All assertions pass.
- T2 executes combobox and grid tests using Chrome 80 with JAWS and NVDA. All assertions pass.
Day 2 morning: Chrome updates to version 81.
Day 2:
- T1 executes grid and menubar tests using Chrome 81 with JAWS and NVDA. Some grid assertions fail.
- T2 executes checkbox and menubar tests using Chrome 81 with JAWS and NVDA. All assertions pass.

Results are:

Checkbox passed with both chrome 80 and 81.
Combobox pass with chrome 80.
Grid passed with Chrom 80 and failed with Chrome 81.
Menubar passed with Chrome 81.

Are there any problems here? If so, how could they be handled?

For combobox and menubar, the situation is just like scenario 1. Each passed with T1 and T2 using the same AT/Browser combinations. Combobox will be reported with a browser version of 80 and menubar will be reported with a browser version of 81 -- no problem.

For checkbox, the results are identical, regardless of browser version. It would make sense to report the final data with the later browser version. The likelihood that T1 results would be different if re-run with Chrome 81 is too close to 0 for concern. If we wanted to be ultra conservative, T1 could be asked to re-run the test with Chrome 81, but I think that is entirely unnecessary.

The question is what to do about grid. The system will show that there are conflicting raw results. The ARIA-AT system will require the conflict to be resolved before the raw results can be marked draft.

There are several possible causes of the differences for grid between the T1 and T2 results:

One of the testers didn't understand one of the assertions. Both testers recorded the same screen reader output but one of them interpreted it as a fail and the other as a pass.
One of the testers simply made a mistake when recording results.
Chrome 81 caused one or both of the screen readers to generate output that is different from the output generated when using Chrome 80.

Causes 1 and 2 are already addressed by our process. So, the change in browser version is a non-issue. Once corrected, the result is the same as for checkbox; we could report the results using the later browser version.

The resolution to cause 3 can also be addressed by our process. When T1 ran the grid test and was notified of differing results, that would have triggered the process for resolving result differences. Once the testers are satisfied they both interpret results the same way, they would land on the different output as the cause. This would trigger T2 to re-run grid with Chrome 81, and the difference in results would be resolved and the results would be reported with 81 as the browser version. This could also result in one of the testers raising a Chromium bug.

Since there is a failed assertion, and it is known that the browser is the cause, I think we need a way of tracking that in the report so that assistive technologies are not seen as failing to support the expectation declared by the failed assertion. We currently do not have that in our model. However, that issue is separate from the browser version issue.

We can analyze more scenarios. But, my initial thought about using a minimum browser version requirement is that it is adequate and could reduce complexity.

There are a couple of AT/browser combinations that are tied to the operating system version -- VoiceOver with Safari on macOS and iOS. With these combinations, it is probably best that T1 and T2 sync up so that they run the same plans within the same time frame to reduce the probability of a forced upgrade getting the way of having matching results. Outside of that, it is easy for users to control the version of the AT. And in fact, many support having multiple versions on the system at the same time.

So, I think we should consider having test cycles specify minimum versions and that we should work to ensure consistency across a cycle. However, if a forced upgrade interferes with consistency, we can manage that at the level of individual test runs. That might lead to some test runs within a cycle being reported with a slightly newer version of browser, and in rare cases AT, but I don't think that will have a negative impact on the value of any of the results.

zcorpan commented 4 years ago

On the practical side, if we run tests manually on testers' own systems, I agree that it's problematic to require a specific version of software that normally auto-upgrades, but we can require a minimum version and a specific release channel ("stable" vs "beta", etc).

However, it's another possible cause of differences between results, which could take more time to resolve. The final reports would also be less clear about what they apply to, if different tests were run in different versions. I think it's OK, but not ideal in the longer term.

Recording the actual browser version while running the test is technically possible, and seems helpful to have when resolving differences in test results. I think we haven't planned for this in the runner, though. cc @s3ththompson @spectranaut @evmiguel

mfairchild365 commented 4 years ago

However, it's another possible cause of differences between results, which could take more time to resolve. The final reports would also be less clear about what they apply to, if different tests were run in different versions. I think it's OK, but not ideal in the longer term.

I'll echo that. I like the idea of specifying a minimum version, but per my previous testing, it is fairly common to find browser bugs that affect AT support.

I think it is okay to move forward with the minimum version strategy. We can revisit later if it turns out to be too much overhead or if it adds too much complexity to the project.

robfentress commented 4 years ago

I wonder AssistivLabs would be willing to donate access to their virtual environments for screen reader testing. I had an interesting conversation with Weston Thayer from that company a while back about setting up accessible RDP sessions a while back. I could reach out to him if folks think that might be a useful strategy.

zcorpan commented 4 years ago

Thanks, @robfentress. We have an email thread with @WestonThayer about this.

zcorpan commented 4 years ago

The latest stable versions right now:

Product	Version
JAWS	2020.2004.66 - April 2020
NVDA	2020.1
macOS	10.15.4
Windows 10	1909, OS build 18363.815
Chrome	83.0.4103.61
Firefox	76.0.1
Safari	13.1

robfentress commented 4 years ago

Feedback from Google's web accessibility team (#111) on this

Aaron: why support Edge when it’s based on chromium? Is the same exact code base. You will be testing the same thing twice if you do that.

Is this really true? I thought Microsoft Edge and Chrome used different accessibility APIs, even though they both run on Chromium. Wouldn't this be expected to cause different results in some circumstances?

zcorpan commented 4 years ago

cc @smhigley for the above question

zcorpan commented 4 years ago

Update: NVDA latest version is now 2020.1. We should use this for the pilot test.

The versions for browsers is minimal version -- auto-updates to something later is ok.

The version for ATs should be the exact version listed above.

w3c / aria-at