Design a simple way for testers to contribute

jfhector commented 5 years ago

Here are some thoughts and a small design exploration from me. This is based on the assertion model that @mcking65 shared in a separate issue, and the conversation that we had reviewing it.

I do agree that we need to make things simpler and easier to understand for testers

@mfairchild365 This week I ran the test for aria-details on a11ysupport.io to familiarise myself with the data model and interface.

I struggled to figure out exactly how to perform the test. In particular:

After reading the assertions, I wasn't sure exactly how I should perform the test, and what constituted success or failure.
I was confused when prompted to select what command I used to perform the test, because I expected to be told what commands I should use. I wasn't sure which option was the right one for me to select.
I didn't easily understand the options available to me when inputting the position of the virtual cursor before and after the command (eg. 'target', 'start of target', ...). I expected to be told, and felt concerned that I might not perform the right test in the right way.

From that perspective, I think that the conversation we're having now about how to simplify things for users is very helpful and needed.

I also realise that the test pages and instructions on a11ysupport.io are generated efficently at scale, and that clearer and more granular instructions might not be viable.

My thoughts on the simplifications that @mcking65 has been exploring

Here's the link again to the assertion model that @mcking65 shared in a separate issue. My comments here focus not so much on the data model itself, but on how testing instructions can be presented.

What I think works

1. I like that Matt's table gives more specific instructions about test methods.

I think that it's important to tell testers what they need to do and how, rather than asking them what they did.

2. I like the idea of describing expected results in non-technical terms as much as possible.

I think that assertions should be written to help people who are not fully confident in their knowledge of accessibility.

3. I like the idea of organising testing instructions and results in a table.

Seeing the different tests in different rows, with instructions on the left and results on the right, felt very intuitive to me. It instantly gave me an clear mental model of how to use the interface (ie. the table).

4. I like the idea of wrapping different 'test methods' into one single test.

I like how a tester is expected to test how a checkbox is announced, when reached in different ways, in one go. I imagine that following this idea will help us simplify the interface, without making things harder for users. (Caveat: I don't fully yet fully understand the downsides of such a simplification for our data).

What I think we can simplify further

Here are a few suggestions:

1. Remove / hide columns that are useful to us but not to testers

The 'Importance' and 'Tested attributes' columns are important parts of our data model, but I imagine that we could hide this information when people are performing tests, so that the interface and what is required of them is easier to understand at a glance.

2. Organise columns in an order that makes the test instructions easy to read and understand

The 'Importance' column is currently placed between 'Test name' and 'Assertion'. I believe that we can make test instructions clearer by purposefuly ordering columns in a way that reads naturally.

I personally like using the Gherkin syntax. I think that it makes test instructions and assertions much easier to understand and nicer to read.

'GIVEN' is for precondition(s)
'WHEN' is the interaction that the user/tester performs
'THEN' is the result of that interaction (ie. how the interface reacts).

3. Merge some cells so that the structure of the table is obvious at a glance to sighted users.

E.g. The 'Screen Reader mode' cells could be merged so that it's immediately obvious that the top half of the table is for 'Reading mode', and the bottom half for 'Interaction mode'

Caveat: I don't know whether merging cells makes things harders to understand or operate for screen reader users.

A little exploration I did

I put together a simplified table in HTML following this direction.

The table is not interactive, and was done very quicly with barely any styling. (I initially had this in a Excel file but I wanted to have better screen reader support for the double headers and merged cells).

I imagine that testers will find a table like this easier to understand at a glance, and more inviting. Please let me know your thoughts.

I am imagining that our interface for contributing test results could be just a table for test case. (Of course this table would need to be dynamic based on what browser and assistive technology users have selected).

Please tell me what I'm not seeing / considering, so that we can make things better together.

What I'd like to explore next

Splitting the cells in the 'Then' column so that assertions are more granular.

My intuition is that we might be able to make things easier both for testers and for us by splitting the broad assertions in Matt's table into more granular ones. (eg. "role", "name", "state" into separate rows).

If we keep cells under "Given that" and "When" merged as is, I believe that we might be able to afford the extra granularity without making things look too complicated. (I'll give it a try).

mfairchild365 commented 5 years ago

This is incredibly good feedback and I think the HTML table that you made does a great job of illustrating how this could work. I'm also a fan of the Gherkin format, and I like how you structured the table to follow the same format.

I agree that running tests on a11ysupport.io is difficult. I put more time into rendering results than collecting results, which shows. You mentioned, "After reading the assertions, I wasn't sure exactly how I should perform the test, and what constituted success or failure." Can you explain this a little further and maybe give an example of why you were unsure?

jfhector commented 5 years ago

You mentioned, "After reading the assertions, I wasn't sure exactly how I should perform the test, and what constituted success or failure." Can you explain this a little further and maybe give an example of why you were unsure?

@mfairchild365 For example, this happened to me when I read the assertion "The screen reader MUST convey the presence of aria-details". I had never heard a screen reader render aria-details before. So I wasn't sure what I should be listening out for in this test. Also, on top of my head, I hesitated not knowing what element should be in focus for aria-details to be announced.

Now, looking back, the answers to my questions seem pretty obvious to me. But in the moment, these things contributed to me feeling disoriented, and unsure how exactly I should perform the test, and thinking I might not be doing it the right way.

So I'm imagining that it'd also happen to people who are less familiar with ARIA. (My understanding is that we're not expecting all testers to be fully knowledgeable and confident around ARIA).

jfhector commented 5 years ago

New, slightly different test table prototype

I tried a slightly different version of the test page I shared last week.

Here's the new, alternative test page.

In this new test table, I've unbundled assertions into separate rows. For example, there are now separate rows for things like:

The checkbox's role is announced
The checkbox's name is announced
The checkbox's state is announced

I did this to help us decide whether granular assertions should be be in separate rows, or whether we should bundle several of them into one row.

Links to the two version of the table I've shared so far

Here are the two versions again, in one place:

Merged cells: is the experience good for screen reader users?

@mcking65 These table prototypes include cells that span several rows. Concretely, I have merged cells that contain the same information across adjacent rows.

My goal was to make the table easier to understand and less daunting. But I'm not sure whether a table using vertically merged cells is easy enough to understand for screen reader users. If you think that it's not, please let me know and I'll try something different.

jfhector commented 5 years ago

Here's an updated list of all the prototypes I've shared so far:

Test page (July 3 prototype).
Test page (July 10 prototype). This one splits things like "name, role and value" into three separate assertions. We probably won't do this.
Test page (July 31 prototype). This one splits test instructions and test results into two separate tables. This is easier to navigate, and means that we can use simple, more usable tables.

spectranaut commented 5 years ago

I really appreciate you doing all this work, @jfhector, it really brings alive the work that @mcking65 has done in issue #5. But I have an initial thoughts I'd like to try to outline.

If I were performing these tests, as someone with little AT experience, I would prefer instructions that are extremely step-by-step (so I don't think i might be missing a hidden step do to my lack of close familiarity with ATs). I would also want to feel certain that I thoroughly performed a test. This leads me to move back towards a higher granularity for test definition and test result recording.

For example, looking at the first tests in @jfhector's mock up, instead of supported/not supported/partially supported, I'd prefer an explicit list and checkboxes for recording results for each navigation/information pair (see the explicit example written out at the end of this comment).

My intuition is that recording this granularity wouldn't add much time to a testers workflow because check boxes are quick and easy to check. It also might make testing easier -- if I were simply testing the widget on my own with this many kinds of navigation and items to listen too, I might forgot which key stroke didn't trigger which particular information when I'm writing up results in the "notes" field. I might have to go through them all over again. I might even keep my own list somewhere else while testing to make sure I was being thorough.

Example for first test of checkbox:

Navigation via Insert+Tab
- [ ] Checkbox is announced
- [ ] Checkbox’s name is announced
- [ ] Checkbox’s state as announced
Navigation via Insert+Up
- [x] Checkbox is announced
- [ ] Checkbox’s name is announced
- [ ] Checkbox’s state as announced
Navigation via X quick key
- [ ] Checkbox is announced
- [ ] Checkbox’s name is announced
- [ ] Checkbox’s state as announced
Navigation via Tab / Shift+Tab
- [ ] Checkbox is announced
- [ ] Checkbox’s name is announced
- [ ] Checkbox’s state as announced
Navigation via Up/Down
- [ ] Checkbox is announced
- [ ] Checkbox’s name is announced
- [ ] Checkbox’s state as announced
Navigation via Left/Right (with Smart Nav On)
- [ ] Checkbox is announced
- [ ] Checkbox’s name is announced
- [ ] Checkbox’s state as announced

mcking65 commented 5 years ago

@spectranaut, perhaps we should continue to experiment with both granular and less granular approaches. Maybe we can support both.

The more granular approach was what I initially thought we had to do -- 1 assertion for every combination of key command and accessibility attribute. For the checkbox, that results in:

Read checkbox in reading mode: 18 assertions (that you listed above)
Operate checkbox in reading mode: 2 assertions
Read checkbox grouping in reading mode: 4 assertions
Read checkbox in interaction mode: 4 assertions
operate checkbox in interaction mode: 2 assertions
Read checkbox grouping in interaction mode: 4 assertions

That is 34 assertions (checkboxes to check or not check). Maybe this is necessary; maybe it is more efficient. That is not my intuition. I would be spending a lot of time making sure I had the right ones not checked.

This number baloons when you get to menubar or grid. There are so many more key commands and accessibility attributes. I haven't done precise math for menubar, but I estimate that it is somewhere in the 200 to 400 assertion range. This is for each screen reader browser combination. That would mean 1400 to 2800 assertions for the menubar pattern to get data for the first 7 screen reader browser combinations. It would be a useful exercise to map that one out thoroughly.

We have had discussions surrounding who we want to do the testing. Should it be community based or not? If yes, to what extent?

From what I have seen over the years, I am not convinced we can rely on only volunteer testers. I don't want to say we can't get a portion done that way, but I am fairly certain that we will need dedicated testers with appropriate training and time to work through all the tests in a timely manner, especially if we want to run most of the SR/Browser combinations a couple times a year. I could be wrong on this point. If we make the UX so good that people are keen to help, with the increased level of interest in accessibility that we have now compared to years past, maybe volunteer input would be sufficient.

There could be some clever middle ground based on logical groupings of commands and accessibility attributes that give the tester enough of a checklist to ensure thoroughness, but not so tedious that we make the process unnecesarily inefficient and increase the likelihood of errors. I am also thinking about the errors we may have when generating the tests themselves.

spectranaut commented 5 years ago

I see, @mcking65 thanks -- this well illustrates the concern you mentioned about the exponential growth of assertions.

We could probably design an interface to record only the failures, instead of having to record every success or failure. Maybe that will cut down on what seems like an overwhelming number of assertions. I think it is important to note, either way, the assertions are there -- we are discussing difference in the way the test is presented and the way results are recorded, but in every draft discussed in this issue we still want testers to test every "assertion". For example, we want the tester to listen for all 3 pieces of information for every method of navigation in "Read checkbox in reading mode", whether or not they record the every result. So the question is how can we design the test and get the results we want while allowing the tester to test most efficiently and comfortably.

Questions to answer:

How granular do we want results to be? Probably we should answer this according to the user stories.
What makes it easiest for testers to get us this information? (we can run some trials in the prototype! or maybe before)

I'd like to stress that I hope we can get structure data in response to the test. If we can automate these tests in the future, it would be nice to have continuity of structured test results from the time of manual tests to the time of automated tests. Additionally, I think it's better if we design the tests so that AT user and web programmers can get the "support" information they need from structure data, and not have to dig into the notes field. So if we want a test to reveal information like "what information wasn't announced", I don't want that recorded in the notes field, I'd prefer that in structured data.

While the committee is designing the tests and decided what "support" means for an AT, the notes field from manual testers might prove invaluable. For the purposes of iterating on the test itsself. But I might like to know what you had in mind for what testers should put in the notes field, Matt.

jfhector commented 5 years ago

@mcking65 and I had a useful conversation about this on the call today. The minutes are here: https://www.w3.org/2019/08/28-aria-at-minutes.html

There is a mistake in my notes:

On the interface/table that a tester users, there might be: • at least one row for each key command; • Test result; • Anything needed to make sure that testing that key command in that circumstance is a repeatable thing (e.g. mode) • Something about result and output (JF: I didn't get that)

I believe that Matt wasn't talking about the "interface/table that a tester uses", but more about the data model.

Later in the notes, it becomes clear that we'd like interface/table that testers use to be much simpler, and only progressively disclose extra fields based on their answers. This is described later in the notes.

spectranaut commented 4 years ago

Closed as tests contribution format has been mostly finalized.

w3c / aria-at