w3c / wcag

Web Content Accessibility Guidelines
https://w3c.github.io/wcag/guidelines/22/

Validity of independent review (retitled) #1622

Open DavidMacDonald opened 3 years ago

DavidMacDonald commented 3 years ago

The document cites a Brajnick et al., 2012 study that is not provided and is not in the reference list. Based on one bullet point drawn from that study, it states:

"This means that an 80% target for agreement, when audits are conducted without communication between evaluators, is not attainable, even with experienced evaluators." link to quote in Challenges doc

I think this is unnecessarily disparaging to WCAG. I think this sentence should be removed. I recently evaluated an international site. I was in Canada and another professional in Paris conducted an evaluation of the same pages without any communication. We had a strong correlation. Much higher than 80%.

alastc commented 3 years ago

Hi @DavidMacDonald, sorry, which document?

bruce-usab commented 3 years ago

That's an important catch @DavidMacDonald, so I hope you can track down the right GitHub page to be suggesting an edit!

The "not attainable" conclusion (as represented in the Challenges doc) is just factually incorrect: (1) I am pretty sure it misrepresents (mathematically) what we as WG members understand as 80% reliability for inter-rater agreement, and (2) it is disproven (as currently stated) by a single counter-example (as you provided).

I will note that a core motivation for the methodology underpinning DHS Trusted Tester is to have a repeatable process which is as unambiguous as humanly possible. They aim for much, much higher agreement on any one test than 80%! Moreover, the TT credentialing process is aimed at allowing inexperienced evaluators to achieve high inter-rater reliability.

Finally, I would note that the ACT rules aim for 100% inter-rater reliability.

I would love to read the article though.

EDIT: Here is a cite from ResearchGate. The title is "Is accessibility conformance an elusive property? A study of validity and reliability of WCAG 2.0" and the author's last name is Brajnik (not Brajnick, so no c).

bruce-usab commented 3 years ago

I am of the opinion that just deleting the last bullet (exactly the bit that David excerpted) is a reasonable fix for now.

My pull request also corrects the spelling of author's last name, adds the article title, and provides a link. I don't think this article needs to be added to the References section at this time.

bruce-usab commented 3 years ago

I ran the article abstract by my colleague @kengdoj and thought I would share her observations:

jspellman commented 3 years ago

I think we need a different solution, as I have spent several hours searching old archives to see if I can find the original paper. We had a data loss when the structure of Google Drive that belonged to a no-longer-active member was deleted. The data still exists (so I am told), but I can't find it. As a side note, we are encouraging W3C to find a Google Drive solution, because Drive is accessible to some people with disabilities and we expect to keep using it in the future.

I have been thinking about the possibilities of addressing the problem. First, the paper with the 80% figure is dated. It would be helpful to find the date, but I remember it as being associated with the release of WCAG 2.0, so I would suspect it is in the 2008-2012 time frame. If there is more recent research with a different percentage, then I would recommend using it. I don't think the Silver Task Force would object to using updated research. Otherwise, use the 80% with the note that the research is associated with the release of WCAG 2.0.

Members of the Silver Task Force (myself included) have been loath to see the Silver Problem Statements submerged in the Challenges document because they were the result of research with academic and corporate researchers. However, I would like to propose a way forward. I would be amenable to paraphrasing the Silver research results as long as there are frequent references to the Silver Problem Statements.

The Silver research was broader in scope than the Challenges, because the Silver research addressed a wider population than large organizations. I still do not want to see the Challenges document used to justify changes to the WCAG3 Requirements or to WCAG3 itself. The Challenges document is the opinion of a relatively small (but influential) group of people and should not be considered of greater importance than the research.

A paragraph in the Introduction could explain that. I am open to further discussion and ideas of a way forward. I would also like to hear from @slauriat on this issue. I have flagged it as a topic for a Silver leadership discussion.

bruce-usab commented 3 years ago

@jspellman I am pretty sure I linked to the article in question in my first reply in this issue thread. Here is that URL: https://www.researchgate.net/publication/235339930_Is_accessibility_conformance_an_elusive_property_A_study_of_validity_and_reliability_of_WCAG_20

March 2012 is the date. I tried to get the article text directly via ResearchGate but they have not approved my request (even after I ticked the boxes for reconsideration). I choose to believe that it is an automaton making that choice!

@sajkaj - I think you may have renamed this issue with maybe what was supposed to be a comment. I cannot quite tell what is going on.

johnfoliot commented 3 years ago

From the referenced URL:

Date:

March 2012

Abstract:

The Web Content Accessibility Guidelines (WCAG) 2.0 separate testing into both “Machine” and “Human” audits; and further classify “Human Testability” into “Reliably Human Testable” and “Not Reliably Testable”; it is human testability that is the focus of this paper. We wanted to investigate the likelihood that “at least 80% of knowledgeable human evaluators would agree on the conclusion” of an accessibility audit, and therefore understand the percentage of success criteria that could be described as reliably human testable, and those that could not.

In this case, we recruited twenty-five experienced evaluators to audit four pages for WCAG 2.0 conformance. These pages were chosen to differ in layout, complexity, and accessibility support, thereby creating a small but variable sample. We found that an 80% agreement between experienced evaluators almost never occurred and that the average agreement was at the 70-75% mark, while the error rate was around 29%. Further, trained—but novice—evaluators performing the same audits exhibited the same agreement to that of our more experienced ones, but a reduction on validity of 6-13%; the validity that an untrained user would attain can only be a conjecture. Expertise appears to improve (by 19%) the ability to avoid false positives.

Finally, pooling the results of two independent experienced evaluators would be the best option, capturing at most 76% of the true problems and producing only 24% of false positives. Any other independent combination of audits would achieve worse results. This means that an 80% target for agreement, when audits are conducted without communication between evaluators, is not attainable, even with experienced evaluators, when working on pages similar to the ones used in this experiment; that the error rate even for experienced evaluators is relatively high and further, that untrained accessibility auditors be they developers or quality testers from other domains, would do much worse than this.
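To make the "at least 80% of knowledgeable human evaluators would agree" threshold from the abstract concrete, here is a minimal sketch of one plausible way to compute it. This is an assumption for illustration only: the abstract does not say which agreement measure the authors used, and the function names, criterion labels, and verdict data below are all hypothetical.

```python
from collections import Counter


def majority_agreement(verdicts):
    """Return the share of evaluators who gave the most common verdict."""
    counts = Counter(verdicts)
    return counts.most_common(1)[0][1] / len(verdicts)


def reliably_testable(verdicts, threshold=0.8):
    """True when at least `threshold` of evaluators reached the same verdict."""
    return majority_agreement(verdicts) >= threshold


# Hypothetical verdicts from five independent evaluators (not data from the paper).
audit = {
    "criterion_a": ["pass", "pass", "pass", "pass", "fail"],  # 4 of 5 agree
    "criterion_b": ["pass", "fail", "pass", "fail", "pass"],  # 3 of 5 agree
}

for name, verdicts in audit.items():
    share = majority_agreement(verdicts)
    print(f"{name}: {share:.0%} agreement, reliable={reliably_testable(verdicts)}")
```

Under this reading, agreement is measured per success criterion, so one page audit can meet the 80% target on some criteria and miss it on others. Other measures (for example, chance-corrected statistics) could give different numbers for the same verdicts.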

While the data is 10 years old, I believe the main conclusions remain relevant, as they were evaluating "...human testability that is the focus of this paper..." and the ability to elicit reliable results rather than which version of WCAG they were using. If newer research is available I'm all for reviewing it.

JF


-- John Foliot | Principal Accessibility Specialist

"I made this so long because I did not have time to make it shorter." - Pascal "links go places, buttons do things"

bruce-usab commented 3 years ago

@johnfoliot et al., the problematic bullet @DavidMacDonald cites in this issue is a direct excerpt from that abstract you pasted in: "This means that an 80% target for agreement, when audits are conducted without communication between evaluators, is not attainable, even with experienced evaluators." See Conformance Challenges, Themes from Research (https://www.w3.org/TR/2020/WD-accessibility-conformance-challenges-20200619/#theme).

From the abstract, it does seem to be true that these researchers came to that conclusion.

It is not, however, a factually correct statement. It does not IMHO belong in the Challenges document. Moreover, the formatting ascribes more authority than the bullet warrants. Maybe it is just me, but before digging out that citation, I didn't realize the bullet was a quotation. After reading the abstract, I would argue that characterizing that bullet as a theme from research overstates what is really just one data point. It is an assertion from one study, where the authors' own abstract provides evidence of a flawed methodology.

johnfoliot commented 3 years ago

@bruce-usab I'm not really sure what your point is: are you suggesting that the conclusion is not a fact? I disagree - it is a fact. It may not be a conclusion that everyone agrees to, but the conclusion as published is a fact: it's their conclusion.

More importantly, it is research that is supporting concerns raised by multiple parties (including myself) around the need for non-subjective measurements for conformance. Reliance on individual subjective determinations will certainly introduce the types of concerns addressed by this research paper, and with the current trajectory, likely introduce more concern, not lessen it.

And while it may only be one bullet point (data point), it is nonetheless a significant one, and one (again) backed by some research by academics.


bruce-usab commented 3 years ago

@johnfoliot GitHub put my addy in plain text so I edited your comment. (Not that my email is hard to find, but who needs the extra spam?) FWIW, I don't seem to have your current email (or I would have asked you to edit your comment). Also, I find it surprising that I could edit your comment!

Correct, I am saying that the conclusion is not fact.

It may be a fact that the paper authors made such a conclusion, but I regard that as irrelevant to the premise that AG WG should include this particular bullet in the Conformance Issues document. But without reading the article, I am not confident that the authors reach this conclusion. The phrasing used in the abstract is not entirely unambiguous.

FWIW, I agree that inter-rater reliability is not what we want it to be. But I strongly disagree that an 80% target for inter-rater reliability is not attainable. The assertion that an 80% target for agreement is not attainable is simply not credible. Assertions which are not credible should not be repeated verbatim in an AG WG document. (Or at least not without lots of context and/or caveats.)

johnfoliot commented 3 years ago

I'm sorry Bruce, but the authors of that paper came to a conclusion: that is an undisputed fact.

You may not agree with the conclusion, you may believe that it is not relevant to the discussion, but it is factual that they came to a conclusion, and their conclusion is one that supports concerns and comments that others have articulated (including myself).

I noted with interest the comments from Gregg Vanderheiden https://lists.w3.org/Archives/Public/public-silver/2021Mar/0002.html, former chair of the WCAG 2.0 Working Group, who wrote:

"I know the pain that can lead one to do this. We had the same problem in WCAG 2.0. We actually spent enormous time on cognitive language learning disabilities for example (more than on any other single disability) trying to find provisions that would address their needs and yet would be objective and meet the criteria necessary for a testable provision. We called in Nancy Ward, Clayton Lewis and a whole host of other people to talk with us and propose provisions that might work. John Slaton and I launched two, many-months-long efforts on both the cognitive language and learning disability area and the use of plain language in the guidelines. It was the most frustrating thing I have ever done in my life. Seeing the needs, but being unable to identify or find ways to qualify as strategies from all the materials we read, and people we talk to, was the most difficult and frustrating part of the work on WCAG.

In the end the group will need to either rename the document and have it be a really wonderful guidance document with broad scope for including guidance provisions, or return to the WCAG 2x like criteria for selecting provisions that is needed in a standard that could be adopted in regulation. This latter choice would, of course, put you back in the same bind as the existing WCAG 2 thread. It is aggravating, bang-head-against-the-wall frustrating, etc. but that is the situation."

Gregg also notes:

"If the provisions in a standard are not objective, the very first time it shows up in court, the defendants will cite, accurately, that the provision is not objective but rather is subjective. And as a result, it is not enforceable."

One of the conclusions of that paper (as I understand it) is that when it comes to subjective evaluations, they were unable to demonstrate that even experienced evaluators could agree on some of the subjective determinations we already have in WCAG 2.x. Gregg continues:

"In order for something to be a standard, particularly a standard that is going to be used in regulation of any type,

[JF: this is where the cited research paper is relevant, as the research concluded that this "high-level 'inter-rater-reliability'" could not be proven today]

This constrains the types of provisions or requirements that you can have in a standard. Often leaving out guidance you would like to include but cannot reduce to an objectively testable requirement."

I eagerly anticipate the WG's response to Gregg's comments.

JF


bruce-usab commented 3 years ago

@johnfoliot, sticking to the very narrow issue raised by @DavidMacDonald at the start of this thread, which I now find that @sajkaj has (accidentally) over-written, we can recognize the real concern raised by these researchers without this particular quotation from the abstract. Further, I would argue that including the quote (because it is so easily debunked) is counter-productive to the important lesson that inter-rater reliability needs to be improved.

sajkaj commented 3 years ago

My apologies to everyone, and especially to @DavidMacDonald, for over-writing the head of this issue. David's original text has now been restored. It had been my intent to comment, but I misused the hub command. My apologies.

sajkaj commented 3 years ago

The section of the Challenges document being discussed in this issue is a straight copy and paste from Silver Problem Statements. As @jspellman notes above, there was a data loss event that resulted in a loss of all the hyperlinks in the original, as well as in the copy submitted into the Challenges doc. So, I am gratefully accepting the citation on behalf of the Challenges doc, and I'm sure a PR against the original would also be welcome. If you can help with additional citations missing from Challenges (and from the upstream doc), I'm confident we'd all appreciate having those. I am, however, leaving the conclusion drawn from Silver Research for further discussion in Silver and AGWG. I don't feel it's appropriate for me, as document Editor, to make that substantive change on my own. Meanwhile, please note the current Editor's Draft for Challenges has moved Section 5 to [Appendix C in the latest Challenges draft](https://raw.githack.com/w3c/wcag/conformance-challenges-5aside/conformance-challenges/index.html#silver-research-problem-statements). Please now create a PR against that draft.

bruce-usab commented 3 years ago

Reopening because the pull request did not actually address the issue raised by @DavidMacDonald. In my opinion, this is something the AG WG would appreciate having called to their attention.

alastc commented 3 years ago

Hi @sajkaj, as the Silver problem statements are not an official (draft) note, there is a higher bar. This issue should remain open until the original point is addressed, or it comes to the group to agree not to address it.

sajkaj commented 3 years ago

Agreed. My bad--yet again in this issue. I meant to hit "Comment," not "comment and close." But, I was in too much of a hurry to post before being late to the Silver call. I agree the underlying question remains unresolved, even if the citation is now available.