jgm / pandoc

Universal markup converter
https://pandoc.org

Pandoc 2.x renders images' alternative texts in an inaccessible fashion #6491

Closed jmuheim closed 4 years ago

jmuheim commented 4 years ago

As stated on Stack Overflow (https://stackoverflow.com/questions/62639927/pandoc-2-x-renders-images-alternative-texts-in-an-inaccessible-fashion?noredirect=1#comment110781365_62639927), Pandoc 2.x renders images' alternative texts in an inaccessible fashion. I was told there to ask for a bug fix here.

Here's the original post:

Since I upgraded from Pandoc v1.19 to 2.9, decorative images are not exported as expected anymore.

First of all, when generating HTML from ![](test.jpg), in v1.19 a <p class="figure"> structure was wrapped around the image, but now it's only a <p>:

<p>
  <img src="test.jpg">
</p>

This makes it harder to style in line with other images that have an alternative text.

But what's really a problem here: there's no alt="" attribute produced anymore! This means that e.g. screen readers will not recognise this as a decorative image anymore.

So let's see what happens to an image with an actual alternative text, e.g. when generating HTML from ![Hello](test.jpg):

<div class="figure">
  <img src="test.jpg" alt="">
  <p class="caption">Hello</p>
</div>

Here we get a class="figure" on the surrounding element, but now it's a <div> instead of a <p> (I'm not too bothered by this, but again, it makes it harder to style everything the same).

What is again a big problem, though, is that the alt attribute is now empty: this prevents screen readers from perceiving the image at all, which is horribly wrong! I guess that Pandoc concludes that having alternative text and caption would be redundant, which is correct, and that the caption below would be the right thing to show - which it is not.

The right structure would look something like this:

<div class="figure">
  <img src="test.jpg" alt="Hello"><!-- Leave the alternative text on the image -->
  <p class="caption" aria-hidden="true">Hello</p><!-- Hide the redundant visual alternative text from screen readers -->
</div>

Any reason why this behaviour would make sense? Can it be changed somehow? Otherwise I will have to fiddle around with some post-processing JavaScript...
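
The structure proposed above can be sketched as a small Ruby helper. This is only an illustration under stated assumptions: accessible_figure is a hypothetical name, it is not part of Pandoc, and it simply builds the suggested markup from an image source and its alternative text using Ruby's standard library.

```ruby
require 'cgi'

# Hypothetical helper (not part of Pandoc): builds the figure markup
# proposed above from an image source and its alternative text.
def accessible_figure(src, alt_text)
  src  = CGI.escapeHTML(src)
  text = CGI.escapeHTML(alt_text.to_s)

  # Decorative image: keep an explicit empty alt so screen readers skip it.
  return %(<img src="#{src}" alt="">) if text.empty?

  # Informative image: the alt text stays on the image, and the visible
  # caption is hidden from assistive technology so it is not read twice.
  <<~HTML.strip
    <div class="figure">
      <img src="#{src}" alt="#{text}">
      <p class="caption" aria-hidden="true">#{text}</p>
    </div>
  HTML
end

puts accessible_figure('test.jpg', 'Hello')
puts accessible_figure('test.jpg', '')
```

Escaping both values keeps the output well-formed when the alternative text contains quotes or angle brackets.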

tarleb commented 4 years ago

I started to implement this, but was given pause by the fact that this would cause pandoc to produce invalid XHTML when targeting HTML4. @jmuheim, do you know of a good workaround for HTML4?

On the other hand, we already produce invalid XHTML for any document which includes code blocks, as line numbers contain the aria-hidden="true" attribute.

jmuheim commented 4 years ago

Interesting. You mean because aria-hidden has a dash in the attribute name, right?

I don't know of a good technical workaround. I could think of doing something like this, which would work in some situations:

<figure>
  <img src="..." alt="See below" />
  <figcaption>Bla bla bla</figcaption>
</figure>

But this isn't really a general solution.

In my honest opinion, though, it is far more important not to programmatically exclude users (especially users with special needs, who already face plenty of obstacles) than to avoid minor code invalidities. And since you say that aria-hidden is already emitted in code blocks for HTML4, we should definitely not hesitate to add it for alternative texts too.

mb21 commented 4 years ago

Is this issue only about HTML4 output? I think much of the reason we do things the way we do is that in HTML5 (which is the default) we produce a figure tag...

"I guess that Pandoc concludes that having alternative text and caption would be redundant,"

Yes.

"and that the caption below would be the right thing to show - which it is not"

Well... why not? HTML5 output is:

<figure>
  <img src="foo.jpg" alt="" />
  <figcaption>bar</figcaption>
</figure>

jmuheim commented 4 years ago

mb21 wrote: "well.. why not? HTML5 output is:"

<figure>
  <img src="foo.jpg" alt="" />
  <figcaption>bar</figcaption>
</figure>

As far as I know, screen readers will always treat images with empty alt attribute as purely decorative, so the user will never know about them. For instance, they will not show them in a list of images or any other functionality that screen readers offer.

While it may seem counterintuitive to non-blind people, blind people also make use of images, e.g. saving them to their hard drive or uploading them to social media portals. So we should never prevent them from accessing the same elements that others can.

tarleb commented 4 years ago

Furthermore, here is what MDN says about the alt attribute.

Omitting alt altogether indicates that the image is a key part of the content and no textual equivalent is available. Setting this attribute to an empty string (alt="") indicates that this image is not a key part of the content (it’s decoration or a tracking pixel), and that non-visual browsers may omit it from rendering. Visual browsers will also hide the broken image icon if the alt is empty and the image failed to display.

Figures are rarely just decoration, and I think leaving users in the dark about the existence of an image is not good.
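
The distinction drawn in the MDN passage can be sketched in a few lines of Ruby. This is a minimal illustration, not anything Pandoc or MDN provides: alt_state is a hypothetical name, and the image is represented as a plain Hash of attribute names to values so that the sketch needs no HTML parser.

```ruby
# Hypothetical helper: classifies an image by the state of its alt
# attribute. The first two cases follow the MDN passage quoted above;
# the third is the ordinary case where alt carries a description.
# `attrs` is assumed to be a Hash of attribute names to values.
def alt_state(attrs)
  # alt omitted entirely: key content, no textual equivalent available
  return :no_text_equivalent unless attrs.key?('alt')
  # alt="": not a key part of the content, may be skipped
  return :decorative if attrs['alt'].to_s.empty?

  # alt carries text that non-visual browsers can present
  :described
end

p alt_state('src' => 'test.jpg')
p alt_state('src' => 'test.jpg', 'alt' => '')
p alt_state('src' => 'test.jpg', 'alt' => 'Hello')
```

Under this reading, a figure whose only text lives in the caption while alt is empty falls into the decorative case, which is the behaviour the thread objects to.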

mb21 commented 4 years ago

Pretty sure we actually changed this to the way it currently is at the request of a blind person generating EPUB... but I cannot find the issue anymore...

tarleb commented 4 years ago

Found the issue: #4737

jgm commented 4 years ago

I didn't know til now that hyphenated attribute names aren't allowed in XHTML. Interesting. We do try to create polyglot HTML, and this is especially important because we use the HTML writer in creating EPUBs. EPUB contents are supposed to be XHTML. On the other hand, I haven't heard any reports that the hyphenated aria- attributes have caused problems with any e-readers or with epub validation.

tarleb commented 4 years ago

I tried two EPUB2 validators with current pandoc output, and they fail if the input contains a syntax highlighted code block. The PR therefore leaves the HTML4/XHTML output as it was, and just updates HTML5 output to include the suggested changes.

jmuheim commented 4 years ago

Any news on this? I will fix the issue on my side with some (ugly) JavaScript, looking out for the inaccessible code created by Pandoc and fixing it.

jmuheim commented 4 years ago

Just for the record: instead of using JavaScript, I decided to put it into my markdown method in Ruby. This is faster, cleaner, and better suited for automated testing.

If anyone else needs inspiration for a similar thing:

require 'nokogiri'
require 'pandoc-ruby'

module MarkdownHelper
  def markdown(string)
    html = PandocRuby.convert(string).strip

    nokogiri = Nokogiri::HTML::DocumentFragment.parse(html)

    nokogiri = clone_alt_into_img_and_hide_figcaption_from_sr(nokogiri)
    nokogiri = add_empty_alt_to_decorative_img(nokogiri)

    # html_safe is provided by Rails (ActiveSupport)
    nokogiri.to_html.html_safe
  end

  # Pandoc removes the content of an image's alt attribute, as the text is
  # also available inside figcaption (to avoid screen reader redundancies).
  # This is terrible though, as this renders the image itself invisible to
  # screen readers. So we clone the alternative text back into the alt
  # attribute again, and place an aria-hidden on figcaption.
  #
  # See https://github.com/jgm/pandoc/issues/6491
  def clone_alt_into_img_and_hide_figcaption_from_sr(nokogiri)
    nokogiri.css('figure').each do |figure|
      img        = figure.at_css('img')
      figcaption = figure.at_css('figcaption')

      # Skip figures that lack an image or a caption
      next unless img && figcaption

      img['alt'] = figcaption.text
      # Attribute values are strings, so pass 'true' rather than a boolean
      figcaption['aria-hidden'] = 'true'
    end

    nokogiri
  end

  # Pandoc doesn't add an empty alt-attribute if the alternative text is left
  # empty. Because screen readers announce the file name in this situation,
  # we add an empty alt-attribute here.
  def add_empty_alt_to_decorative_img(nokogiri)
    nokogiri.css('img:not([alt])').each do |img|
      img['alt'] = ''
    end

    nokogiri
  end
end