jgm / pandoc

Universal markup converter
https://pandoc.org

Update <pre class="name"><code> to HTML5 <pre><code class="language-name"> #3858

Open marc-medley opened 7 years ago

marc-medley commented 7 years ago

When converting Markdown to HTML using the --no-highlight option with the fenced_code_attributes extension enabled, pandoc generates <pre class="name"><code> tags.

This request is to replace <pre class="name"><code> with the syntax used in the W3C HTML5 Recommendation's example, <pre><code class="language-name">.

For example, <pre class="markdown"><code> would become HTML5 <pre><code class="language-markdown">.

W3C HTML5 Recommendation: code element

Authors who wish to mark code elements with the language used, e.g. so that syntax highlighting scripts can use the right rules, can use the class attribute, e.g. by adding a class prefixed with "language-" to the element.

Code Example:

The following example shows how a block of code could be marked up using the pre and code elements.

<pre><code class="language-pascal">var i: Integer;
begin
   i := 1;
end.</code></pre>

Prism.js Basic Usage also uses the same syntax as the HTML5 Recommendation example:

Therefore, it only works with <code> elements, since marking up code without a <code> element is semantically invalid. According to the HTML5 spec, the recommended way to define a code language is a language-xxxx class, which is what Prism uses.

mb21 commented 7 years ago

Can confirm on pandoc 1.19.2.1

$ echo -e '```html\nfoo\n```' | pandoc
<div class="sourceCode">
  <pre class="sourceCode html">
    <code class="sourceCode html">foo</code>
  </pre>
</div>

Yet:

$ echo -e '```html\nfoo\n```' | pandoc --no-highlight
<pre class="html"><code>foo</code></pre>

jgm commented 7 years ago

When you do

``` foo
bar
```

in pandoc, it's exactly equivalent to

``` {.foo}
bar
```

so `foo` is just a class.  We don't know whether it's meant to be the name of a language syntax or something else entirely.

So adding the `language-` prefix to all the classes of a code block certainly wouldn't be the right thing to do.  We could, I suppose, add the prefix to class names that correspond to known language names, i.e. to language names that pandoc's own highlighter is aware of.
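
A minimal sketch of that last idea, assuming a lookup against a set of known language names. `KNOWN_LANGUAGES` and `prefix_known_languages` are hypothetical names invented for illustration; this is not pandoc's implementation, and the real list of languages the highlighter knows is much longer.

```python
# Hypothetical, deliberately tiny stand-in for the list of languages the
# highlighter knows about.
KNOWN_LANGUAGES = {"c", "css", "haskell", "html", "pascal", "python"}


def prefix_known_languages(classes, known=KNOWN_LANGUAGES):
    """Prepend 'language-' to known language names; leave other classes alone."""
    result = []
    for cls in classes:
        if cls.lower() in known:
            result.append("language-" + cls)
        else:
            result.append(cls)
    return result


# A known language gets the prefix, an arbitrary class such as "foo" does not.
print(prefix_known_languages(["html", "numberLines"]))
print(prefix_known_languages(["foo"]))
```

The case-insensitive comparison is a choice made for this sketch, so that both `C` and `c` would match.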

marc-medley commented 7 years ago

We could, I suppose, add the prefix to class names that correspond to known language names, i.e. to language names that pandoc's own highlighter is aware of.

For my use case, this would be an OK approach.

Noting that, semantically, the html <code> tag seems to be an appropriate place for a language- class attribute.

jgm commented 7 years ago

We generally want the class on the pre, because highlighting styles often include a background color. It probably wouldn't hurt to put it on the code as well in spans, but it hasn't been necessary.

marc-medley commented 7 years ago

There are two distinct code highlighting use cases:

  1. Use Case default: Pandoc provides the complete code highlighting in html output.

  2. Use Case --no-highlight: Pandoc code highlighter is disabled. Pandoc produces an "intermediate" html. An external highlighter such as prism.js or highlight.js is later applied to the "intermediate" html when loaded into a viewing browser.

This particular issue is only intended to apply to the --no-highlight use case.

So, yes, the default use case should continue with what works for the Pandoc highlighter, e.g. using <pre> for the background color.

Yet, when the --no-highlight option is used then possible downstream highlighters should be considered.

For example, both highlight.js and prism.js can consume the following clean, simple, maintainable html and produce various colored backgrounds along with full syntax highlighting.

<pre><code class="language-css">p { color: red }</code></pre>

Please see highlight.js usage and demo (supports language-abc, lang-abc and abc)
Please see prism.js basic usage and examples

In both the highlight.js and prism.js examples, the <pre> tag does not have any additional attributes.

So, in the --no-highlight use case, the language- class placed on the <code> tag is sufficient and complete for downstream highlighters such as prism.js and highlight.js to also provide background coloring in the final html delivered to the viewing browser.

bpj commented 7 years ago

Couldn't this be handled with a filter which adds the language- prefix to the first class, if any, of all CodeBlock elements, and overrides the builtin HTML rendering of code blocks? It would be very easy with Pandoc::Filter:

#!/usr/bin/env perl
use strict;
use warnings;
use Pandoc::Filter;
use Pandoc::Elements;
use HTML::Entities qw[ encode_entities ];

pandoc_filter 'CodeBlock' => sub {
    my $attrs = stringify_attrs($_); # here $_ is a reference to element object
    return unless length $attrs;     # default rendering OK
    my $content = encode($_->content);   # the code
    return RawBlock html => qq(<pre><code $attrs>$content</code></pre>);
};

sub stringify_attrs {
    my($elem) = @_;
    my $kv = $elem->keyvals; # get Hash::MultiValue object
    my @attrs;
    if ( my @classes = $kv->get_all('class') ) {
        $kv->remove('class');
        @classes = map {; encode($_) } @classes; # shouldn't be needed!
        push @attrs, qq(class="language-@classes");
    }
    ATTR:
    for my $attr ( sort keys %$kv ) {
        my @values = $kv->get_all($attr);
        next ATTR unless @values;
        push @attrs, map{ $_ = encode($_); qq($attr="$_"); } sort @values;
    }
    return "@attrs"; # array items as space-separated string
}

sub encode { encode_entities $_[0], '<>&"' }

Note that this requires that the author has the discipline to make sure that it always is appropriate, or at least doesn't break anything, to prefix language- to any first class of a codeblock.

Note also that I'm writing this on my tablet, so the code is untested but it should do the right thing.

gkjpettet commented 6 years ago

This seems to still be an issue with the current version of pandoc. Even using the --no-highlight option I'm still seeing the class added to the <pre> tag and no class added to the <code> tag.

averms commented 6 years ago

Here is a lua filter to do this. It passes through any classes that don't match a programming language name and all ids. Attributes are stripped, but I'm not sure too many people use them anyways.

Just use --lua-filter standard-code.lua

mrchypark commented 5 years ago

How is this issue going?

https://github.com/jgm/pandoc/issues/3858#issuecomment-324128577 <- this option looks good for me because I want to use highlight.js.

jgm commented 5 years ago

There are a couple of possibilities here:

  1. Change the HTML writer so that, when --no-highlight is used (i.e., writerHighlightStyle opts == Nothing), pandoc produces a language-LANG class on the code elements in both inline and block code.

    A question is how the language is identified. (A code span or block may have a number of classes, only one of which is the language -- or it may be that none of the classes are languages.) One possibility would be to check a list of known languages. We could, perhaps, include the list that highlight.js currently supports.

    Anyway, on this approach you could write

    ``` C
    int i = 0;
    ```

    and it would be rendered

    ```html
    <pre><code class="language-C">int i = 0;
    </code></pre>
    ```

  2. Another approach would involve a much more minimal modification. This would simply move any class beginning with language- to the code tag instead of the pre tag, in rendering HTML. It would be insensitive to the setting of --no-highlight. With this approach you'd write

    ``` language-C
    int i = 0;
    ```

  3. Another idea would be to always add language- to a single word after the opening code backticks, so that

    ``` C
    int i = 0;
    ```

    would be parsed as a code block with class `language-C` rather than `C`.  The logic for highlighting could be modified so that we first check the classes for `language-X`, then for known languages (so a class `C` would also work).  The main drawback of this approach is that it could break some current setups that are assuming that the class name will be `C`.
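
As a rough illustration of the second possibility, the class-splitting step could look like the sketch below. It is written in Python rather than in the Haskell of pandoc's HTML writer, and `render_code_block` is a hypothetical helper, not an existing pandoc function.

```python
from html import escape


def render_code_block(classes, code):
    """Render a code block, moving 'language-*' classes from <pre> to <code>."""
    code_classes = [c for c in classes if c.startswith("language-")]
    pre_classes = [c for c in classes if not c.startswith("language-")]

    def class_attr(names):
        # Emit no class attribute at all when there is nothing to put in it.
        if not names:
            return ""
        return ' class="{}"'.format(escape(" ".join(names), quote=True))

    return "<pre{}><code{}>{}</code></pre>".format(
        class_attr(pre_classes), class_attr(code_classes), escape(code, quote=False)
    )


print(render_code_block(["language-C"], "int i = 0;\n"))
```

Classes without the prefix stay on the pre tag, so styles that rely on them there would presumably keep working.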

marc-medley commented 5 years ago

@jgm I would go with possibility 3., with 1. as a second choice, based on the following notes…

Possibility 1. any class after opening code backticks

Performance & Maintenance Issue: Looks up each class against some ever-evolving language name list, such as the Prism.js or highlight.js supported languages.

Possibility 2. use language-LANG in markdown

Breaks Markdown Editing Highlights Issue: Breaks source and preview highlighting in many markdown editing environments. Widely used markdown fenced code syntax uses just the language name: c, java, swift, etc as the first word after the opening code fence.

Here is an example from editing markdown in Atom:

[Image: markdown code fencing]

Note: Requiring language- in markdown code fences breaks thousands of markdown files in my use case.

Possibility 3. use first word after opening code backticks

In my use case, the first word (if present) is the code language name.

Always add language- to a single word after the opening code backticks

Fenced C code in markdown input:

[Image: fenced C code]

Renders HTML5 recommendation compliant output:

<pre><code class="language-c">int i = 0;
</code></pre>

Note: may need to recognize no-highlight (in markdown) as a case for not adding any language highlight class when multiple classes are used after an opening code fence. (Just mentioned for completeness, for use cases which also have non-language classes, although this is not my current use case.)

kiwi0fruit commented 5 years ago

I guess it's too late to worry about adding yet another command line option. So the best approach is

  1. maybe use --no-highlight
    • ~move classes to <code> instead of <pre> when --no-highlight~ ,
  2. Add new CLI option --language-prefix that adds language-* to the first class.

~At the moment it's not fixable via pandoc filters: I need to iterate via beautiful soup to move class...~ Nope.

Both Highlight.js and Prism.js work with attributes set on <pre>

PS: By the way, if there is something to worry about with CLI options, it is that they are not in alphabetical order in the --help output.

UPD: simple pandoc filter like this solves the issue.

jrtechs commented 4 years ago

I "fixed" this on my website using some hacky regex operators on the HTML produced by pandoc. However, it would be nice if pandoc added a flag to fix this.

// result is the HTML string produced by pandoc
var re = /<pre class=".*?"><code>/;
while (result.search(re) != -1)
{
    var preTag = result.match(re)[0];
    // index of the closing quote of the class attribute
    var finishIndex = preTag.split('"', 2).join('"').length;
    // '<pre class="' is 12 characters long, so the class value starts at 12
    var lang = preTag.substring(12, finishIndex);
    var newHTML = `<pre><code class="language-${lang}">`;
    var original = `<pre class="${lang}"><code>`;
    // replace every occurrence of this exact opening tag pair
    result = result.split(original).join(newHTML);
}

krontzo commented 4 years ago

Both Highlight.js and Prism.js works with attributes set to <pre>

I am trying to have line-numbers in latest version of reveal.js. The included version of highlight.js supports line numbers but only in the <code> tag.

So, I have two questions:

jgm commented 4 years ago

You could do it with a filter, by replacing each CodeBlock element with a RawBlock (Format "html") and building the HTML yourself. A bit tedious, and you'd need to be careful about escaping, but not too hard.
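
A minimal sketch of that approach, operating on pandoc's JSON AST with only the Python standard library. The function names are made up for illustration, and putting the id, classes and attributes on the code tag with a language- prefix on the first class is an assumption of this sketch, not something pandoc does.

```python
import json
from html import escape


def code_block_to_raw_html(block):
    """Turn one CodeBlock node into a RawBlock holding hand-built HTML."""
    (identifier, classes, keyvals), code = block["c"]
    attrs = []
    if identifier:
        attrs.append('id="{}"'.format(escape(identifier, quote=True)))
    if classes:
        # Assumption: the first class names the language.
        names = ["language-" + classes[0]] + list(classes[1:])
        attrs.append('class="{}"'.format(escape(" ".join(names), quote=True)))
    for key, value in keyvals:
        attrs.append('{}="{}"'.format(escape(key, quote=True), escape(value, quote=True)))
    open_tag = "<code " + " ".join(attrs) + ">" if attrs else "<code>"
    html = "<pre>{}{}</code></pre>".format(open_tag, escape(code, quote=False))
    return {"t": "RawBlock", "c": ["html", html]}


def walk(node):
    """Recursively replace every CodeBlock in a pandoc JSON AST fragment."""
    if isinstance(node, list):
        return [walk(item) for item in node]
    if isinstance(node, dict):
        if node.get("t") == "CodeBlock":
            return code_block_to_raw_html(node)
        return {key: walk(value) for key, value in node.items()}
    return node


# Demo on a hand-written AST fragment instead of a document read from pandoc.
sample = {"blocks": [{"t": "CodeBlock", "c": [["", ["css"], []], "p { color: red }"]}]}
print(json.dumps(walk(sample)))
```

To use this as an actual JSON filter, the script would read the document with `json.load(sys.stdin)` and write the result with `json.dump(..., sys.stdout)` in place of the demo lines.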

krontzo commented 4 years ago

Thank you for the information and your quick response. I'll give it a try.

tarleb commented 4 years ago

I believe this should do the trick: https://github.com/pandoc/lua-filters/tree/master/revealjs-codeblock

krontzo commented 4 years ago

Thank you very much for the information. It works for me.