jgm / pandoc

Universal markup converter
https://pandoc.org

`meta-json` variable can't be used in a HTML template when some metadata contains HTML tags #8048

Open cderv opened 2 years ago

cderv commented 2 years ago

For context, I am using the meta-json variable to pass Pandoc metadata to the browser so that I can access the values in some JS processing. To do this, I insert the following in my HTML template:

<script id="pandoc-meta" type="application/json">$meta-json$</script>

which allows me to retrieve the metadata values in JS using

// Read the JSON that the template wrote into the <script> element.
const el = document.getElementById('pandoc-meta');
const pandocMeta = el ? JSON.parse(el.textContent) : {};

This worked well until some documents started setting header-includes, for example. The header-includes metadata usually contains HTML code with HTML tags, and having HTML inside the JSON breaks parsing in the browser.

Here is an example.

Let's render with

pandoc -t html test.md -o test.html --template template.html

The HTML produced is

<html>
    <head>
        <script id="pandoc-meta" type="application/json">
            {"header-includes":"<script>console.log(\"test\")</script>"}
        </script>
    </head>
    <body>
        <pre><code>{"header-includes":"<script>console.log(\"test\")</script>"}</code></pre>
    </body>
</html>

$meta-json$ was replaced by the metadata in its JSON representation, but the HTML content is neither escaped nor encoded.

If you open test.html in a browser, you'll notice that the content is parsed incorrectly: the <script> inside the JSON is treated as part of the HTML structure.
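The failure can be reproduced outside a browser. The sketch below only approximates the HTML parsing rule that the content of a script element ends at the first `</script`; the helper name `scriptContentAsSeenByParser` is hypothetical and is not part of Pandoc or of any browser API.

```javascript
// Hypothetical helper: approximates what an HTML parser keeps as the text of a
// <script> element, i.e. everything before the first "</script" (case-insensitive).
function scriptContentAsSeenByParser(raw) {
  const end = raw.toLowerCase().indexOf('</script');
  return end === -1 ? raw : raw.slice(0, end);
}

// The JSON that $meta-json$ produced in the example above.
const metaJson = JSON.stringify({
  'header-includes': '<script>console.log("test")</script>',
});

// Text that would remain inside the pandoc-meta element.
const seen = scriptContentAsSeenByParser(metaJson);

let parseError = null;
try {
  JSON.parse(seen);
} catch (err) {
  parseError = err; // the truncated text is no longer valid JSON
}
```

Under this approximation the JSON text is cut in the middle of a string value, so `JSON.parse` throws instead of returning the metadata.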


This leads me to a question: is $meta-json$ not intended to be inserted into an HTML template? It seemed like a good way to pass data to external tools that process the resulting HTML. My understanding of https://github.com/jgm/pandoc/commit/4361dc0245a65d4f24f2df062684cdb1a0c3bc5a is that this was one of its purposes.

Is there any way to select which metadata fields to include in the JSON string? Or is there any other way to achieve something similar for exporting metadata?

I don't know what it would imply to encode the JSON string in meta-json for inclusion in the output format.

Among possible solutions, I found that escaping the closing </script> as <\/script> for inclusion in <script id="pandoc-meta" type="application/json"> seems to work (this is described in https://www.w3.org/TR/html4/appendix/notes.html#h-B.3.2 and still works). HTML escaping is another option, at least using &lt; for <, which works for inclusion in the body to display the metadata.
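Both escapes can be sketched as plain string functions. This is only an illustration of the transformations, not something Pandoc templates can do today; the names `escapeJsonForScript` and `escapeHtml` are hypothetical, and the code assumes the escaping happens in some step outside the template.

```javascript
// Hypothetical helper: rewrites every "</" as "<\/" so the JSON text can no
// longer contain a literal "</script>". "\/" is a valid JSON escape for "/",
// so JSON.parse returns the original value.
function escapeJsonForScript(json) {
  return json.replace(/<\//g, '<\\/');
}

// Hypothetical helper: minimal HTML escaping for showing the JSON text in the
// body (for example inside <pre><code>), where the browser decodes entities.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

const meta = { 'header-includes': '<script>console.log("test")</script>' };
const safeJson = escapeJsonForScript(JSON.stringify(meta));
const roundTripped = JSON.parse(safeJson); // same value as meta
const bodyText = escapeHtml(JSON.stringify(meta));
```

The first function is meant for the `<script type="application/json">` case, where entities are not decoded; the second is meant for display in the body.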

I did not find a way to process $meta-json$ for that purpose. A pipe function for templates that does string replacement or string escaping would help.

Overall, is this something that could be improved? Or is $meta-json$ only meant to be used to create a JSON data file from a template containing just that variable?

Thanks.

jgm commented 2 years ago

Correct, meta-json is treated as a variable value and inserted directly with no escaping. In your case you can work around this, I think, with

<script type="application/json">//<![CDATA[
   $meta-json$
//]]></script>

jgm commented 2 years ago

I agree that it might be desirable to escape it for the target format, but it's a bit tricky to anticipate what is needed for the full range of cases. e.g. there might be some formats in which you'd want to put meta-json in a special context where it shouldn't be escaped.

cderv commented 2 years ago

Unfortunately, CDATA does not work. I don't think it has any effect anymore when a browser parses the HTML, and in particular it does not prevent the </script> end tag from being parsed. It fails both in the W3C HTML validator and when used in a Pandoc template.

I understand that auto-escaping is not ideal. A pipe function would help in cases like this when dealing with variables in a template, but I don't know whether this is possible.

And I understand the multi-format problem. The JSON form of the metadata is ideal for HTML, though, so I would have expected it to be usable in an HTML file. Retrieving the metadata requires a processing step anyway (parsing the JSON), so encoding it would pair naturally with a decoding step.
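As a rough illustration of that encode/decode pairing, the sketch below percent-encodes the JSON text and reverses it before parsing. It assumes the encoding is done by some step outside Pandoc's template engine (for example a post-processing script); `encodeMeta` and `decodeMeta` are hypothetical names.

```javascript
// Hypothetical helper: percent-encodes the JSON text, which removes "<", ">",
// "/" and quotes, so no HTML tag can appear in the embedded string.
function encodeMeta(meta) {
  return encodeURIComponent(JSON.stringify(meta));
}

// Hypothetical helper: reverses encodeMeta and parses the JSON.
function decodeMeta(encoded) {
  return JSON.parse(decodeURIComponent(encoded));
}

const sampleMeta = { 'header-includes': '<script>console.log("test")</script>' };
const encodedMeta = encodeMeta(sampleMeta);
const decodedMeta = decodeMeta(encodedMeta);
```

In the browser, the decoding side would be something like `decodeMeta(document.getElementById('pandoc-meta').textContent.trim())`, so the extra cost is one call on top of the `JSON.parse` that is needed anyway.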

Anyway, I understand this is not easy. It means $meta-json$ currently cannot be used safely in an HTML template unless we're sure the metadata contains no </script> tag. Simply being able to transform </script> into <\/script> seems to work.

I will probably need to deal with each variable on its own instead of getting the full set of them. But I don't think the value will be the same string as the one in meta-json, right? I'll look into that.

I'll find another solution for now and rethink this; improving the current behavior does not seem straightforward.

RLesur commented 2 years ago

> I agree that it might be desirable to escape it for the target format, but it's a bit tricky to anticipate what is needed for the full range of cases. e.g. there might be some formats in which you'd want to put meta-json in a special context where it shouldn't be escaped.

I wonder whether an escaping pipe would make sense (to be used in an HTML template, e.g. meta-json/HTMLescape)