jgm / pandoc

Universal markup converter
https://pandoc.org

dump metadata along with document fragment #2019

Closed zackw closed 8 years ago

zackw commented 9 years ago

I am looking for a way to get pandoc to dump out metadata along with a document fragment. This should behave as follows:

An example would probably help: given


---
authors: [joe bloggs, fred mbogo]
title: This title contains *emphasis* and $m$-ath
...

This is the body of the document

pandoc -t html5+metadata --mathml should produce something like

<!-- metadata:
{"authors":["joe bloggs","fred mbogo"],"title":"This title contains <em>emphasis</em> and <math><mrow><mi>m</mi></mrow></math>-ath"}
:metadata -->
<p>This is the body of the document</p>

It's quite possible that there's already a way to do something like this and I just can't find it, in which case I would appreciate a pointer.

A way to dump only the metadata, but still applying a rendering, would also be useful.

lierdakil commented 9 years ago

This could be done with a filter, e.g.

import Text.Pandoc.JSON
import Text.Pandoc
import Data.Aeson.Encode
import Data.Aeson.Types
import Data.ByteString.Lazy.UTF8
import Data.List
import qualified Data.Map as M

main :: IO ()
main = toJSONFilter inputMeta

inputMeta :: Pandoc -> Pandoc
inputMeta (Pandoc m b) = Pandoc m (mb:b)
  where
    mb = RawBlock (Format "html") $
      "<!-- metadata:\n" ++ toString (encode $ metaToJSON m) ++ "\n-->"

metaToJSON :: Meta -> Value
metaToJSON (Meta m) = toJSON $ M.map metaValueToJSON m

metaValueToJSON :: MetaValue -> Value
metaValueToJSON (MetaMap m) = toJSON $ M.map metaValueToJSON m
metaValueToJSON (MetaList xs) = toJSON $ map metaValueToJSON xs
metaValueToJSON (MetaString t) = toJSON t
metaValueToJSON (MetaBool b) = toJSON b
metaValueToJSON (MetaInlines ils) = toJSON $ toHtml ils
metaValueToJSON (MetaBlocks bs) = toJSON $ toHtml' bs

toHtml :: [Inline] -> String
toHtml ils = html
  where
    html = writeHtmlString options $ Pandoc nullMeta [Plain ils]

toHtml' :: [Block] -> String
toHtml' bs = writeHtmlString options $ Pandoc nullMeta bs

options :: WriterOptions
options = def{writerHTMLMathMethod=MathML Nothing}

I don't think this makes a ton of sense as core functionality. I would, however, appreciate a built-in metaToJSON/metaValueToJSON, as well as a way to render Inlines to a String in a given format without weird prefix-stripping, though the latter doesn't make sense for all formats. UPD: Silly me, the Plain block-level element already covers that.

lierdakil commented 9 years ago

Note that, whatever you want this for, you are probably better off just writing a filter for it. You can use Haskell, Python, or in fact any language that can handle JSON input and output (e.g. Node.js), though Haskell and Python are the supported ones. You might want to look at http://johnmacfarlane.net/pandoc/scripting.html
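To make the "any language that can handle JSON" point concrete, here is a minimal Python sketch of a filter's core. It is an illustration, not part of pandoc: `split_document` and `filter_json` are hypothetical names, and the two document shapes handled are an assumption (the list form with `unMeta` used by pandoc at the time of this thread, and the object form with `meta` and `blocks` keys used by later releases).

```python
import json


def split_document(doc):
    """Return (meta, blocks) for either assumed shape of pandoc's JSON AST."""
    if isinstance(doc, dict):
        # Later shape: {"pandoc-api-version": [...], "meta": {...}, "blocks": [...]}
        return doc["meta"], doc["blocks"]
    # Earlier shape: [{"unMeta": {...}}, [...blocks...]]
    return doc[0]["unMeta"], doc[1]


def filter_json(text, edit=None):
    """Parse the AST, let `edit` mutate it in place, and serialize it again."""
    doc = json.loads(text)
    meta, blocks = split_document(doc)
    if edit is not None:
        edit(meta, blocks)  # hypothetical hook: mutates meta/blocks in place
    return json.dumps(doc)
```

A script built on this would end with `sys.stdout.write(filter_json(sys.stdin.read()))` and be passed to pandoc with `--filter`.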

zackw commented 9 years ago

I have experimented with this approach, and I think I can make it work, but it is suboptimal.

For context, I am trying to improve the metadata handling in liob/pandoc_reader, which uses Pandoc as the front end for a static site generator, Pelican; Pelican is written in Python. In this context, I am reluctant to require the Haskell compiler or the Pandoc libraries; the current code uses only the command-line tool.

Now, if I'm writing a filter in other-than-Haskell, I can't get at writeHtmlString, so the best I can do is some kind of AST-to-AST transformation that embeds the metadata in the HTML output, preserving its structure. For instance, I can translate MetaList to BulletList, and MetaMap and the top-level metadata object to DefinitionList, and wrap value types in Plain. Let me give a concrete example of the complex metadata I'm working with, and the output of the transformation I have written:

---
authors:
  - Li, Ninghui
  - Li, Tiancheng
  - Venkatasubramanian, S.
title: "$t$-Closeness: Privacy Beyond $k$-Anonymity and $l$-Diversity"
booktitle:
  shortname: ICDE 2007
  fullname: IEEE 23rd International Conference on Data Engineering, 2007
  url: http://www.computer.org/csdl/proceedings/icde/2007/0802/00/index.html
year: 2007
month: April
pages: 106--115
doi: 10.1109/ICDE.2007.367856
tags: [data privacy, database theory, attribute disclosure,
       $k$-anonymity, $l$-diversity, $t$-closeness]
...

body of document

becomes

<dl>
<dt>authors</dt>
<dd><ul>
<li>Li, Ninghui</li>
<li>Li, Tiancheng</li>
<li>Venkatasubramanian, S.</li>
</ul>
</dd>
<dt>booktitle</dt>
<dd><dl>
<dt>fullname</dt>
<dd>IEEE 23rd International Conference on Data Engineering, 2007
</dd>
<dt>shortname</dt>
<dd>ICDE 2007
</dd>
<dt>url</dt>
<dd>http://www.computer.org/csdl/proceedings/icde/2007/0802/00/index.html
</dd>
</dl>
</dd>
<dt>doi</dt>
<dd>10.1109/ICDE.2007.367856
</dd>
<dt>month</dt>
<dd>April
</dd>
<dt>pages</dt>
<dd>106--115
</dd>
<dt>tags</dt>
<dd><ul>
<li>data privacy</li>
<li>database theory</li>
<li>attribute disclosure</li>
<li><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>k</mi></mrow></math>-anonymity</li>
<li><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>l</mi></mrow></math>-diversity</li>
<li><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>t</mi></mrow></math>-closeness</li>
</ul>
</dd>
<dt>title</dt>
<dd><math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>t</mi></mrow></math>-Closeness: Privacy Beyond <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>k</mi></mrow></math>-Anonymity and <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>l</mi></mrow></math>-Diversity
</dd>
<dt>year</dt>
<dd>2007
</dd>
</dl>
<hr />
<p>body of document</p>

which I would then split at the <hr /> (incidentally, why does -t html5 emit XMLisms?) and parse the top half back into a data structure. This is less than ideal for two reasons. First, parsing HTML is significantly more complicated than parsing JSON as originally requested. There is no way to generate JSON with this approach, because there is no way to direct Pandoc to render the contents of an (ex-) MetaInlines or MetaBlocks as HTML and then quote it for JSON. Second, and closely related, there's no definite break point between the HTML defining the data structure and the HTML of each value. I may be able to patch around that with <span> or something, but it'll never be better than awkward.
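For readers who want to see the shape of the AST-to-AST transformation described above (MetaList to BulletList, MetaMap and the top-level metadata to DefinitionList, values wrapped in Plain), here is a rough Python reconstruction. It is not the code actually used in pandoc_reader; all function names are hypothetical, and the JSON node layouts (including the empty `"c"` on HorizontalRule, matching the AST dump quoted later in this thread) are assumptions about the serialization of that era.

```python
def plain(text):
    """Wrap a bare string as a Plain block holding a single Str inline."""
    return {"t": "Plain", "c": [{"t": "Str", "c": text}]}


def meta_to_blocks(value):
    """Turn one MetaValue (JSON form) into a list of ordinary blocks."""
    tag, content = value["t"], value.get("c")
    if tag == "MetaMap":
        return [map_to_definition_list(content)]
    if tag == "MetaList":
        # A BulletList holds a list of items; each item is a list of blocks.
        return [{"t": "BulletList", "c": [meta_to_blocks(v) for v in content]}]
    if tag == "MetaInlines":
        return [{"t": "Plain", "c": content}]
    if tag == "MetaBlocks":
        return content
    if tag == "MetaBool":
        return [plain("true" if content else "false")]
    if tag == "MetaString":
        return [plain(content)]
    raise ValueError("unexpected metadata node: %r" % tag)


def map_to_definition_list(mapping):
    """Build a DefinitionList with one (term, [definition]) pair per key."""
    items = [
        [[{"t": "Str", "c": key}], [meta_to_blocks(mapping[key])]]
        for key in sorted(mapping)  # sorted, like the HTML shown above
    ]
    return {"t": "DefinitionList", "c": items}


def embed_metadata(meta, blocks):
    """Prepend the metadata as a definition list, then a rule, then the body."""
    rule = {"t": "HorizontalRule", "c": []}
    return [map_to_definition_list(meta), rule] + blocks
```

The weakness discussed above is visible here: once a value becomes a Plain block, nothing in the output marks where the structural HTML ends and the value's own HTML begins.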

zackw commented 9 years ago

Thinking out loud, a potential fix is (1) a new AST node type that means "render what's under this node in output format X and then quote it as a string literal for the surrounding context", (2) some way of generating a custom JSON tree (rather than a literal serialization of the AST). (1) might also be useful for, like, embedding examples of the rendered output in format X in a document of format Y.

The thing I originally asked for seems simpler overall, and easier to implement, though.

mpickering commented 9 years ago

Can you not write a filter which just dumps the JSON to a file? If you then really want them in the same file you can then just cat the metadata dump and the output pandoc produces together.

zackw commented 9 years ago

@mpickering The JSON structure passed to the filter is

[{ "unMeta": {
    "title": {"t":"MetaInlines","c":[
        {"t":"Math","c":[{"t":"InlineMath","c":[]},"t"]},
        {"t":"Str","c":"-Closeness:"},
        {"t":"Space","c":[]},
        {"t":"Str","c":"Privacy"},
        {"t":"Space","c":[]},
        {"t":"Str","c":"Beyond"},
        {"t":"Space","c":[]},
        {"t":"Math","c":[{"t":"InlineMath","c":[]},"k"]},
        {"t":"Str","c":"-Anonymity"},
        {"t":"Space","c":[]},
        {"t":"Str","c":"and"},
        {"t":"Space","c":[]},
        {"t":"Math","c":[{"t":"InlineMath","c":[]},"l"]},
        {"t":"Str","c":"-Diversity"}
    ]},
    "authors": {"t":"MetaList","c":[
        {"t":"MetaInlines","c":[
            {"t":"Str","c":"Li,"},
            {"t":"Space","c":[]},
            {"t":"Str","c":"Ninghui"}
        ]},
        {"t":"MetaInlines","c":[
            {"t":"Str","c":"Li,"},
            {"t":"Space","c":[]},
            {"t":"Str","c":"Tiancheng"}
        ]},
        {"t":"MetaInlines","c":[
            {"t":"Str","c":"Venkatasubramanian,"},
            {"t":"Space","c":[]},
            {"t":"Str","c":"S."}]}
    ]}
    // ...
}},
[/*body of document here */]]

The JSON structure I want is

{
    "title": "<math display=\"inline\"><mrow><mi>t</mi></mrow></math>-Closeness: Privacy Beyond <math display=\"inline\"><mrow><mi>k</mi></mrow></math>-Anonymity and <math display=\"inline\"><mrow><mi>l</mi></mrow></math>-Diversity",
    "authors": [
        "Li, Ninghui",
        "Li, Tiancheng",
        "Venkatasubramanian, S."
    ],
    // ...
}

The only way to get from the first structure to the second is to pass back through Pandoc's HTML generator.

jgm commented 9 years ago

I haven't followed this in detail, but if you know the format of the structure ahead of time, could you just write a custom template? Note that template variable values will be interpreted as Markdown. Templates provide a mechanism for iterating across arrays and for object/property structures.

zackw commented 9 years ago

if you know the format of the structure ahead of time, could you just write a custom template?

I don't know the structure ahead of time; it appears that there is no way to iterate over all available variables, nor discriminate variables by origin, nor to recursively walk an unknown tree structure.

Also, it appears that there is no way to request any sort of syntactic quotation.

lierdakil commented 9 years ago

I am reluctant to require the Haskell compiler or the Pandoc libraries; the current code uses only the command-line tool.

Look, in the vast majority of cases, if pandoc is installed, so is the Haskell runtime. That means that, at the very least, you can run Haskell filters through pandoc itself. It is suboptimal in terms of speed, but since this isn't used for dynamic content generation of any sort, that shouldn't be a big concern.

jgm commented 9 years ago

+++ Nikolay Yakimov [Mar 24 15 07:28 ]:

I am reluctant to require the Haskell compiler or the Pandoc libraries; the current code uses only the command-line tool.

Look, in the vast majority of cases, if pandoc is installed, so is the Haskell runtime.

Many people install pandoc from the binary packages, which don't install a Haskell runtime. Of course, you can always shell out to pandoc in your filter.
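A hedged sketch of what "shelling out to pandoc in your filter" might look like from Python: each MetaInlines or MetaBlocks value is wrapped in a small metadata-free document and rendered by the `pandoc` executable, and the results are assembled into plain JSON-ready data. The helper names are hypothetical, and reusing the incoming document's shape (and `pandoc-api-version`, where present) for the wrapper document is an assumption about what the installed pandoc will accept.

```python
import json
import subprocess


def wrap_like(doc, blocks):
    """Build a metadata-free document with the same JSON shape as `doc`."""
    if isinstance(doc, dict):
        return {"pandoc-api-version": doc["pandoc-api-version"],
                "meta": {}, "blocks": blocks}
    return [{"unMeta": {}}, blocks]


def render_with_pandoc(doc, blocks):
    """Render `blocks` to an HTML fragment by running the pandoc executable."""
    result = subprocess.run(
        ["pandoc", "-f", "json", "-t", "html5", "--mathml"],
        input=json.dumps(wrap_like(doc, blocks)),
        capture_output=True, text=True, check=True)
    return result.stdout.strip()


def flatten_meta(value, render):
    """Convert a MetaValue into plain data, rendering rich text via `render`."""
    tag, content = value["t"], value.get("c")
    if tag == "MetaMap":
        return {k: flatten_meta(v, render) for k, v in content.items()}
    if tag == "MetaList":
        return [flatten_meta(v, render) for v in content]
    if tag == "MetaInlines":
        return render([{"t": "Plain", "c": content}])
    if tag == "MetaBlocks":
        return render(content)
    return content  # MetaString and MetaBool already carry a plain value
```

Inside a filter this would be called as `flatten_meta(v, lambda b: render_with_pandoc(doc, b))` for each top-level metadata value. It starts one pandoc process per rich-text value, which is slow for large metadata blocks.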

jgm commented 9 years ago

The following change to the HTML writer would add a meta-json template variable containing a JSON version of the formatted metadata:

diff --git a/src/Text/Pandoc/Writers/HTML.hs b/src/Text/Pandoc/Writers/HTML.hs
index 53dc931..93834c1 100644
--- a/src/Text/Pandoc/Writers/HTML.hs
+++ b/src/Text/Pandoc/Writers/HTML.hs
@@ -43,6 +43,8 @@ import Text.Pandoc.XML (fromEntities, escapeStringForXML)
 import Network.URI ( parseURIReference, URI(..), unEscapeString )
 import Network.HTTP ( urlEncode )
 import Numeric ( showHex )
+import qualified Data.Aeson as Aeson
+import Text.Pandoc.UTF8 (toStringLazy)
 import Data.Char ( ord, toLower )
 import Data.List ( isPrefixOf, intersperse )
 import Data.String ( fromString )
@@ -194,6 +196,7 @@ pandocToHtml opts (Pandoc meta blocks) = do
                   defField "revealjs-url" ("reveal.js" :: String) $
                   defField "s5-url" ("s5/default" :: String) $
                   defField "html5" (writerHtml5 opts) $
+                  defField "meta-json" (toStringLazy $ Aeson.encode metadata) $
                   metadata
   return (thebody, context)

This could be used with a custom template like

<!--
$meta-json$
-->
$body$

to get what @zackw is looking for.
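On the consuming side, output produced with such a template could be taken apart roughly as follows. This is a sketch assuming the output begins with the comment exactly as laid out in the template above; `split_output` is a hypothetical name. It locates the end of the JSON by decoding it rather than by searching for the first `-->`, so it does not depend on whether that sequence can occur inside a JSON string.

```python
import json


def split_output(text):
    """Split '<!-- {json} -->' followed by the body into (metadata, body)."""
    start = text.index("<!--") + len("<!--")
    # raw_decode does not skip leading whitespace, so do that by hand.
    while text[start].isspace():
        start += 1
    meta, end = json.JSONDecoder().raw_decode(text, start)
    close = text.index("-->", end)
    if text[end:close].strip():
        raise ValueError("unexpected text between the JSON and '-->'")
    return meta, text[close + len("-->"):].lstrip("\n")
```

For example, `meta, body = split_output(html)` leaves `meta` as ordinary Python data and `body` as the HTML fragment.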

So, one possible change to pandoc would be to define a meta-json variable in all writers. Rather than changing all the writers one by one, it would make sense to modify the metaToJSON function. I can see how this would make it easier to integrate pandoc with other things, like static site generators. What do people think?

jgm commented 9 years ago

Better, more general, patch, affecting all writers:

diff --git a/src/Text/Pandoc/Writers/Shared.hs b/src/Text/Pandoc/Writers/Shared.hs
index 800e741..cc9e59d 100644
--- a/src/Text/Pandoc/Writers/Shared.hs
+++ b/src/Text/Pandoc/Writers/Shared.hs
@@ -45,7 +45,8 @@ import Text.Pandoc.Options (WriterOptions(..))
 import qualified Data.HashMap.Strict as H
 import qualified Data.Map as M
 import qualified Data.Text as T
-import Data.Aeson (FromJSON(..), fromJSON, ToJSON (..), Value(Object), Result(..))
+import Data.Aeson (FromJSON(..), fromJSON, ToJSON (..), Value(Object), Result(..), encode)
+import Text.Pandoc.UTF8 (toStringLazy)
 import qualified Data.Traversable as Traversable
 import Data.List ( groupBy )

@@ -67,7 +68,8 @@ metaToJSON opts blockWriter inlineWriter (Meta metamap)
     renderedMap <- Traversable.mapM
                    (metaValueToJSON blockWriter inlineWriter)
                    metamap
-    return $ M.foldWithKey defField baseContext renderedMap
+    let metadata = M.foldWithKey defField baseContext renderedMap
+    return $ defField "meta-json" (toStringLazy $ encode metadata) metadata
   | otherwise = return (Object H.empty)

 metaValueToJSON :: Monad m

zackw commented 9 years ago

I like this as long as it does the Right Thing with complicated quoting cases like

---
title: "`<!-- HTML Comments And You -->`: An \"Informal\" Discussion"
author: Alice & Bob
...

this being only what I could think of off the top of my head, I'm sure there are nastier constructs.

jgm commented 9 years ago

+++ Zack Weinberg [Mar 28 15 15:22 ]:

I like this as long as it does the Right Thing with complicated quoting cases like


---
title: "`<!-- HTML Comments And You -->`: An \"Informal\" Discussion"
author: Alice & Bob
...

this being only what I could think of off the top of my head, I'm sure there are nastier constructs.

It should, because we're using a robust and well tested json library to generate the json.

chriskrycho commented 9 years ago

Adding a :+1: here, because I have very similar needs to those outlined by @zackw, and he and I ended up independently working around the issue elsewhere (see liob/pandoc_reader#3, liob/pandoc_reader#4, and liob/pandoc_reader#5). I also note that integration with other possible static site generators could be a big win, since pandoc is in my experience meaningfully faster than many other implementations. (E.g. it moves at least twice as fast as the standard Python Markdown implementation—I just compared the two on a ~16k-line test file with as close to the same settings for parsing as possible, and it runs in half the time. For a file 10⨉ that size… well, Python Markdown just falls down; it never finished. :stuck_out_tongue:)

bpj commented 9 years ago

Note that you don't need a filter to dump the metadata to a file. All you need is a pandoc template looking like this (call it 'yaml.markdown'):

$if(titleblock)$
$titleblock$
$else$
--- {}
$endif$

then invoke pandoc with

pandoc -t markdown --template=yaml document.MD -o metadata.yaml

and then decode 'metadata.yaml' with your nearest YAML parser!

The $else$ part makes sure you always get an associative array, possibly empty.

chriskrycho commented 9 years ago

@bpj That's true in a general sense, but it doesn't get at the issue here, and it certainly doesn't give you the data back in a format (e.g. JSON) readily transformed or handed around within another application, which is the context which drove @zackw's request (and is my interest as well): both of us are using pandoc to drive Pelican, and are doing a bit of a dance to handle YAML metadata in that context.

jgm commented 8 years ago

I've added meta-json. So now a template with just $meta-json$ will give you the document's metadata in JSON format (formatted according to the writer).
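From a Python caller such as a static site generator plugin, the metadata-only case could be driven roughly like this. It is a sketch, not tested against any particular pandoc release: `read_metadata` is a hypothetical helper, the `run` parameter exists only so the subprocess call can be swapped out, and the writer options (`-t html5 --mathml`) are the ones used earlier in this thread.

```python
import json
import os
import subprocess
import tempfile


def read_metadata(source_path, run=subprocess.run):
    """Return the document's rendered metadata as Python data."""
    # A template consisting only of $meta-json$ leaves the body out entirely.
    with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as tmpl:
        tmpl.write("$meta-json$\n")
    try:
        result = run(
            ["pandoc", "-t", "html5", "--mathml",
             "--template=" + tmpl.name, source_path],
            capture_output=True, text=True, check=True)
    finally:
        os.unlink(tmpl.name)  # always clean up the temporary template
    return json.loads(result.stdout)
```

Getting both the metadata and the body in one run would instead use a template that also contains `$body$`, at the cost of having to separate the two afterwards.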