jgm / pandoc

Universal markup converter
https://pandoc.org

Proposal for changes to pandoc's metadata #851

Closed jgm closed 11 years ago

jgm commented 11 years ago

Motivation

There have been many calls for more flexibility in metadata.

Currently we have

data Meta = Meta [Inline] [[Inline]] [Inline]

where the first parameter is the title, the second the list of authors, and the third the date.

But, in addition to title, author, and date, documents have other metadata (ISBN, abstract, organization, license, etc.). It would be nice to be able to store this information in the document.

In addition, applications that use pandoc (e.g. gitit, hakyll) tend to need metadata. Currently this needs to be stripped off before the files can be processed with pandoc, which is inconvenient.

Pandoc currently allows custom template variables to be set on the command line, but there is no way to set them in a document, because metadata is limited to title, author, and date. If all metadata fields were passed to templates as variables, authors would have great flexibility to create custom solutions to problems. For example, a letter template could have fields for return address, address, salutation, etc.

The proposed changes follow:

Change the Meta type to a simple association list

data Meta = Meta [(String, [Block])]

We no longer privilege title, author, date. A pandoc document can contain indefinitely many metadata items.

Why [Block]? Some fields may take longer values, e.g. an abstract can be multiple paragraphs, so [Block] is more appropriate than [Inline].

Why an association list instead of a map? We want to allow several entries with the same key (e.g. multiple "author" entries). The lists typically won't be long enough to affect performance.
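To illustrate the trade-off just described, here is a minimal sketch, not pandoc code. It uses plain strings in place of [Block], and the names Entry, sampleMeta, lookupAll, and asMap are hypothetical.

```haskell
import qualified Data.Map as M

-- Hypothetical, simplified metadata: plain strings stand in for [Block].
type Entry = (String, String)

sampleMeta :: [Entry]
sampleMeta =
  [ ("title", "Our Book")
  , ("author", "Laura Smith")
  , ("author", "James Munoz")
  ]

-- All values stored under a key, in document order.
lookupAll :: String -> [Entry] -> [String]
lookupAll key entries = [v | (k, v) <- entries, k == key]

-- Converting to a Map keeps only one value per key: M.fromList retains
-- the last value it sees for a duplicated key.
asMap :: M.Map String String
asMap = M.fromList sampleMeta
```

With the association list, lookupAll "author" sampleMeta returns both authors. The map built from the same entries holds only "James Munoz" under "author".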

This would be a major API change (and require a new version of pandoc-types). We'd need to increment to 1.12. Perhaps, to ease the transition, it would be useful to define some utility functions -- docTitle, docAuthors, docDate -- that extract metadata in more or less the form it had in the old Meta type.
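The utility functions mentioned above might look roughly like the following. This is only a sketch under stated assumptions: Inline and Block are cut-down stand-ins for the real pandoc types, and the helpers blocksToInlines and lookupAll are hypothetical names.

```haskell
-- Simplified stand-ins for pandoc's Inline and Block types (assumption:
-- the real types have many more constructors).
data Inline = Str String | Space deriving (Eq, Show)
data Block = Plain [Inline] | Para [Inline] deriving (Eq, Show)

data Meta = Meta [(String, [Block])] deriving (Eq, Show)

-- Flatten blocks to inlines. Paragraph boundaries are lost, which is
-- acceptable for short fields such as title, author, and date.
blocksToInlines :: [Block] -> [Inline]
blocksToInlines = concatMap go
  where go (Plain ils) = ils
        go (Para ils)  = ils

-- Every value recorded under a key, in document order.
lookupAll :: String -> Meta -> [[Block]]
lookupAll key (Meta kvs) = [bs | (k, bs) <- kvs, k == key]

-- The first "title" record, or [] if there is none.
docTitle :: Meta -> [Inline]
docTitle = concatMap blocksToInlines . take 1 . lookupAll "title"

-- One list of inlines per "author" record.
docAuthors :: Meta -> [[Inline]]
docAuthors = map blocksToInlines . lookupAll "author"

-- The first "date" record, or [] if there is none.
docDate :: Meta -> [Inline]
docDate = concatMap blocksToInlines . take 1 . lookupAll "date"
```

Taking only the first "title" and "date" record is a design choice of this sketch. The proposal does not say how repeated titles or dates should be handled.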

Retrofit pandoc title blocks to use the new Meta type

Pandoc title blocks will now just add records to the association list. The title will add ("title", [Plain inlines]), the authors will add multiple records ("author", [Plain inlines]), and the date will add ("date", [Plain inlines]).
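The mapping from a title block to association-list records could be sketched as follows. OldMeta, fromTitleBlock, and the cut-down Inline and Block types are hypothetical names for illustration. Skipping empty fields is an assumption, not something the proposal specifies.

```haskell
-- Simplified stand-ins for pandoc's types (assumption).
data Inline = Str String | Space deriving (Eq, Show)
data Block = Plain [Inline] deriving (Eq, Show)

-- The old three-field metadata: title, authors, date.
data OldMeta = OldMeta [Inline] [[Inline]] [Inline] deriving (Eq, Show)

-- The proposed association list.
data Meta = Meta [(String, [Block])] deriving (Eq, Show)

-- Turn a parsed title block into records: one "title", one "author"
-- record per author, and one "date".
fromTitleBlock :: OldMeta -> Meta
fromTitleBlock (OldMeta title authors date) =
  Meta $ field "title" title
      ++ concatMap (field "author") authors
      ++ field "date" date
  where
    -- Assumption: an empty field adds no record at all.
    field _   []  = []
    field key ils = [(key, [Plain ils])]
```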

So existing documents should work as before. Pandoc title blocks will continue to be supported, and can be used in conjunction with the YAML metadata described below.

Retrofit MMD title blocks to use the new Meta type

This should be straightforward. (Do we want to impose some discipline on the key names, to ensure that they can all be used in templates? If so, we might need some conversion scheme. Note also that MMD metadata is unstructured strings.)

Change writers to set template variables for all metadata fields

Currently only predefined variables are set.

Note: variables specified on the command line should override those set in the document. (Or should they?)
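If command-line variables do take precedence (the question above is still open), the merge could be sketched like this. The names Variables and templateVariables are hypothetical, and values are plain strings for simplicity.

```haskell
-- Template variables as key/value pairs (simplified to strings).
type Variables = [(String, String)]

-- Command-line variables come first. A document field is kept only if
-- no command-line variable uses the same key.
templateVariables :: Variables -> Variables -> Variables
templateVariables cli doc =
  cli ++ [kv | kv@(k, _) <- doc, k `notElem` map fst cli]
```

For example, templateVariables [("title", "CLI")] [("title", "Doc"), ("author", "A")] yields [("title", "CLI"), ("author", "A")].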

Add a new, flexible YAML metadata format in pandoc markdown

YAML metadata will look like this:


---
title: "Our Book:  A Bunch of Words"
author:
  - Laura Smith
  - James Munoz
abstract: |
  This is a description of our book.

  It has two paragraphs
code_:  '*(<#$(*&(*#$*&'
structured:
  data:
    - is
    - fun
...

Metadata sections may occur anywhere in the document.

A document may contain multiple metadata sections.

Contents of metadata sections will be parsed as YAML (using Data.Yaml). The metadata must be a valid YAML object. Any valid YAML object will be parsed, but only certain fields will create metadata records (see below). YAML escaping rules are in effect. Hence, for example, we need quotes around the title above, because it contains a colon. The quotes are not part of the title.

Which fields create metadata records?

This system allows arbitrary structured metadata to be put into the document. (One could, for example, include a whole bibtex bibliography.) However, only fields that might be used in the rendition of the document in an output format are actually stored in the pandoc document. The rest are ignored by pandoc but may be used by other programs (gitit, hakyll, etc.).

Sample implementation

Here is a test program. Feed it some YAML and it will output something close to what I'm envisioning for the corresponding pandoc metadata:

import qualified Data.Yaml as Yaml
import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M
import qualified Data.HashMap.Strict as H
import qualified Data.Text as T
import qualified Data.Vector as V
import Control.Monad
import Text.Pandoc hiding (Meta)

data Meta = Meta [(String, [Block])] deriving Show

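instance Yaml.FromJSON Meta where
  -- Fold over the top-level YAML object, collecting (key, blocks) records.
  parseJSON (Yaml.Object v) = return $ Meta $ reverse $ H.foldlWithKey' f [] v
    where f acc key val
            -- Fields whose key ends in an underscore create no record.
            | (T.pack "_") `T.isSuffixOf` key = acc
            | otherwise = case val of
                              (Yaml.String t)  -> (toKey key,
                                                   toBlocks $ T.unpack t):acc
                              (Yaml.Number n)  -> (toKey key,
                                                   [Plain [Str $ show n]]):acc
                              (Yaml.Bool True) -> (toKey key,
                                                   [Plain [Str "true"]]):acc
                              (Yaml.Bool False) -> acc
                              -- Each array element becomes its own record under the same key.
                              (Yaml.Array xs)  -> V.foldl (\acc' x -> f acc' key x) acc xs
                              -- Nested objects and nulls create no record.
                              _ -> acc
          -- String values are parsed as markdown.
          toBlocks x = let (Pandoc _ bs) = readMarkdown def x
                       in  bs
          -- Keys are normalized to lower case.
          toKey = T.unpack . T.toLower
  parseJSON _ = mzero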

main = do
  inp <- B.getContents
  let res :: Maybe Meta
      res = Yaml.decode inp
  print res

Note: This issue supplants #419, which contains some useful discussion and links.

pvorb commented 11 years ago

Nice to see that there is finally a solution that will work for everybody. Thank you for the long-term effort on this.

nichtich commented 11 years ago

Is there, or will there be, a command-line argument to read metadata from an existing YAML/INI/JSON file in addition to the Markdown header?

cboettig commented 11 years ago

This looks fantastic. Looking forward to it :+1:

jgm commented 11 years ago

These changes have mostly been implemented, with a few differences.

The metadata type is now structured, allowing nested lists and field/value mappings, instead of a simple association list.
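A structured metadata type of this kind could be sketched as below. This is a simplified illustration, not the definition shipped in pandoc-types: the constructor names are only loosely modelled on it, strings stand in for formatted text, and lookupPath and sample are hypothetical.

```haskell
import qualified Data.Map as M

-- A value is a nested mapping, a list, a boolean, or a string.
data MetaValue
  = MetaMap (M.Map String MetaValue)
  | MetaList [MetaValue]
  | MetaBool Bool
  | MetaString String
  deriving (Eq, Show)

newtype Meta = Meta (M.Map String MetaValue) deriving (Eq, Show)

-- Follow a path of keys through nested mappings.
lookupPath :: [String] -> Meta -> Maybe MetaValue
lookupPath [] _ = Nothing
lookupPath (k:ks) (Meta m) = M.lookup k m >>= go ks
  where
    go [] v = Just v
    go (k':ks') (MetaMap m') = M.lookup k' m' >>= go ks'
    go _ _ = Nothing  -- path continues, but the value is not a mapping

-- The YAML example from the proposal, minus the abstract and code_ fields.
sample :: Meta
sample = Meta $ M.fromList
  [ ("title", MetaString "Our Book:  A Bunch of Words")
  , ("author", MetaList [MetaString "Laura Smith", MetaString "James Munoz"])
  , ("structured", MetaMap (M.fromList
      [("data", MetaList [MetaString "is", MetaString "fun"])]))
  ]
```

Here lookupPath ["structured", "data"] sample finds the nested list, which the flat association list from the original proposal could not represent.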

Also, YAML metadata is currently allowed only at the beginning of the file.

Closing this issue. Requests for changes should be directed to pandoc-discuss. Bugs should be reported as separate issues here.

towolf commented 11 years ago

Just tried it out with some correspondence letters. It works just as it should.

Thank you

paul-r-ml commented 11 years ago

Hi John, I just found this page after following links from #83. This looks great, thank you very much. I have a project under way that will need this new metadata feature. Do you have a release date in mind, by any chance? All the best,

jgm commented 11 years ago

I plan a release some time in August.

cboettig commented 11 years ago

Hi John,

Thanks much for this change, the yaml metadata is a very powerful addition to pandoc.

Would there be any chance you would consider using --- for both the opening and closing delimiters of the YAML frontmatter, as is already done by Jekyll and now supported in GitHub's previews: https://github.com/blog/1647-viewing-yaml-metadata-in-your-documents ? Or is there a good reason not to use this format?

It would be very nice for pandoc metadata to display properly on Github, be compatible with Jekyll's metadata, and more generally build momentum around a standard notation for this. (It seems the perpetual curse of markdown that no two people will implement the same extension in the same way.)

Thanks for your creation and continued support of such a fantastic piece of software.

jgm commented 11 years ago

You can already use either --- or ... for the closing delimiter.
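As a rough illustration of the delimiter rule (not pandoc's actual parser), a header at the start of a file could be split off like this. splitYamlHeader is a hypothetical name. The sketch assumes the input is already split into lines and that delimiter lines carry no trailing whitespace.

```haskell
-- Split a leading YAML metadata section from the rest of the document.
-- The section opens with "---" on the first line and closes with
-- either "---" or "...". Returns (yaml lines, body lines).
splitYamlHeader :: [String] -> Maybe ([String], [String])
splitYamlHeader ("---" : rest) =
  case break isClosing rest of
    (yaml, _closing : body) -> Just (yaml, body)
    _                       -> Nothing  -- no closing delimiter found
  where
    isClosing l = l == "---" || l == "..."
splitYamlHeader _ = Nothing  -- document does not start with a header
```

Both splitYamlHeader ["---", "title: T", "...", "Body"] and the same input with "---" as the closing line give Just (["title: T"], ["Body"]).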

cboettig commented 11 years ago

Thanks! And my apologies for missing that in the documentation.

Carl Boettiger UC Santa Cruz http://carlboettiger.info/