jgm / djot

A light markup language
https://djot.net
MIT License

Metadata #35

Open jgm opened 2 years ago

jgm commented 2 years ago

Should there be a built-in format for metadata, or should that be considered distinct from the markup syntax?

If so, what?

Do we need structured keys such as YAML provides? Would be nice to avoid the complexity of YAML, but otherwise YAML is nice for this. Maybe some simplified subset of YAML.

uvtc commented 2 years ago

Is the purpose of the metadata block to set variables in the standalone output doc template? (I'm thinking here of my rough understanding of how Pandoc works.)

My understanding is that YAML is a rather complex format. What about TOML?

jgm commented 2 years ago

I don't much like TOML for this purpose; it requires you to quote strings, and it makes it very inconvenient to represent e.g. an array of references.

dumblob commented 2 years ago

YAML has bad handling of anything including newlines. So "simplified subset of YAML" would not solve the issue TOML does solve. But I agree that quoting stuff is annoying.

One thing that I am missing is the ability to spread metadata across the document: in longer documents one loses context if all the metadata has to go at the very beginning. So my requirement would be to support multiple metadata blocks instead of just one.

Btw. how about just "reusing" existing formatted blocks with a reserved format keyword (which might be a "symbol" instead of Latin text)? Instead of cpp one would use perhaps > or # or whatever, and done.

bpj commented 2 years ago

I needed a way to load configuration into my Pandoc filters without needing to "revert" Pandoc metadata trees to "plain" data. I couldn't use an existing YAML parser and soon despaired about writing a parser for full YAML. What I did succeed in writing was a parser for a basic subset of flow-style YAML without unquoted strings and tags, so basically JSON plus YAML-style hex integers and single and double quoted strings, giving the main advantages of YAML over JSON (including getting rid of the odious surrogate pair escapes!) without the significant indentation. The lpeg/re grammar isn't terribly large (copied from my Moonscript file):

yson_re = re.compile [===[ -- @start-re
  input <- value / not_a_value
  value <- (
      %s*
      ( string
      / number
      / array
      / object
      / 'true'  -> true
      / 'false' -> false
      / 'null'  -> null
      )
      %s*
    / %s* not_a_value
    )
  string <- ( single / double )
  single <- {| "'" ( { [^']+ } / "'" { "'" } )* "'" |} -> concat
  double <- {|
      '"' (
        { [^"\]+ }
      / { '\' ["\/bfnrt0aveN_LP %t] } -> esc
      / ( '\x' { %x^2 } -> hex_char
        / '\u' { %x^4 } -> hex_char
        / '\U' { %x^8 } -> hex_char
        )
      / bad_esc
      )* '"'
    |} -> concat
  number <- {
      '-'?
      ( '0x' %x+
      / ( '0' / [1-9] %d* )
        ( '.' %d+ )?
        ( [eE] [-+]? %d+ )?
      )
    } -> tonumber
  object <- {| '{' %s* '}' / '{' kv ( ',' kv )* '}' |} -> object
  kv <- {|
      ( %s* !string not_a_key )?
      %s* {:k: string :} %s* ( !':' bad )? ':'
      %s* {:v: value :} %s* ( ![,}] bad )?
      ( &',' !( ',' %s* string ) bad )?
    |}
  array <- {|
      '[' %s* ']'
    / '[' {| value |} ( ![],] bad )?
      ( ',' {| value |} ( ![],] bad )? )* ']'
    |} -> array
  bad_esc <- {|
      {:pos: {} :}
      {:msg: { '\' . } -> 'Unknown or invalid escape "%1"' :}
    |} => fail
  not_a_key <- {|
      %s* {:pos: {} :}
      {:msg: { %S* } -> 'Expected key (string) near "%1"' :}
    |} => fail
  not_a_value <- {|
      %s* {:pos: {} :}
      {:msg: { %S* } -> 'Expected value near "%1"' :}
    |} => fail
  bad <- {|
      %s* {:pos: {} :}
      {:msg: { %S } -> 'Unexpected "%1"' :}
    |} => fail
-- @stop-re ]===],

jgm commented 2 years ago

reStructuredText does something interesting here. They re-use definition list syntax; when a definition list occurs right after the document title, it is interpreted as metadata (IIRC).

Nice thing is that we already have a nice readable syntax for that.

bpj commented 2 years ago

@jgm wrote:

reStructuredText does something interesting here. They re-use definition list syntax;

Nice!

when a definition list occurs right after the document title, it is interpreted as metadata (IIRC).

Can't say I like that, since it should be legal to place a definition list as the first thing in the document, so some delimiter (three or more of some punctuation character!) would seem in order. Are ~~~ or +++ taken?

Nice thing is that we already have a nice readable syntax for that.

Yes. Some questions:

-   Would multiple "definitions" become a list?

-   and a nested "definition list" a nested mapping?

-   Would values be verbatim or be parsed as markup? If the latter there should IMO

1.  be a way to mark a value as a raw string, maybe

: raw

`This is a simple raw string`{=}

: more raw

```=
This is a multi-line
raw string
```

2.  Be a way to mark a nested definition list as an actual definition list in the value, maybe by giving it an attribute block, which may contain just a comment.

-   Would/could a bibliography (cf. #32) be included in metadata? I think it should also use definition list syntax but have its own block delimiter (maybe `@@@` if `@` marks a reference as such).

-   Might it be possible to store values which look like numbers as numbers? In Lua terms `val = tonumber(val) or val`.

-   Might it be possible to have metadata contain booleans, and if so how would they be represented?
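
The number question above (in Lua terms `val = tonumber(val) or val`) and the boolean question could be sketched in Python roughly as follows. This is only an illustration: the function name `coerce_scalar` and the convention that a bare `true`/`false` becomes a boolean are assumptions, not anything djot specifies.

```python
import re

# Decimal numbers only: optional sign, integer part, optional fraction and exponent.
_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?")


def coerce_scalar(text):
    """Return a bool, int, or float if the text looks like one, else the text itself."""
    stripped = text.strip()
    # Hypothetical convention: bare true/false are read as booleans.
    if stripped in ("true", "false"):
        return stripped == "true"
    if _NUMBER.fullmatch(stripped):
        # Plain digit strings stay integers; a fraction or exponent gives a float.
        return int(stripped) if stripped.lstrip("-").isdigit() else float(stripped)
    # Anything else (dates, names, zero-padded codes) is left as a string.
    return text
```

A strict pattern is used instead of trying `int()`/`float()` directly, because those built-ins also accept strings such as `inf` or `1_000`, which is probably not wanted for metadata.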

uvtc commented 2 years ago

Are ~~~ or +++ taken?

~~~ currently works as a delimiter for code blocks.

uvtc commented 2 years ago

Are ~~~ or +++ taken?

I think the +++ would work well for metadata. It's a good punctuation character to use for a fence. It's not terribly pretty, but that's ok since metadata blocks are not terribly common, and should probably draw attention when they are present. And the + sign makes me think of something that's being added (here, the metadata).

marrus-sh commented 2 years ago

my current (Makefile‐based) workflow involves cat-ing a number of YAML files onto the front of a Markdown document prior to it being read in by Pandoc. i’m not too attached to YAML as a format, but it would be nice to support append‐only solutions for providing metadata (i.e., ones which don’t require any processing of the file itself). this means:

nested metadata is useful in my experience for namespacing, although

foo:
  bar: etaoin
  baz: shrdlu

can usually be represented as

foo-bar: etaoin
foo-baz: shrdlu

supporting lists/arrays is more important, as they are more difficult to represent through alternate means
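
The flattening described above (nested `foo: {bar, baz}` becoming `foo-bar` and `foo-baz`) can be sketched in Python as follows. The helper name `flatten_keys` and the choice of `-` as separator are assumptions for illustration; lists are deliberately left untouched, since they are the part that is hard to express by key naming alone.

```python
def flatten_keys(data, prefix="", sep="-"):
    """Flatten nested dicts into one level, joining key paths with `sep`."""
    flat = {}
    for key, value in data.items():
        path = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into mappings, carrying the key path along.
            flat.update(flatten_keys(value, path, sep))
        else:
            # Lists and scalars are kept as-is; only mappings are flattened.
            flat[path] = value
    return flat
```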

dbready commented 1 year ago

In case there is still doubt about the topic, I am highly in favor of document metadata being within the document.

Is there some reason that the comment character could not be co-opted to serve as a docstring for metadata? Comment block at the start of the document can contain whatever syntax is chosen to define key:values. In Rust, a // is a standard comment, but a /// notes a docstring, giving a cheap way to detect it. Then again, I believe many sins have been committed by utilizing comment blocks for data.

Anyway, big fan of the project, and I am waiting on the sidelines for the eventual release.

matklad commented 1 year ago

Another option -- we already have syntax to associate arbitrary metadata with elements: attribute {.foo #bar baz="quux"} syntax. We just don't have a nice way to attach that to the document as a whole, but I think we can do something like "if the doc starts with attributes and they are followed by a blank line, the attributes belong to the document's node":

{
  author="matklad"
  date="2022-11-03"
}

# Consider using Djot for your next presentation
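
A rough Python sketch of the rule proposed above ("if the doc starts with attributes and they are followed by a blank line, the attributes belong to the document"). It is not how any djot implementation works: the function name `split_document_attributes` is made up, and only the `key="value"` form is handled (no `.class` or `#id`).

```python
import re

# One key="value" pair; the value may contain backslash-escaped characters.
_PAIR = re.compile(r'\s*([A-Za-z_:][\w:.-]*)="((?:[^"\\]|\\.)*)"')
# The closing brace must be followed by a blank line.
_CLOSING = re.compile(r"\s*\}[ \t]*\n[ \t]*\n")


def split_document_attributes(source):
    """Return (metadata, body); metadata is empty if there is no leading block."""
    if not source.startswith("{"):
        return {}, source
    meta, pos = {}, 1
    while True:
        pair = _PAIR.match(source, pos)
        if pair is None:
            break
        key, raw = pair.groups()
        meta[key] = re.sub(r"\\(.)", r"\1", raw)  # drop the escaping backslashes
        pos = pair.end()
    closing = _CLOSING.match(source, pos)
    if closing is None:
        # No blank line after the block: treat it as ordinary element attributes.
        return {}, source
    return meta, source[closing.end():]
```

With the example document above, this returns the `author` and `date` strings as metadata and the heading as the body.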

jgm commented 1 year ago

{
  author="matklad"
  date="2022-11-03"
}

One beautiful thing about this is that (with the addition of a single comma) it's a valid Lua table. Not that that matters. But I suggested a metadata format like this on markdown-discuss 15 years ago.

However, I think it's important to consider what types of data will go into the metadata fields. Our attributes are just strings. But string content isn't adequate for metadata. E.g., titles will often contain formatting like emphasis, and abstracts can even contain paragraphs and lists.

matklad commented 1 year ago

E.g., titles will often contain formatting like emphasis, and abstracts can even contain paragraphs and lists.

My gut response here would be to leave these kinds of metadata to the processors. Eg,

# Title With _Inlines_

::: abstract

some table or what not

::: 

and let the specific renderer interpret abstract as metadata, and pull the title in there as well.

dbready commented 1 year ago

One beautiful thing about this is that (with the addition of a single comma) it's a valid Lua table

I love having a way to serialize data without a new bespoke syntax. One nice thing about Markdown documents that embed YAML/TOML in the preface is that I can easily read/export that format without a new parser. Lua tables (with nil) feels great.

bpj commented 1 year ago

I like the idea about using attribute syntax a lot, but less so the idea that it be a Lua table. Would that mean that Lua escapes are legal in the string? I assume \<punct> escapes are already legal in attributes, while Lua only supports \" \' \\, and what about \n and the like? In fact Lua table syntax isn't all that portable: you do need e.g. a JSON library to exchange data with other languages.

jgm commented 1 year ago

Nobody wants to put an abstract into something like a JSON string, escaping newlines etc. One nice thing about a Lua table is that you actually could do

{
  abstract = [[This is my
abstract.

It has multiple paragraphs.]]
}

chrisjsewell commented 1 year ago

Heya, just my two cents 😅

I think it might be helpful to compare a representative "in the wild" Markdown front-matter.

I feel YAML is certainly the most "readable", but this obviously comes with the unfortunate over-complexities for parsing. Perhaps a subset of YAML would be nice, removing some of the more problematic features, as in https://hitchdev.com/strictyaml/features-removed/ 🤔

YAML

version: 1
title: My Document
author:
- name: Author One
  affiliation: University of Somewhere
- name: Author Two
  affiliation: University of Nowhere
abstract: |
    This is my very,
    very, very, long abstract...
toc: true
format: 
  html: 
    # some comment ...
    code-fold: true
    html-math-method: katex
  pdf: 
    geometry: 
    - top=30mm
    - left=20mm

TOML

version = 1
title = "My Document"
abstract = """This is my very,
very, very, long abstract...
"""
toc = true

[[author]]
name = "Author One"
affiliation = "University of Somewhere"

[[author]]
name = "Author Two"
affiliation = "University of Nowhere"

[format.html]
# some comment ...
code-fold = true
html-math-method = "katex"

[format.pdf]
geometry = [ "top=30mm", "left=20mm" ]

Lua Table

{
  version = 1,
  title = "My Document",
  author = {
    {
      name = "Author One",
      affiliation = "University of Somewhere"
    },
    {
      name = "Author Two",
      affiliation = "University of Nowhere"
    }
  },
  abstract = [[
This is my very,
very, very, long abstract...
]] ,
  toc = true,
  format = {
    html = {
      -- some comment...
      ["code-fold"] = true,
      ["html-math-method"] = "katex"
    },
    pdf = {
      geometry = { "top=30mm", "left=20mm" }
    }
  }
}

JSON

(no comments allowed)

{
  "version": 1,
  "title": "My Document",
  "abstract": "This is my very,\nvery, very, long abstract...\n",
  "toc": true,
  "author": [
    {
      "name": "Author One",
      "affiliation": "University of Somewhere"
    },
    {
      "name": "Author Two",
      "affiliation": "University of Nowhere"
    }
  ],
  "format": {
    "html": {
      "code-fold": true,
      "html-math-method": "katex"
    },
    "pdf": {
      "geometry": [
        "top=30mm",
        "left=20mm"
      ]
    }
  }
}

dbready commented 1 year ago

If leaning on an existing format, the chief benefit is being able to read/write document metadata without a bespoke parser. Is StrictYAML codified where this would be an option in other languages? Similar problem for JSON – I think supporting comments should be a goal, but most JSON parsers do not support a comment syntax. Perhaps JSON5 is standardized enough to be considered?

Then again, djot is an entirely new format which already requires a custom parser, but it would be nice to get the metadata formatting for free.

mcookly commented 1 year ago

Nobody wants to put an abstract into something like a JSON string, escaping newlines etc. One nice thing about a Lua table is that you actually could do

{
  abstract = [[This is my
abstract.

It has multiple paragraphs.]]
}

If the metadata is a Lua table, would the parser be able to evaluate functions within it? If so, this might be a great feature for things like datetime or time-based UUIDs. I use markdown + YAML a lot for zettelkasten notes and academic writing (with pandoc); functional metadata could really extend a text file's use cases.

Also, I just stumbled on this project a few days ago and love its potential and vision! Keep up the awesome work!

dbready commented 1 year ago

1) I do not like the idea of executable code in the document. Use cases of that nature seem more appropriate to an extension mechanism. If someone wants to embed a block of code in the front-matter and evaluate it, that should be possible, but not the default. 2) While Lua is the current implementation and being discussed as a serialization format, I do not expect Lua semantics to carry through. That is, would a Python/Javascript/Rust djot parser have to embed Lua so as to properly render a document?

bpj commented 1 year ago

I do not like the idea of executable code in the document.

Me neither, at least not by default. It might be somewhat less scary if executed in a custom environment insulated from the file system, but that might be severely limiting when you cannot load modules. An alternative might be a custom variable interpolation or even template system with limited capabilities. I have written such a processor for MoonScript/Lua but it uses Lpeg/re and as such is not appropriate for djot. Before Pandoc included lpeg/re in its Lua API I had written a parser in pure MoonScript/Lua but it was a lot of code: 700+ lines, a whole parser implementation of its own. With lpeg/re I'm down to about 300 lines not counting what is done by the lpeg/re modules, which still is at the upper bound for what I'm comfortable with inlining into a Pandoc filter. That includes a mechanism for pluggable functions and some default functions, which make up around a third of the code. I usually add around 20-60 lines of extra functions and variable data, and that's a MoonScript class, so I'm back at some 700 lines of Lua code, plus dependency on lpeg/re.

mcookly commented 1 year ago

Leaving executable code as an extension makes sense. And if djot's parsers are moving away from lua as @dbready mentioned, embedding lua just to read metadata seems extraneous. I don't think any of the other common metadata formats allow for code execution natively, and they probably prevent this for good reason.

If metadata code execution is left to the program, then you can just pass in code through the program's custom metadata field, like pandoc's header-includes. And if djot will be adding its own native serialization format, I assume it could allow passing in code blocks / inline code through the metadata. Either way, code is not directly executed when rendering the document.

tmke8 commented 1 year ago

There's also Hjson which looks like this:

version: 1
title: My Document
abstract:
  '''
  This is my very,
  very, very, long abstract...

  '''
toc: true
author: [
  {
    name: Author One
    affiliation: University of Somewhere
  },
  {
    name: Author Two
    affiliation: University of Nowhere
  }
],
format: {
  # some comment ...
  html: {
    code-fold: true,
    html-math-method: katex
  },
  pdf: {
    geometry: [
      "top=30mm", "left=20mm"
    ]
  }
}

It's basically json, but it doesn't require quoting keys and it has comments and nice multi-line strings.

mcookly commented 1 year ago

Another potential choice is NestedText. It's designed to be simple to parse yet still humanly readable (based on YAML). Here's an example:

version: 1
title: My Document
abstract:
  > This is my very,
  > very, very, long abstract...
toc: true
author:
  -
    name: Author One
    affiliation: University of Somewhere
  -
    name: Author Two
    affiliation: University of Nowhere
format:
  # Some comment ...
  html:
    code-fold: true
    html-math-method: katex
  pdf:
    geometry: [ "top=30mm", "left=20mm" ]

It only has three types: dictionaries, lists, and strings. There's even a more simplified version.

dbready commented 1 year ago

Trying to think more holistically, an eventual goal of this markup is that non-programmers could adopt it in various places: blogs, academic papers, forums, etc. In which case, using an existing JSON/YAML/TOML format is a disadvantage: for a layman, it becomes a bespoke “header metadata” format different from the rest of the djot markup.

From the angle of minimizing language size, I am in favor of matklad’s suggestion to use the existing djot attribute syntax. Less for a user to learn and easier to implement a parser.

bpj commented 1 year ago

If existing djot syntax is to be used, which I think is a good idea, it is best to use definition/(un)ordered list syntax so that hierarchical structures are possible, for example multiple authors as a bullet list and the name/affiliation/email of each as a definition list.

ffel commented 1 year ago

I'm very much in favour of metadata in djot documents. In pandoc I use title, author, date, and lang nearly everywhere. Often I add references local to one document (visited web pages).

My two cents (and sort of mentioned elsewhere): I suspect native definition lists will do, possibly wrapped inside a meta (or perhaps even djot?) div:

::: meta
title
:  Title of document
author
:  Author A
:  Author B
:::

When using a designated div type (like meta above), it becomes possible not only to add a metadata block at the top of the document but also to add metadata in later parts of the document (perhaps, again, the citation information of a visited web site).

jgm commented 1 year ago

This probably doesn't affect what you want to say, but that isn't djot definition list syntax!

tbdalgaard commented 1 year ago

Yes that is one of my biggest issues with Pandoc. I like the idea of templates, but from a non programmer's perspective I never got into templates, so including metadata directly into documents is much appreciated here.

bpj commented 1 year ago

@tbdalgaard templates in Pandoc have to do with metadata only in as much as you can access metadata values from templates, but you can notably also access metadata from filters and use that to either insert metadata into documents or to configure filters. The original way to define metadata in Pandoc was through YAML blocks in the document body. Later we got the --metadata-file=YAMLFILE option and later still the metadata: section in defaults files. I'm not sure how getting metadata into Pandoc with metadata files/default files works with the new djot.js/JSON workflow. Hopefully it works. @jgm?

jgm commented 1 year ago

Yes, --metadata-file should work with pandoc -f json; however, the contents will be read as pandoc markdown, not djot.

matklad commented 1 year ago

I realized that there's a quite syntactically nice way to embed meta in existing djot:

# My Document
: author = Alex Kladov
: highlight-code
: highlight-theme = GitHub
: abstract

  Lorem Ipsum Dolores

Bla Bla Bla

You could say that : key = value isn't actually djot syntax, but it needn't be! If there's a filter which turns first definition list after title into meta, it can also split dt's on = into a key and a value
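
The filter step described above (split each definition term on `=` into a key and a value) might look like this in Python. The input is a hypothetical, simplified list of `(term, definition)` string pairs, not the real djot AST, and treating a bare term with no body as a `True` flag is an assumption.

```python
def parse_meta_terms(terms):
    """Turn definition-list terms such as 'author = Alex Kladov' into a dict."""
    meta = {}
    for term, definition in terms:
        key, sep, value = term.partition("=")
        key = key.strip()
        if sep:
            # 'key = value' on the term line itself.
            meta[key] = value.strip()
        elif definition:
            # No '=' in the term: the definition body carries the value.
            meta[key] = definition
        else:
            # Bare term with no body, e.g. 'highlight-code', acts as a flag.
            meta[key] = True
    return meta


example = parse_meta_terms([
    ("author = Alex Kladov", ""),
    ("highlight-code", ""),
    ("highlight-theme = GitHub", ""),
    ("abstract", "Lorem Ipsum Dolores"),
])
```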

jgm commented 1 year ago

Nice.

ffel commented 1 year ago

I like this idea, with one minor exception: with this proposal, the h1-level section heading becomes the document title.

I'd like to move this definition area to the top of the document with an explicit title field for the document title.

With this move it is more natural to use h1 level sections to split your document in major parts.


matklad commented 1 year ago

There’s https://github.com/jgm/djot/discussions/130 which proposes dedicated syntax for document titles. Everyone except me seems to agree that the title should just be a metadata field, but I still just don’t see that personally :) _Obviously_ the title is the element you start your doc with, both in the source code and in stand-alone HTML (the title goes to both <title> and h1) :)

jgm commented 1 year ago

reStructuredText has a convention that the first heading sets the title, and a "field list" after it is treated as metadata (IIRC): e.g.,

===================
Pandoc User’s Guide
===================

:Author: John MacFarlane
:Date: August 22, 2022

We don't exactly have a "field list" in djot, but perhaps we could/should steal the concept: https://docutils.sourceforge.io/0.4/docs/ref/rst/restructuredtext.html#field-lists

bpj commented 1 year ago

I fail to see the advantage of this :key = value over a regular definition list whose children (“definitions”) may be anything, not just one-line strings. You will run into the same extensibility problem as the INI file format and, eventually, similar clunky solutions (I foresee people doing things like foo.bar.bqz = quux) and letting renderers sort it out, which is no good. It’s better to build in the possibility of an hierarchical structure (and thus namespaces) from the start, and regular lists which can be nested is clearly the way to do it. It hopefully also will avoid the possible need to quote values which is a source of irritation in INI because in most INI variants you can’t do that, and multi-line values don’t become a problem. At one point I wrote a parser for INI with section syntax in pure Lua and it wasn’t fun because of the idiotic way hierarchical structures are expressed in that format, because it’s easy to confuse branches and leaves in that syntax; let’s not fall into that trap!

:::meta
: author

  - : name

      Libero Sint

    : email

      maxime@example.org

  - : name

      Officia Ut

    : email

      id@blanditiis.example.com

  - : name

      Neque Ea

    : email

      eum@reiciendis.example.com
:::

(I hope I got my lorem generator to produce correct djot definition list syntax. You get my idea!)

It may be more whitespace than some people like, but it uses existing djot syntax in an extensible way, which is key.

Obviously lists-as-meta could (probably should) have some additional restrictions, such as definitions/values either containing just a nested list or a string which is treated as a plain string rather than being parsed into a list of blocks/inlines, but it would be good if the structure as such uses the same basic syntax as regular lists.

(I moved my thought on definition list syntax from here to separate discussion in #193. I also wrote something on metadata vs. other data which doesn’t concern (meta) data structure as such in #192.)

jgm commented 1 year ago

Here's how metadata might look with reST style "field lists" (https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#field-lists):

# My Document

:author: Alex Kladov
:highlight-code: true
:highlight-theme: GitHub
:abstract:
  Lorem Ipsum Dolores

  Bla Bla Bla

## First section

This allows formatted and even block-level content for the fields. It does not yet support structured fields (e.g. metadata in the form of lists or structured objects). In pandoc you can have:

---
author:
- name: Sam Smith
  institution: Cal Tech
- name: Julie Wang
  institution: UCLA
...

which is quite useful. Of course we could model that as

:author:
  - :name: Sam Smith
    :institution: Cal Tech
  - :name: Julie Wang
    :institution: UCLA

or perhaps

:author:
  :name: Sam Smith
  :institution: Cal Tech
:author:
  :name: Julie Wang
  :institution: UCLA

Still, you lose some flexibility. E.g. in pandoc you can have

---
author:
- John Smith
- Julie Wang
...

which in pandoc metadata is clearly a ListMeta and not a BlockMeta with a list as its contents. For the latter, you'd write

---
author: |
  - John Smith
  - Julie Wang
...

In the field list syntax,

:author:
  - John Smith
  - Julie Wang

doesn't distinguish the two meanings distinguished above. But perhaps the "repeated key" approach does:

:author: John Smith
:author: Julie Wang

Note 1: with the "repeated key" approach to forming list metadata, you would not be able to override an earlier metadata item with a later one, as you can in pandoc.

Note 2: field lists would create an ambiguity with the symb syntax, making it impossible to express a paragraph beginning with a symb. That's probably bad. We could revisit symb syntax or find a different syntax for field lists.
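
A small Python sketch of the "repeated key" approach, to make the trade-off in Note 1 concrete. It only handles single-line `:key: value` fields, and the function name `parse_field_list` is made up for the example; it is not djot or docutils behaviour.

```python
import re

# ':key: value' on one line; the key itself may not contain a colon.
_FIELD = re.compile(r":([^:\s][^:]*):[ \t]*(.*)")


def parse_field_list(lines):
    """Collect ':key: value' lines; a repeated key turns the value into a list."""
    meta = {}
    for line in lines:
        match = _FIELD.fullmatch(line.strip())
        if match is None:
            continue  # not a single-line field; ignored in this sketch
        key, value = match.groups()
        if key not in meta:
            meta[key] = value
        elif isinstance(meta[key], list):
            meta[key].append(value)
        else:
            # Second occurrence: promote the existing string to a list.
            meta[key] = [meta[key], value]
    return meta
```

Because a second `:author:` is appended rather than replacing the first, there is no way left for a later field to override an earlier one, which is exactly the limitation Note 1 points out.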

jgm commented 1 year ago

All in all, I'm still liking a simplified YAML-ish syntax best.

Omikhleia commented 1 year ago

Let's do a small thought experiment. Please take it with a pinch of salt.

Can I have metadata with the existing Djot reader/parser as-is without changes, and just simple tweaks to my Djot renderer to some format?

What is a metadata item? Let's go for the simplest form: a (key, value) that can possibly be defined anywhere in my document, does not affect normal output, and can be collected so my renderer may do something useful with it. The value can contain formatted text, possibly spanning multiple paragraphs.

Wait a second... I do already have a construct for that! It's called a "footnote reference"... Let's go for it, but distinguishing it from my regular footnote space: I could just "reserve" some keys by a mere naming convention... Let's use colons in those metadata key names, for instance...

[^:author:]: John Smith{.smallcaps}
[^:title:]: The _Great_ Book

Now my renderer just has to look in the footnote references for those keys-with-colons, and use their value (e.g. put that title in a running header, or whatever).

Without needing YAML, etc. I can even have (Djot) lists and all whatnot's there! My renderer just has to use the bits of the Djot AST that it needs. It's quite straightforward...

So... Problem solved!

But wait again... I could still actually refer to those weird pseudo-footnotes in the flow of my text. Why not, no problem, and this might actually even be handy...

As the author[^:author:] said....

Ahem! Thinking further, Djot has this small loosely-defined things called "symbols" too... I don't really need emojis or whatever it was supposed to be... So let's assume my renderer could actually resolve these symbols using my metadata footnote references?

.... And suddenly, I went beyond just having metadata support... I also got templating with recursive variable substitution available... Saving myself the need for a pre-processing step in my workflow:

[^:author-firstname:]: John
[^:author-lastname:]: Smith{.smallcaps}
[^:author:]: :author-firstname: :author-lastname:

By the way, my name is :author:, pleased to meet you!

Nifty.

What could go wrong here? :rofl:

One could argue that using footnote syntax for this stuff is bad semantics. Quite right, possibly... but this is a lightweight markup language, so heh, after all... And if one wanted really distinct markup for different things, it's no longer lightweight, and it does already exist... it's called XML :grin:
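
For what it's worth, the thought experiment is easy to prototype. The Python sketch below works on raw text lines rather than on the Djot AST (a renderer would use the AST instead), and the function names `collect_metadata` and `resolve` are invented for the example. It also answers part of "what could go wrong": circular references need to be detected.

```python
import re

_DEFINITION = re.compile(r"\[\^:([\w-]+):\]:[ \t]*(.*)")
_SYMBOL = re.compile(r":([\w-]+):")


def collect_metadata(lines):
    """Gather '[^:key:]: value' lines into a dict of raw (unresolved) values."""
    meta = {}
    for line in lines:
        match = _DEFINITION.fullmatch(line.strip())
        if match:
            meta[match.group(1)] = match.group(2)
    return meta


def resolve(text, meta, seen=()):
    """Replace ':key:' symbols with metadata values, recursively."""
    def substitute(match):
        key = match.group(1)
        if key not in meta:
            return match.group(0)  # unknown symbol: leave it untouched
        if key in seen:
            raise ValueError(f"circular metadata reference: {key}")
        return resolve(meta[key], meta, seen + (key,))
    return _SYMBOL.sub(substitute, text)
```

Feeding it the three author definitions above and the sentence "By the way, my name is :author:, pleased to meet you!" yields the sentence with `John Smith{.smallcaps}` substituted in.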

vassudanagunta commented 1 year ago

@Omikhleia,

Wait a second... I do already have a construct for that! It's called a "footnote reference"... Let's go for it, but distinguishing it from my regular footnote space: I could just "reserve" some keys by a mere naming convention...

...

So let's assume my renderer could actually resolve these symbols using my metadata footnote references?

.... And suddenly, I went beyond just having metadata support... I also got templating with recursive variable substitution available... Saving myself the need for a pre-processing step in my workflow:

...

One could argue that using footnote syntax for this stuff is bad semantics.

Not at all. The form:

[key]: value

is already overloaded, used by both Reference link definitions and Footnotes, with the latter effectively carving out a key namespace with all its keys prefixed with ^.

In a meta markup language I'm working on (Plain Text Style Sheets), I've a generalized notion of reference definitions (is there a better name?) which includes key-value definitions just as you described, supporting reference links, footnotes, metadata, and automatic substitutions/macros. References can also be defined for content elements, e.g. named anchors to headings or any block/inline span, table and figure references, important term introductions/definitions, citations, index entries, glossary definitions, hashtags. Recursive resolution is also supported. Author/reader-friendly namespaces, if necessary, are easily defined by a simple character prefix, e.g. ^ for footnotes, # for hashtags, though I don't recommend too many namespaces as multiple definitions for the same base key will be confusing. Different ambiguity resolution rules are supported, e.g. first def wins (like CommonMark), scope-based (defined by section and page hierarchies) or strict/fail-fast on any name collision. I'd also like to make numbered list items automatically referenceable, e.g. for a link to "step 2" that also reflects any list item renumbering.

pkulchenko commented 1 year ago

I'd like to add my 2c after reading this thread, as I'm very interested in this functionality and plan to integrate it into one of my projects. As far as I understand, there are two (largely independent) aspects being discussed: the location of the metadata and its format.

Location:

Format: various options are listed in https://github.com/jgm/djot/issues/35#issuecomment-1310567197 and https://github.com/jgm/djot/issues/35#issuecomment-1384435836. Most of the options are format-independent, so can be integrated with any of the proposed formats, but using footnote references would largely define the format as well.

I listed some of the pros/cons for each of the options (although I'm sure the list can be extended). All locations require their content to be hidden (maybe with the exception of footnote references), so may not work well with processors that don't recognize the syntax.

I find the option of using footnote references really interesting, but it's likely to suffer from difficulties expressing elements that require arrays or sub-elements (for example, multiple authors with names, email and affiliations). If there is a good way to address this, then I'd favor this option. The attribute syntax has similar advantages (and is likely a bit less verbose), but doesn't allow multi-line values.

Using the meta element is probably the most flexible one, but would require separate processing, depending on which format is selected. I'd prefer Lua tables (and there are easy ways to suppress function execution there if needed), but I can understand why other formats may need to be supported (instead or as well).

(updated 7/30 to add attribute syntax)

Omikhleia commented 1 year ago

Nice summary @pkulchenko

additional file (just for the sake of completeness)

The content of a "metadata" block remains to be specified, with use cases depending largely on the context. It suffices to look at how scattered the use of such blocks is in existing Markdown solutions: static web site or blog generators all have their own conventions, without clear namespacing. In some documents I saw sansfont, margin-xxx, etc., which conflates styling paradigms with rendering options for specific tools.

In other words, to @jgm's initial question ("Should there be a built-in format for metadata, or should that be considered distinct from the markup syntax?"), I am tempted to answer negatively to the first point (and thus positively to the second).

TheDecryptor commented 1 year ago

(Yet) another option would be raw blocks; they're already set aside for special treatment by the processor.

``` =yaml
author: My Name Here
date: 2023-08-01
tags: [a, b, c]
```

...

It does leave the exact choice of metadata format up to the application consuming the document, which is a bit of a shame, but since the block explicitly states its format, you could always rewrite it easily if needed.

toastal commented 1 year ago

YAML has such overcomplicated parsing rules. I’d be happier with something simpler but based on YAML rather than full YAML compliance if going that route—since full compliance would likely involve reliance on an entire YAML parser library as a dependency.

Personally I don’t like the ad hoc nature of reusing a code block versus something more first-class as it becomes trickier to understand that it’s special, such as for editors to suggest the block is foldable/concealable, etc., or for consistent metadata fetching. It would be ideal in many build systems to be able to call something like djot metadata --format json so outputs can be piped to other tools. If it could be in several inconsistent formats, this task becomes difficult.

bpj commented 1 year ago

It won’t do to just assume that a raw block marked with =yaml or =json or whatever is a metadata block. What if you are writing documentation for software which takes its configuration from files written in the format you have chosen as metadata format? At the very minimum you will need to use for example =meta-yaml, but overall it is much better to have a built-in metadata format in dedicated metadata blocks uniquely marked as such which the djot parser parses out of the box, and which is expressive enough from the get-go so that people aren’t tempted to come up with bespoke extensions or alternatives.
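As a rough illustration of the disambiguation point, here is a minimal Python sketch that collects only raw blocks tagged with a reserved format and leaves ordinary `=yaml` blocks alone. The function name `raw_blocks` is hypothetical, the `meta-yaml` tag is just the example name from the paragraph above, and the line scanner is a simplification written for this note, not how the djot parser works.

```python
import re

# An opening fence: three or more backticks plus an optional info string.
FENCE = re.compile(r"^(`{3,})\s*(\S*)\s*$")


def raw_blocks(source: str, fmt: str) -> list[str]:
    """Return the bodies of raw blocks whose format tag is exactly `fmt`.

    Hypothetical helper: a line-based scan, assuming fences start in column 0.
    """
    bodies = []
    lines = source.splitlines()
    i = 0
    while i < len(lines):
        opener = FENCE.match(lines[i])
        if not opener:
            i += 1
            continue
        fence, info = opener.group(1), opener.group(2)
        # The closing fence must be at least as long as the opening one.
        closer = re.compile(r"`{%d,}\s*" % len(fence))
        body = []
        i += 1
        while i < len(lines) and not closer.fullmatch(lines[i]):
            body.append(lines[i])
            i += 1
        i += 1  # step over the closing fence (or past the end of input)
        if info == "=" + fmt:
            bodies.append("\n".join(body))
    return bodies
```

With this, a `=yaml` block that documents some other tool's configuration is never mistaken for document metadata, because only the reserved tag is collected. Parsing the returned text is still left to whatever metadata format is chosen.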

I agree that a dependency on a full YAML library probably should be avoided, but it would be good to consider what makes YAML attractive:

Unfortunately this human-reader friendliness comes at the price of requiring syntax rules which often are not at all intuitive to human writers in order to accommodate the “computer reader”. So what is needed is a reasonable (assuming YAML is unreasonable) compromise between those human-friendly features and features which are “computer friendly”.

However I believe that this dichotomy is a bit of a red herring: any format which is meant to be read and written by both humans and computers has to strike such a balance, including djot, which already leans heavily in the direction of human-friendliness. I have said it before: djot already can parse both key-value lists, namely definition lists, and bullet lists, so it makes sense to reuse djot list syntax for which the parsing facilities are already in place! The problem is that you probably won’t usually want metadata values (or keys) to be parsed into textual elements — emphasis, spans etc. One solution might be to mark “raw text” as raw blocks/spans with a format =text since it is probably unlikely that someone will come up with a code or markup format called “text”. The problem with this is that it means that what probably is the most common case will be specially marked. Perhaps the best solution to this is to simply not at all support markup inside metadata keys and values beyond the basic key-value/bullet item structure, which maybe can be handled by a parameter to the list parsing function(s)? If the metadata values are plain strings with any markup literally preserved the application using djot can pass individual metadata values to the djot parser as and when needed.
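To make the last suggestion concrete, here is a small Python sketch of metadata written with djot-style definition-list and bullet-list syntax, where values are kept as plain strings and any markup in them is preserved literally. The function `parse_meta` is hypothetical and handles only a flat, single-level structure; it is an assumption about what such a mode could look like, not djot's list parser.

```python
def parse_meta(text: str) -> dict:
    """Parse ': key' terms followed by indented values into a dict.

    Hypothetical sketch: values stay plain strings (no inline parsing);
    a value made up only of '- item' lines becomes a list of strings.
    """
    collected = {}
    key = None
    for line in text.splitlines():
        if line.startswith(": "):
            # A definition-list term starts a new key.
            key = line[2:].strip()
            collected[key] = []
        elif key is not None and line.strip():
            # Any non-blank line belongs to the current key's value.
            collected[key].append(line.strip())

    meta = {}
    for name, lines in collected.items():
        if lines and all(item.startswith("- ") for item in lines):
            meta[name] = [item[2:].strip() for item in lines]
        else:
            meta[name] = " ".join(lines)
    return meta


sample = """\
: title

  My *literal* title

: tags

  - a
  - b
"""
print(parse_meta(sample))
```

The `*literal*` in the title comes back untouched, so an application that does want emphasis there can pass that one string to the djot parser itself.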

pkulchenko commented 1 year ago

Interesting comments. I've spent some time trying different options and then looking at the generated HTML, JSON and AST. To me the attribute approach looks like a winner given how concise it is compared to some other options. I also like to think about it as a way to associate attributes with the document itself instead of specific elements.

I'm interested in being able to support the following:

The approach with attributes checks all these boxes for me as shown in the following example:

{attr1="bar and\
 baz"
 .clssy
 attr2=more}
{updated=20230801 attr2=less}
---

# title

{source="personal-experience"}
> More than three people on one
> bicycle is *not* recommended.

I'd recommend using --- as the first element to associate attributes with (as it looks like the existing front-matter syntax from Jekyll), but it's actually optional for my proposal. I'd use the attributes from the very first element in the document, with the exception of a section (as associating the attributes with a header creates a section/heading structure with the attributes attached to the heading element). One advantage of using the thematic_break (---) is that it will get only meta attributes, whereas other elements may have their own attributes, but it's a minor consideration.

This approach allows adding and overwriting attributes (as shown above with attr2 getting less assigned instead of more) and possibly providing multi-string values (although it may require using \EOL escaping). All this information is already available in html, JSON and AST, so wouldn't require any additional processing and can accept any custom attributes.
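The add-and-overwrite behaviour can be sketched in Python as follows. This is only an illustration of the merging rule described here (later values replace earlier ones, classes accumulate). The name `merge_attributes` and the small regex tokenizer are assumptions made for this sketch, and turning a backslash-newline inside a quoted value into a plain newline is a guess at the intended result, not confirmed djot behaviour.

```python
import re

# One token per match: .class, #id, key="quoted value", or key=bare
TOKEN = re.compile(
    r'\.(?P<cls>[\w-]+)'
    r'|\#(?P<id>[\w-]+)'
    r'|(?P<key>[\w-]+)=(?:"(?P<quoted>(?:\\.|[^"\\])*)"|(?P<bare>[\w-]+))',
    re.DOTALL,
)


def merge_attributes(*blocks: str) -> dict:
    """Merge attribute blocks in order; the last value for a key wins.

    Hypothetical helper: classes are collected into one space-separated
    'class' entry, and backslash escapes in quoted values are removed.
    """
    attrs = {}
    classes = []
    for block in blocks:
        for m in TOKEN.finditer(block.strip().strip("{}")):
            if m.group("cls"):
                classes.append(m.group("cls"))
            elif m.group("id"):
                attrs["id"] = m.group("id")
            else:
                value = m.group("quoted")
                if value is None:
                    value = m.group("bare")
                # Drop the backslash of each escape, keeping the next character.
                attrs[m.group("key")] = re.sub(r"\\(.)", r"\1", value, flags=re.DOTALL)
    if classes:
        attrs["class"] = " ".join(classes)
    return attrs


print(merge_attributes(
    '{attr1="bar and baz" .clssy attr2=more}',
    "{updated=20230801 attr2=less}",
))
```

In the merged result `attr2` ends up as `less`, matching the example above, while `attr1`, `updated` and the class survive unchanged.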

This syntax is quite forgiving in terms of quotes being optional, but it does require brackets to be on the same line as some of the text. I'll try it with a few more scenarios and report back if I run into any difficulties with it.

Omikhleia commented 1 year ago

> One advantage of using the thematic_break (---) is that it will get only meta attributes, whereas other elements may have their own attributes, but it's a minor consideration.

Just a quick remark: this is not true, thematic breaks can (and should) have real attributes, with nothing "meta" about them. In real books, no one uses a mere (full or not) rule in all circumstances. I am currently using, for instance (non exhaustively):

{ .dinkus }
---

{ .asterism }
---

{ .pagebreak .pendant type=floral }
---

I.e. styling thematic breaks (here, to obtain, respectively, a centered * * *, or a floral pendant introducing a page break in print) while still preserving semantics (it is still a thematic break, so an hr-like rule or whatever remains an option for non-compliant renderers, or for non-existing or non-supported styles). And though rare, there are cases when it had to occur at the start of a (sub)document. That is to say: overloading the existing thematic break with other considerations is likely the wrong approach. It has its own right to classes and attributes!

pkulchenko commented 1 year ago

> Just a quick remark: this is not true, thematic breaks can (and should) have real attributes, with nothing "meta" about them. In real books, no one uses a mere (full or not) rule in all circumstances.

I should have been more explicit; what I meant was that in this case, the thematic break is only added to separate (document) attributes from the rest of the document, so it won't have any other attributes (as it wouldn't exist in the document otherwise). Associating document attributes with any other element would lump them together with all other attributes that may already exist for that element.

toastal commented 1 year ago

> front matter

The whole concept of ‘front matter’ exists because Markdown, unlike most other document/media file formats, did not provide a native way to add metadata. It’s a hack & should be avoided, not replicated.