jupyter-book / myst-spec

MyST is designed to create publication-quality, computational documents written entirely in Markdown.
https://mystmd.org/spec
MIT License

Standardisation of common attributes (classes, names) #32

Open chrisjsewell opened 2 years ago

chrisjsewell commented 2 years ago

As specified in https://docutils.sourceforge.io/docs/ref/doctree.html#common-attributes, all docutils nodes share a set of common attributes, and myst-spec should essentially do the same.

As an example, here: https://github.com/executablebooks/myst-spec/blob/35f80974a69f68490b007c3e6d919ed246f64594/docs/examples/directives.admonitions.yml#L122

This should be classes: ['tip']

rowanc1 commented 2 years ago

Is there something in the mdast ecosystem to point to?

React uses className and HTML uses class. Both are strings.

Can we have an array of classes with spaces in them?

This shouldn't be allowed:

classes:
    - 'myClass mySecondclass'

Instead:

classes:
    - myClass
    - mySecondclass
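
A minimal sketch of the normalisation this implies, assuming classes may arrive either as one space-separated string or as a list of strings. The helper name `normalize_classes` is hypothetical and is not part of myst-spec or any existing parser.

```python
def normalize_classes(value):
    """Split class strings on whitespace so each list entry is one class name."""
    if isinstance(value, str):
        value = [value]
    classes = []
    for entry in value:
        for name in entry.split():
            if name not in classes:  # drop duplicates, keep first-seen order
                classes.append(name)
    return classes


# The disallowed form from the example above becomes the allowed form.
print(normalize_classes(["myClass mySecondclass"]))
print(normalize_classes("tip"))
```

Normalising on input means a renderer can join the list with spaces for HTML's class or React's className without re-validating each entry.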

chrisjsewell commented 2 years ago

FYI in unified-myst, this is what I am currently doing: https://github.com/executablebooks/unified-myst/blob/096dd8da49ce609ea9a1edec4a492e3798f63df1/packages/core-parse/src/directiveProcessor.js#L83-L111

chrisjsewell commented 2 years ago

@rowanc1 and @fwkoch, further to our discussion regarding identifier: on reflection, I feel that allowing only a single identifier per element is simply irreconcilable with jupyter-book/myst-parser.

Take this simple example:

# main

## subtitle

(target1)=
(target2)=
### Sub-subtitle

[ref1](target1)
[ref2](sub-subtitle)

This is how it is resolved by docutils:

$ myst-docutils-pseudoxml test.md            
<document ids="main" names="main" source="test.md" title="main">
    <title>
        main
    <subtitle ids="subtitle" names="subtitle">
        subtitle
    <target refid="target1">
    <target refid="target2">
    <section ids="sub-subtitle target2 target1" names="sub-subtitle target2 target1">
        <title>
            Sub-subtitle
        <paragraph>
            <reference refid="target1">
                ref1

            <reference refid="sub-subtitle">
                ref2

As you can see, not only is the header assigned the identifiers coming from the targets, it is also assigned a "slug" identifier based on its content (which is not an unusual practice when rendering Markdown).
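
To make the combination of implicit and explicit identifiers concrete, here is a rough sketch. It assumes a simplified slug rule and an ordering chosen only to mirror the pseudo-XML output above; it is not docutils' actual algorithm, and both `slugify` and `section_ids` are hypothetical helpers.

```python
import re


def slugify(title):
    """Simplified slug: lowercase, runs of non-alphanumerics become '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def section_ids(title, pending_targets):
    """Implicit slug first, then explicit targets, nearest to the heading first."""
    return [slugify(title)] + list(reversed(pending_targets))


# Targets are listed in source order: (target1)= appears before (target2)=.
ids = section_ids("Sub-subtitle", ["target1", "target2"])
print(ids)
```

The point of the sketch is only that a single heading ends up carrying three identifiers, any of which a reference may use.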

Not allowing multiple identifiers would render this example, and by extension jupyter-book itself, non-compliant with myst-spec, which is obviously extremely problematic 😬.

To clarify some extra terminology from docutils:

Here also is the rendering of this example as html/latex:

$ myst-docutils-html5 test.md
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
...
<body>
<main id="main">
<h1 class="title">main</h1>
<p class="subtitle" id="subtitle">subtitle</p>

<section id="sub-subtitle">
<span id="target2"></span><span id="target1"></span>
<h2>Sub-subtitle</h2>
<p>
<a class="reference internal" href="#target1">ref1</a>
<a class="reference internal" href="#sub-subtitle">ref2</a>
</p>
</section>
</main>
</body>
</html>
$ myst-docutils-latex test.md
...
\begin{document}
\title{main%
  \label{main}%
  \\%
  \DUdocumentsubtitle{subtitle}%
  \label{subtitle}}
\author{}
\date{}
\maketitle

\section{Sub-subtitle%
  \label{sub-subtitle}%
  \label{target2}%
  \label{target1}%
}

\hyperref[target1]{ref1}
\hyperref[sub-subtitle]{ref2}

\end{document}

chrisjsewell commented 2 years ago

FYI, if you want to see how anything is resolved by myst-docutils, install pipx (https://github.com/pypa/pipx) and run pipx install myst-docutils, which will give you access to the CLIs used above.

rowanc1 commented 2 years ago

Can you point to a user using multiple stacked targets in a Jupyter Book that exists today? Or an example of this being used in a sphinx project?

A few notes:

  1. This is a divergence from what is set out in mdast (which already has precedent for identifier/label).
  2. It is a large complication to support in downstream tools.
  3. Helping users get to a canonical ID should be a goal of our work. This has benefits in science communication (e.g. see work on PIDs).
  4. The HTML example output is not semantic (e.g. $('id').innerText). I think we can do better than docutils here.
  5. The LaTeX compiles, but is also very non-standard, and there are zero tutorials I have found that suggest this is even possible (i.e. it is a relatively unknown quirk of LaTeX).
  6. The python parser could easily: throw a warning on multiple stacked labels before passing on to sphinx, and help the user improve their references to get to a canonical, explicit label.
  7. There is no reason that myst has to support all quirks of docutils

My take: multiple IDs are unused (I have never seen this in any non-contrived user example[^1]), bring no additional features to the end user, and can be easily cleaned up by throwing warnings in a post-parsing transform in any implementation. Introducing a list of IDs to refer to a single element is a significant additional complexity that makes all state management harder, especially for cross-project linking (e.g. some equivalent of inter-sphinx, or any work around PIDs ongoing in research/library communities).
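
The warning idea can be sketched as a scan over the Markdown source, assuming targets use the `(name)=` syntax on their own line. This is an illustration only: `find_stacked_targets` and `warn_on_stacked_targets` are hypothetical names, and the sketch does not hook into myst-parser, docutils, or sphinx.

```python
import re
import warnings

# A target line looks like "(target1)=" with nothing else on the line.
TARGET_RE = re.compile(r"^\((?P<name>[^()\s]+)\)=\s*$")


def find_stacked_targets(text):
    """Return (line_number, names) for each run of two or more consecutive targets."""
    runs, current, start = [], [], None
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = TARGET_RE.match(line)
        if match:
            if not current:
                start = lineno
            current.append(match.group("name"))
            continue
        if len(current) > 1:
            runs.append((start, current))
        current = []
    if len(current) > 1:  # a run that reaches the end of the file
        runs.append((start, current))
    return runs


def warn_on_stacked_targets(text):
    """Emit one warning per stacked run and return the runs found."""
    runs = find_stacked_targets(text)
    for lineno, names in runs:
        warnings.warn(f"line {lineno}: stacked targets {names}; consider keeping one")
    return runs


SOURCE = """# main

(target1)=
(target2)=
### Sub-subtitle
"""
print(find_stacked_targets(SOURCE))
```

Note that a source-level scan like this only sees explicit targets; it says nothing about the implicit slug that the heading also receives.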

Looking forward to talking this through on Monday. There are lots of options on how to support this in Python/JB before passing on to sphinx. I am suggesting we support a subset of sphinx's complexity, and provide tools to help users refactor their documents with explicit references/labels.

[^1]: With the possible exception of implicit references that have subsequently been made explicit. I think this can be taken care of in a state-management task rather than in the MDAST spec though.

chrisjsewell commented 2 years ago

> Looking forward to talking this through on Monday.

Yeah, absolutely, happy to discuss. What I want to emphasize is that this is not a trivial choice. As we have discussed previously, myst-spec should initially represent what MyST actually is now, not what we want it to be in the future.

> Can you point to a user using multiple stacked targets in a Jupyter Book that exists today?

Any project that refers to headings by both targets and heading slugs.

> There are lots of options on how to support this in Python/JB before passing on to sphinx. The python parser could easily: throw a warning on multiple stacked labels before passing on to sphinx

I feel this is somewhat of a misunderstanding of how Jupyter Book (via myst-parser) works: none of this processing is done by myst-parser; it is all handled by docutils/sphinx. Getting myst-parser to act in this manner, if it could be done at all, would require at least a substantial rewrite to override core parts of docutils functionality.

> There is no reason that myst has to support all quirks of docutils

I would not say that this is merely a quirk of docutils, though; it is a core design aspect: https://docutils.sourceforge.io/docs/ref/doctree.html#common-attributes

> significant additional complexity, especially for cross-project linking (e.g. some equivalent of inter-sphinx)

But intersphinx already works with multiple IDs.

> This is a divergence from what is set out in mdast (which already has precedence for identifier/label)

I feel this is a misunderstanding of what identifier is actually used for in MDAST. It is not a canonical ID for a node, and whether we use singular or multiple IDs for a node, they should not be stored under identifier, specifically to distinguish them from MDAST's identifier. Take as an example:

[a]

[a]: https://example1.com
[a]: https://example2.com

which parses to MDAST resembling:

<paragraph>
  <linkReference identifier="a">
<definition identifier="a">
<definition identifier="a">
  1. linkReference has an identifier which is not actually its own identity; it is what the node is referencing (https://github.com/syntax-tree/mdast#association)
  2. There are multiple definitions with the same identifier (because they are eventually resolved "implicitly")
  3. definition.identifier can only be referenced by linkReference; these identifiers are completely independent of myst identifiers, e.g. you cannot do {ref}`a`

The same applies to footnoteReference/footnoteDefinition.
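
The "implicit" resolution of duplicate definitions can be sketched as follows. CommonMark specifies that when several link reference definitions match, the first one takes precedence; the dictionaries below are a simplified stand-in for mdast definition nodes, and `resolve_link_reference` is a hypothetical helper.

```python
def resolve_link_reference(identifier, definitions):
    """Return the URL of the first definition whose identifier matches, else None."""
    for definition in definitions:
        if definition["identifier"] == identifier:
            return definition["url"]
    return None


# Both definitions share identifier "a", as in the example above.
definitions = [
    {"type": "definition", "identifier": "a", "url": "https://example1.com"},
    {"type": "definition", "identifier": "a", "url": "https://example2.com"},
]
print(resolve_link_reference("a", definitions))
```

Because the lookup is keyed on the definitions rather than on the referencing node, identifier here behaves as an association key and not as a unique node ID.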

Whether we use something like mystId (singular) or mystIds (plural), a core requirement should be: in a "well-formed" document, I am able to walk through the AST and generate an unambiguous mapping of REFID -> Node, in order to resolve what a {ref} is pointing towards. For this requirement, note that it does not actually matter whether the relationship is one-to-one or many-to-one (just as long as it is not one-to-many or many-to-many).
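
A minimal sketch of that requirement, assuming nodes are plain dictionaries with an optional `mystIds` list (the plural field name floated above, not an agreed part of the spec). `walk` and `build_ref_map` are hypothetical helpers.

```python
def walk(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def build_ref_map(root):
    """Map each reference ID to exactly one node; many IDs per node are allowed."""
    ref_map = {}
    for node in walk(root):
        for ref_id in node.get("mystIds", []):
            if ref_id in ref_map and ref_map[ref_id] is not node:
                # One ID pointing at two nodes is the ambiguous (one-to-many) case.
                raise ValueError(f"ambiguous reference id: {ref_id!r}")
            ref_map[ref_id] = node
    return ref_map


heading = {
    "type": "heading",
    "depth": 3,
    "mystIds": ["sub-subtitle", "target2", "target1"],
    "children": [{"type": "text", "value": "Sub-subtitle"}],
}
root = {"type": "root", "children": [heading]}
ref_map = build_ref_map(root)
print(sorted(ref_map))
```

The many-to-one case costs nothing extra in this mapping: three keys simply point at the same node, and only a repeated ID on a different node is rejected.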

> Helping users get to a canonical ID should be a goal of our work.

Taking the above discussion, I would ask: what do you mean by a canonical ID? You can essentially have multiple ID "sets" within a single document: IDs relating to definitions, footnotes, {ref}, Jupyter code cells, and intersphinx (there is now a separate external role: https://github.com/sphinx-doc/sphinx/pull/9822).

sphinx essentially handles this via the any role and the resolution logic underpinning it (https://www.sphinx-doc.org/en/master/usage/restructuredtext/roles.html?highlight=roles#role-any). Domains can maintain their own identifier maps for particular reference sets.

chrisjsewell commented 2 years ago

One thing to consider is also setting a (probably SHA256) UUID for every node in the AST. This would provide an "unequivocal" identifier for all nodes, irrespective of what was referencing them. Specific reference names are then just aliases to those.
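
One way this could look, as a sketch under stated assumptions: each node's ID is the SHA-256 hex digest of its position in the tree plus its type, and reference names are stored in a separate alias table. The hashing input, the `names` field, and `assign_node_ids` are all hypothetical choices made for illustration.

```python
import hashlib


def assign_node_ids(node, path="0", registry=None, aliases=None):
    """Give every node a hash-based ID and record reference names as aliases."""
    registry = {} if registry is None else registry
    aliases = {} if aliases is None else aliases
    # Hash the tree position and node type so identical siblings still differ.
    node_id = hashlib.sha256(f"{path}:{node['type']}".encode()).hexdigest()
    registry[node_id] = node
    for name in node.get("names", []):
        aliases[name] = node_id
    for index, child in enumerate(node.get("children", [])):
        assign_node_ids(child, f"{path}.{index}", registry, aliases)
    return registry, aliases


heading = {
    "type": "heading",
    "names": ["sub-subtitle", "target2", "target1"],
    "children": [{"type": "text", "value": "Sub-subtitle"}],
}
root = {"type": "root", "children": [heading]}
registry, aliases = assign_node_ids(root)
print(len(registry), sorted(aliases))
```

With this split, resolving a reference is two lookups (name to node ID, then node ID to node), and any number of names can alias the same node without the node itself carrying a list of identifiers. A position-based hash changes when the document is restructured, so a real scheme would need to decide what the ID should be stable against.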