Defining mdast for citations

rowanc1 commented 2 years ago

Currently doing some investigation on citations and thought I would post it here, as it would be great to get on the same page about the data structures for citations in mdast. (There is probably more thinking to do on the MyST syntax itself, e.g. whether we adopt [@key] pandoc-style citations.) I would love to be aiming for the same place for the mdast data structures as the other syntax conversations evolve.

For a piece of technical content, the best practices for in-text citations are probably latex/natbib and pandoc citations which are defined here:

I think the following mdast data structures might capture everything:

type CiteGroup = {
  type: 'citeGroup'
  kind: 'narrative' | 'parenthetical'; // 'citet' vs 'citep'
  children: Cite[]
}

type Cite = {
  type: 'cite'
  identifier: string
  label: string
  expand: boolean // this is the * in natbib, expands authors, false by default
  partial: 'author' | 'year'
  prefix: string // e.g. "see" or "e.g."
  suffix: string // e.g. "99 years later" or something
  locator: string // e.g. "chap. 2", joined with a comma -- defined by CSL locale (pp. fig. etc.)
  // alias: string // use "Paper 1", maybe do this later?
}

I think this works pretty well and fits with the {cite:t}`jon22` syntax we already have defined, but maybe in the future there is some way to give roles more data. For example, {cite:p}[prefix="see", locator="chap. 2"]`jon22` would yield: (see Jones et al., 2022, chap. 2). Or maybe there is a specialized way to do this with [see @jon22, chap. 2] (see pandoc).
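To make the target concrete, here is a rough sketch of the node that the hypothetical `{cite:p}[prefix="see", locator="chap. 2"]` role might produce. The types are restated from the proposal above; marking the later fields as optional is an assumption on my part, not something the proposal states.

```typescript
// Types restated from the proposal; the `?` optionality is an assumption.
type Cite = {
  type: 'cite';
  identifier: string;
  label: string;
  expand?: boolean;
  partial?: 'author' | 'year';
  prefix?: string;
  suffix?: string;
  locator?: string;
};

type CiteGroup = {
  type: 'citeGroup';
  kind: 'narrative' | 'parenthetical';
  children: Cite[];
};

// One possible node for: {cite:p}[prefix="see", locator="chap. 2"]`jon22`
const node: CiteGroup = {
  type: 'citeGroup',
  kind: 'parenthetical',
  children: [
    {
      type: 'cite',
      identifier: 'jon22',
      label: 'jon22',
      prefix: 'see',
      locator: 'chap. 2',
    },
  ],
};

console.log(JSON.stringify(node, null, 2));
```

A renderer would then be responsible for turning this into "(see Jones et al., 2022, chap. 2)" using the bibliography data and the chosen citation style.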

For multiple citations, the citeGroup would never be a directive or appear in the markup itself (you would write [@key1; @key2] or {cite:p}`key1; key2`), but I think the AST is better represented by multiple nodes, with one node holding the group (parenthetical) information. This also means UIs can open groups of citations in a list (see distill/elife for good examples of this UI).
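As a minimal sketch of that idea, the role content `key1; key2` could be split into one group node with one child per key. The helper name `toCiteGroup` is hypothetical, and the `Cite` type is trimmed to the fields needed here.

```typescript
type Cite = { type: 'cite'; identifier: string; label: string };

type CiteGroup = {
  type: 'citeGroup';
  kind: 'narrative' | 'parenthetical';
  children: Cite[];
};

// Hypothetical helper: turn role content such as "key1; key2" into a group node.
function toCiteGroup(content: string, kind: CiteGroup['kind']): CiteGroup {
  const children = content
    .split(';')
    .map((key) => key.trim())
    .filter((key) => key.length > 0) // ignore empty segments like "a;;b"
    .map((key): Cite => ({ type: 'cite', identifier: key, label: key }));
  return { type: 'citeGroup', kind, children };
}

const group = toCiteGroup('key1; key2', 'parenthetical');
console.log(group.children.map((c) => c.identifier));
```

Because the group is its own node, a UI can treat the whole parenthetical as one interactive element while each child keeps its own link.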

Both cite and citeGroup would be flow content, so the equivalent of a "citet" in latex is just a cite node in a paragraph (@key1 in pandoc style).

Some questions:

what is the best name for citeGroup?
~~should we follow kind or have some different flags like parenthetical? I suggested kind because that seemed easier to expand in the future if we add num or alt etc.~~ (previously suggested a single cite node, splitting into group solves this).
narrative and parenthetical nomenclature comes from here

Existing implementations:

similar data structure here: https://github.com/timlrx/rehype-citation/blob/main/src/parse-citation.js#L139

Would be curious on your thoughts @chrisjsewell and @fwkoch (maybe @mmcky as well?)!

chrisjsewell commented 2 years ago

> Would be curious on your thoughts @chrisjsewell

See https://github.com/executablebooks/MyST-Parser/issues/511 😉

fwkoch commented 2 years ago

We still need info about kind, num, etc (i.e. the things you crossed out) on the cite group, right?

I had something like:

type CitationGroup = {
  type: 'citationGroup';
  kind: 'narrative' | 'parenthetical'; // 'citet' vs 'citep'
  parentheses: boolean; // if false, 'citealt' and 'citealp' instead
  mode: 'year' | 'numerical';
  children: Citation[];
};

And even a single citation is a child of a citation group in the AST?
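For reference, the `kind` and `parentheses` fields above line up with the four common natbib commands roughly as follows. This is only a sketch; the lookup table and the `styleFor` helper are hypothetical names, not part of any existing package.

```typescript
type GroupStyle = {
  kind: 'narrative' | 'parenthetical';
  parentheses: boolean;
};

// natbib command name -> the two group-level fields proposed above.
const NATBIB_STYLES = new Map<string, GroupStyle>([
  ['citet', { kind: 'narrative', parentheses: true }],
  ['citep', { kind: 'parenthetical', parentheses: true }],
  ['citealt', { kind: 'narrative', parentheses: false }],
  ['citealp', { kind: 'parenthetical', parentheses: false }],
]);

// Hypothetical helper: returns undefined for commands not in the table.
function styleFor(command: string): GroupStyle | undefined {
  return NATBIB_STYLES.get(command);
}

console.log(styleFor('citealp'));
```

The `mode` field is left out of the table because it reflects the bibliography style rather than the command used in the text.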

(Also, I like citation and citationGroup since these are "citations" not "cites" - but... that's more verbose and doesn't match natbib)

chrisjsewell commented 2 years ago

For sure, I think citations should be a "first-class citizen" of MyST 👍

One thing that I do think is worth thinking about is whether you actually need to restrict "citations" to just the conventional bibliography-type references. Essentially, the abstraction is just one or more keys that reference an external resource (bibtex, json, yaml, ...) which contains a dictionary of key -> fields, e.g.

key:
  field1: content
  field2: content

https://www.overleaf.com/learn/latex/Glossaries are also essentially the same abstraction as, to some extent, are https://myst-parser.readthedocs.io/en/latest/syntax/optional.html#substitutions-with-jinja2 (see also something I was playing around with https://github.com/chrisjsewell/sphinx-glossary/blob/main/docs/index.md)

Do you need different node types for all of these, or can it be "generalised"? Or at least share a parent interface
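One way the "shared parent interface" option could look, purely as a sketch: every name here (`KeyedReference`, `GlossaryTerm`, `KeyedResource`, `resolveReference`) is hypothetical and not taken from any spec.

```typescript
// Hypothetical shared shape: a key that points into an external key -> fields resource.
interface KeyedReference {
  type: string;
  identifier: string;
  label: string;
}

// Citation-specific fields live on the subtype only.
interface Cite extends KeyedReference {
  type: 'cite';
  locator?: string;
}

// Hypothetical node type for glossary entries.
interface GlossaryTerm extends KeyedReference {
  type: 'glossaryTerm';
}

// The external resource: key -> fields, as in the YAML example above.
type KeyedResource = Map<string, Record<string, string>>;

// Lookup works the same way for any node that shares the parent interface.
function resolveReference(
  node: KeyedReference,
  resource: KeyedResource,
): Record<string, string> | undefined {
  return resource.get(node.identifier);
}

const bibliography: KeyedResource = new Map([['jon22', { author: 'Jones', year: '2022' }]]);
const cite: Cite = { type: 'cite', identifier: 'jon22', label: 'jon22' };
const term: GlossaryTerm = { type: 'glossaryTerm', identifier: 'mdast', label: 'mdast' };

console.log(resolveReference(cite, bibliography));
console.log(resolveReference(term, bibliography));
```

The lookup is shared, while fields that only make sense for citations (such as `locator`) stay on the citation type.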

rowanc1 commented 2 years ago

Nice, I like those additions to the group @fwkoch -- the reason I also had cite is that it is an HTML element (see mdn), so sticking close to html/latex here seemed good. (Not sure about the group name though: in Curvenote we also use this group to wrap crossReferences, for example, which can collapse (Figure 1 & 2) while still having unique links to the content.)

@chrisjsewell, I think that citations are special/important enough to be their own mdast type, but maybe the syntax for creating them can be the same/extensible (which would be nice from a writing perspective). We are currently backing out our generalizations for citations in Curvenote after a few years: citations are special/weird enough to have their own dedicated type/apis/endpoints/etc. 🤷

Again, that is the mdast cite type only (e.g. locator isn't applicable to glossaries, or mode=year to abbreviations), I think the myst-syntax can be extensible though. 👍

mmcky commented 2 years ago

Thanks for starting this discussion. I agree with everyone here that citations are first-class citizens of any scientific document.

I think a lot of users will come from a LaTeX and bibtex background, so some basic LaTeX similarities would help, such as:

a simple {cite} role (as we have)
a way to change the style of references that are printed in the reference list to suit (i.e. Harvard etc.)
a way to change the style of references in the text such as [1] and Jones (2009)

This combination covers most of my use of references in the LaTeX universe.
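To illustrate the third point, switching the in-text style could come down to a single setting, similar to the `mode` field suggested earlier in the thread. This is only a sketch; `BibEntry` and `renderInText` are hypothetical names, and a real renderer would delegate to a proper citation style.

```typescript
// Hypothetical minimal bibliography entry.
type BibEntry = { author: string; year: number };

// Render a narrative in-text citation in either numerical or author-year style.
function renderInText(
  entry: BibEntry,
  index: number,
  mode: 'year' | 'numerical',
): string {
  return mode === 'numerical' ? `[${index}]` : `${entry.author} (${entry.year})`;
}

const jones: BibEntry = { author: 'Jones', year: 2009 };
console.log(renderInText(jones, 1, 'numerical')); // [1]
console.log(renderInText(jones, 1, 'year')); // Jones (2009)
```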

I also really like the flexibility being discussed here in adding more sophisticated references such as pages, "see" prefixes and chapter references. I agree that natbib is a good reference, and the concept of metadata for roles is an interesting idea. I wonder though if we added an optional extension syntax such as:

{cite}`jones1999 <<locator='chapter 2'>>`

My other wish list item would be support for .bib (bibtex) files as a source of data for the citations, as I know a lot of authors who have invested in bib collections; in addition, there are a lot of webpages that now provide copy-and-paste bibtex entries.

Also, this must be a javascript thing, but I don't fully see why we need both an object name and a type defined:

type CiteGroup = {
  type: 'citeGroup'

I guess you can't do the equivalent of isinstance() as done in python?
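For what it is worth, TypeScript `type` aliases are erased at compile time, so at runtime a node is a plain object and there is nothing for an `instanceof` check to test against. The string `type` field (which every unist/mdast node carries) is what code inspects instead. A minimal sketch, with trimmed-down node types and hypothetical helper names:

```typescript
type Cite = { type: 'cite'; identifier: string };
type CiteGroup = { type: 'citeGroup'; children: Cite[] };
type CitationNode = Cite | CiteGroup;

// Type guard: the closest equivalent to isinstance() for plain-object nodes.
function isCiteGroup(node: CitationNode): node is CiteGroup {
  return node.type === 'citeGroup';
}

// Hypothetical helper: a lone cite counts as one citation, a group as its children.
function countCitations(node: CitationNode): number {
  return isCiteGroup(node) ? node.children.length : 1;
}

const single: CitationNode = { type: 'cite', identifier: 'jon22' };
const grouped: CitationNode = {
  type: 'citeGroup',
  children: [
    { type: 'cite', identifier: 'key1' },
    { type: 'cite', identifier: 'key2' },
  ],
};

console.log(countCitations(single), countCitations(grouped));
```

This also means the AST can be serialized to JSON and read from any language without losing the node kinds.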

jupyter-book / myst-spec

Defining mdast for citations #21
