jgm / djot

A light markup language
https://djot.net
MIT License

Citations #32

Open jgm opened 2 years ago

jgm commented 2 years ago

We need a syntax for citations that can be plugged into citeproc-lua or sent to pandoc for processing.

Pandoc's citation syntax seems a good basis. One thing we might change would be the syntax for author-in-text citations, which is currently a bit tricky to parse, because it requires lookahead.

Perhaps instead of

@foo [p. 15]

we should have something like

[+@foo, p. 15]
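
To illustrate why the bracketed form avoids lookahead, here is a minimal Rust sketch, assuming the proposed `[+@key, suffix]` shape with a single key. The names `CitationMode` and `parse_single_citation` are hypothetical and not part of djot; the point is only that the mode is known from the first character inside the brackets.

```rust
/// Hypothetical citation modes; the names are illustrative, not from the djot spec.
#[derive(Debug, PartialEq)]
enum CitationMode {
    /// `[@foo, p. 15]`
    Parenthetical,
    /// `[+@foo, p. 15]`
    AuthorInText,
}

/// Parses one bracketed citation holding a single key.
/// Returns (mode, key, suffix), or None if the input is not a citation.
fn parse_single_citation(input: &str) -> Option<(CitationMode, &str, &str)> {
    let inner = input.strip_prefix('[')?.strip_suffix(']')?;
    // The mode is decided by the first character inside the brackets.
    let (mode, rest) = match inner.strip_prefix('+') {
        Some(rest) => (CitationMode::AuthorInText, rest),
        None => (CitationMode::Parenthetical, inner),
    };
    let rest = rest.strip_prefix('@')?;
    // Assumption: the key ends at the first comma or whitespace character.
    let end = rest
        .find(|c: char| c == ',' || c.is_whitespace())
        .unwrap_or(rest.len());
    if end == 0 {
        return None; // "@" with no key after it
    }
    let key = &rest[..end];
    let suffix = rest[end..].trim_start_matches(',').trim();
    Some((mode, key, suffix))
}
```

Under these assumptions, `parse_single_citation("[+@foo, p. 15]")` yields the author-in-text mode, the key `foo`, and the suffix `p. 15`.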
uvtc commented 2 years ago

I like the idea of djot having a simple unambiguous syntax for this that is less tricky to parse. It not only makes djot simpler and faster, but it also makes it easier for any future alternative implementations of djot to parse as well.

uvtc commented 2 years ago

@jgm , why do you suggest adding that + sign in there? Why not [@foo, p. 15] instead?

The [+@foo, p. 15] syntax suggests to me that it's one example of a more general syntax, as in

[+@foo ... ]  for citations
[+&foo ... ]  for ... maybe something else
[+*foo ... ]
[+_foo ... ]

jgm commented 2 years ago

[@foo, p. 15] is fine for a regular citation which might render as (Foo 2000, p. 15). I'm talking about syntax for an author-in-text citation, which would render as Foo (2000, p. 15).
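
The difference between the two renderings can be sketched in Rust as follows. This is only an illustration of the two shapes described above for an author-date style; `render` and its parameters are hypothetical, and real output would depend on the citation style in use.

```rust
/// Renders one citation in an author-date shape.
/// `author_in_text` selects "Foo (2000, p. 15)" over "(Foo 2000, p. 15)".
fn render(author: &str, year: u32, locator: &str, author_in_text: bool) -> String {
    // Year plus optional locator, e.g. "2000, p. 15".
    let tail = if locator.is_empty() {
        year.to_string()
    } else {
        format!("{}, {}", year, locator)
    };
    if author_in_text {
        // The author sits in the running text; only the rest is parenthesised.
        format!("{} ({})", author, tail)
    } else {
        // The whole citation is parenthesised.
        format!("({} {})", author, tail)
    }
}
```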

uvtc commented 2 years ago

Ah. I'm not very familiar with citations. Thanks.

kmaasrud commented 2 years ago

This syntax seems very natural to me. Same goes for using [+@foo, p. 15] for author-in-text citations. No need to reinvent the wheel, and the citeproc syntax is familiar for many.

Djot will be a perfect fit for academic writing---a natural continuation of Pandoc Markdown, which many (including me) are using in academia today. Thus, having a well-defined citation syntax seems very important to me. What will it take to implement this? I would be happy to help if I can!

NotAFedoraUser commented 1 year ago

Org-Mode, another markup language, added citation support in 9.5.

In that release they added the following syntax to mark up a citation:

According to [cite: common prefix;@Key123 page 13; @Key982 chap 1; common suffix] ...

This would render as (Key123 2000, p. 13; Key982 2009, chap. 1), for example. They also allow you to specify a style of citation:

[cite/t/c: ...]
      ^ ^
      | |
      | Variant
      Style (here, "t" means in-text, à la Foo (...))

The blog post from a contributor to Org-Mode lays it all out better than I could ever do in a GH issue: https://blog.tecosaur.com/tmio/2021-07-31-citations.html

Crucially, this kind of syntax would allow people to set different styles on each citation, which it seems is not (easily) accomplished in the discussed syntax proposal.

kmaasrud commented 1 year ago

According to [cite: common prefix;@Key123 page 13; @Key982 chap 1; common suffix] ...

That looks similar to what @jgm is proposing and to the current syntax used by pandoc-citeproc, just with an English-defined syntax (using the word cite), which we would like to avoid.

I'm still in favour of encapsulating a citation fully in square brackets for easy parsing, and I think the choice of @ for simple cites and +@ for author-in-text should be enough customization.

jgm commented 1 year ago

The org-mode syntax (which draws on and extends the pandoc syntax) gets more flexibility (different styles) at the price of verbosity and English-language keywords. So each has its drawbacks and its advantages.

NotAFedoraUser commented 1 year ago

With the current proposal, this:

In [+@Smith2014 page 21-23] he talks about...

Turns into this:

In Smith (2014, pp. 21--23) he talks about...

Whereas with syntax à la Org-Mode, it would be:

In [cite/t:@Smith2014 page 21-23] he talks about...

While Org-Mode's syntax is longer-winded, it is more flexible, allowing for more styles of citation: [cite/a:], [cite/n:], or [cite/t:]. I suppose one could accomplish the same with a modification of the current proposal, something like the following:

[-@Key] /* nocite (for inclusion in the printed bibliography) */
[+@Key] /* in text cite (Smith (pp. 21-23)) */
[/@Key] /* author name citation (Smith) */

Perhaps this makes more sense for djot?

kmaasrud commented 1 year ago

[-@Key] /* nocite (for inclusion in the printed bibliography) */

@NotAFedoraUser that is very clever! Along with +@, I think that should be sufficient for most use-cases. However, the [<some-punctuation>@<key>] scheme leaves room for a lot of flexibility down the road---if more citation variants are requested.
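
A rough Rust sketch of the `[<some-punctuation>@<key>]` scheme, using only the three markers floated above. `Variant` and `split_variant` are hypothetical names, and the marker set is just the one proposed in this thread, not a settled design.

```rust
/// Variants floated in this thread; the enum itself is hypothetical.
#[derive(Debug, PartialEq)]
enum Variant {
    Default,      // [@Key]
    NoCite,       // [-@Key]: bibliography entry only
    AuthorInText, // [+@Key]
    AuthorOnly,   // [/@Key]
}

/// Takes the text inside the brackets and returns the variant plus the
/// remainder starting at '@'. Returns None when the marker is not
/// recognised or no '@' follows it.
fn split_variant(inner: &str) -> Option<(Variant, &str)> {
    let mut chars = inner.chars();
    let first = chars.next()?;
    let (variant, rest) = match first {
        '@' => (Variant::Default, inner),
        '-' => (Variant::NoCite, chars.as_str()),
        '+' => (Variant::AuthorInText, chars.as_str()),
        '/' => (Variant::AuthorOnly, chars.as_str()),
        _ => return None,
    };
    if rest.starts_with('@') {
        Some((variant, rest))
    } else {
        None
    }
}
```

Adding a further variant later would mean one more enum case and one more match arm, which is the flexibility being pointed at here.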

bpj commented 1 year ago

@jgm :

at the price of verbosity and English-language keywords

I hope both will be avoided!

bdarcus commented 1 year ago

Just came across djot; cool!

@jgm - I just thought I'd remind you about one wrinkle we stumbled on in org-cite development, which is the question of whether a local variant is a property of the citation as a whole (where we came down with org-cite), or the individual citation-reference (as it is in pandoc).

E.g. what happens if you have more than one reference in a citation with your proposed examples (the first example being where the author lists differ, and the second where they don't)?

[@foo, p. 15;+@bar]
[@foo1, p. 15;+@foo2]

jgm commented 1 year ago

@bdarcus the proposal floated above was to use + for author-in-text citations. The thought was that it would go at the beginning of the citation list, thus

[+@foo, p. 15; @bar]

which would be equivalent to pandoc's

@foo [p. 15; @bar]

I hadn't envisioned allowing it to be put on subsequent items, and I'm not sure what sense that would make. Maybe I haven't grasped your thought here.
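
As a sketch of that reading, the following Rust code treats `+` as a flag on the citation as a whole and rejects it on later items. The types and the function `parse_citation` are hypothetical; it also assumes each item starts with `@` and that a comma separates the key from its suffix, so prefixes are not handled.

```rust
#[derive(Debug, PartialEq)]
struct CiteItem {
    key: String,
    suffix: String,
}

#[derive(Debug, PartialEq)]
struct Citation {
    /// Set once, by a '+' directly after the opening bracket.
    author_in_text: bool,
    items: Vec<CiteItem>,
}

fn parse_citation(input: &str) -> Option<Citation> {
    let inner = input.strip_prefix('[')?.strip_suffix(']')?;
    let (author_in_text, list) = match inner.strip_prefix('+') {
        Some(rest) => (true, rest),
        None => (false, inner),
    };
    let mut items = Vec::new();
    for part in list.split(';') {
        // Assumption: every item starts with '@', so "+@bar" in a later
        // position is rejected here.
        let part = part.trim().strip_prefix('@')?;
        let (key, suffix) = match part.split_once(',') {
            Some((k, s)) => (k.trim(), s.trim()),
            None => (part, ""),
        };
        if key.is_empty() {
            return None;
        }
        items.push(CiteItem {
            key: key.to_string(),
            suffix: suffix.to_string(),
        });
    }
    Some(Citation { author_in_text, items })
}
```

With this sketch, `[+@foo, p. 15; @bar]` parses as one author-in-text citation with two items, while `[@foo, p. 15;+@bar]` is not accepted.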

bdarcus commented 1 year ago

@jgm - in that case, I think I misunderstood, and it's a property of the citation as a whole, which I think is right.

bdarcus commented 1 year ago

One other difference between org-cite (and biblatex) and pandoc: it has two levels of affixes; one for the citation, and another for the citation-references.

It's useful when you have a multi-cite, and a style may sort the references within the citation.

[cite:see ;@doe22;@doe20, ch. 2]

So presumably in djot, it could just be:

[see ;@doe22;@doe20, ch. 2]

jgm commented 1 year ago

Yes, I think that would be a good approach. However, citeproc doesn't currently support two levels of affixes, so I don't know what we'd do with this.

bdarcus commented 1 year ago

Maybe a simple heuristic to flatten them (like merge with the affix of the nearest reference affix?), and later add support to citeproc as time and interest allow?
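
One possible version of such a flattening heuristic, sketched in Rust. It assumes "nearest" means the global prefix is merged into the first item's prefix and the global suffix into the last item's suffix. All type and function names are hypothetical.

```rust
#[derive(Debug, Clone, PartialEq)]
struct CiteItem {
    prefix: String,
    key: String,
    suffix: String,
}

/// Two-level citation: affixes on the whole citation and on each item.
struct Citation {
    prefix: String,
    items: Vec<CiteItem>,
    suffix: String,
}

/// Joins two affix fragments with a single space, skipping empty ones.
fn join(a: &str, b: &str) -> String {
    match (a.trim(), b.trim()) {
        ("", b) => b.to_string(),
        (a, "") => a.to_string(),
        (a, b) => format!("{} {}", a, b),
    }
}

/// Flattens to a one-level model: the global prefix is merged into the
/// first item, the global suffix into the last item.
fn flatten(citation: Citation) -> Vec<CiteItem> {
    let mut items = citation.items;
    if let Some(first) = items.first_mut() {
        first.prefix = join(&citation.prefix, &first.prefix);
    }
    if let Some(last) = items.last_mut() {
        last.suffix = join(&last.suffix, &citation.suffix);
    }
    items
}
```

Note that this loses the information that "see" applied to the whole list, so a processor that re-sorts the items could still move the prefix along with the first item.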

You may already have to do something similar when dealing with org-cite?

bdarcus commented 1 year ago

Is this issue pretty much resolved; just needs to be implemented?

And maybe also relies on #35?

I've been working on a project I have been planning from the beginning to integrate with this once it's available.

https://github.com/bdarcus/csl-next

ATM, I have my own AST, which is basically the new style input template model enhanced with rendered data (current example bibliography reference below), but I'm hoping it should be pretty easy to integrate with djot; both for document processing as a whole, and also to allow djot markup within field strings.

  [
    [ { contributors: "author", procValue: "Doe, Jane" } ],
    {
      date: "issued",
      format: "year",
      wrap: "parentheses",
      procValue: "2023b"
    },
    [ { title: "title", procValue: "The Title" } ],
    undefined,
    undefined
  ]

jgm commented 1 year ago

I wouldn't call it resolved! There are still a lot of choice points.

bdarcus commented 1 year ago

About the citation model/syntax itself, or other related issues?

jgm commented 1 year ago

the former

bdarcus commented 1 year ago

the former

So what are those outstanding questions?

I suppose one, that you may or may not have been thinking about, is locators: string + string parsing (as with the pandoc syntax and most current other examples), vs more structured.

For the project I'm working on, I just merged this, which actually isn't too bad in YAML:

suffix: [see, page: 23, section: V]

But I guess the pandoc optional brackets basically is the same.

I guess another, that came up with org-cite, is where to allow markup within the citation?

jgm commented 1 year ago

There are lots of questions. Do we want to support a huge range of variants like org? If so, how do we do that without English language keywords? How are prefixes and suffixes handled? How are locators handled? Do we use localized locator labels as in pandoc? How are locators distinguished from other suffix content? I don't have a lot of time right now to work on this, but this should give some idea.

bdarcus commented 1 year ago

Note: I edited this a bit much later to add something I missed earlier on affixes.

Since I'm thinking about and working on this area ATM, my thoughts:

Do we want to support a huge range of variants like org?

This is indeed the big question, since it's hard to reverse later.

My impulse is to say no, and just have two styles/commands; what in the academic literature on this are called:

  1. integral: AKA citet, textcite, narrative citations.
  2. non-integral: AKA citep, parenthetical citations.

These notions are very general, more so than in the TeX world, and for that reason should go fairly far.

EDIT: the caveat is some of the variants in the LaTeX world are for handling capitalization, which the above would not.

EDIT: Implementing the citation model now; here's for now how I'm dealing with this.

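// `#[default]` on a variant only compiles together with `derive(Default)`.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub enum CitationModeType {
    /// Places the author inline in the text; also known as "narrative" or "in text" citations.
    Integral,
    /// Places the author in the citation and/or bibliography or reference entry.
    #[default]
    NonIntegral,
}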

But I could also see:

If so, how do we do that without English language keywords?

Do something like org-cite, but use single characters. But that has its own trade-offs.

How are prefixes and suffixes handled?

I think you're referring to this above?

https://github.com/jgm/djot/issues/32#issuecomment-1430181965

In any case, yes, this is another decision point: affixes only on individual citation references (as in pandoc), or also on the citation as a whole (as in org-cite and biblatex).

Per my comment there, I'd prefer the latter, because the cost is low, and the benefit in terms of flexibility for users high.

How are locators handled? Do we use localized locator labels as in pandoc? How are locators distinguished from other suffix content?

In my in-progress project (where I'm now focusing on a Rust implementation; I just haven't done the citation part yet), here are the TypeScript definitions for locators.

export type Locator = Partial<Record<LocatorTerms, string>> | string;

type LocatorTerms =
  | "book"
  | "chapter"
  | "column"
  | "figure"
  | "folio"
  | "number"
  | "line"
  | "note"
  | "opus"
  | "page"
  | "paragraph"
  | "part"
  | "section"
  | "sub-verbo"
  | "verse"
  | "volume";

In YAML:

suffix: [see, page: 23, section: V]

But that's a format more for machines; not humans. E.g. it's what the djot markup might be converted into.

This is another tricky area; my impulse is just to do what you've done in pandoc.

Do you see any glaring problems with that?

jgm commented 1 year ago

The pandoc way has worked pretty well. There are occasional requests for more expressive power, but it seems enough for most users.
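
For readers unfamiliar with the string-based approach, here is a small Rust sketch of splitting a suffix into a locator and the remaining text. It is not pandoc's actual algorithm: the label table is a tiny illustrative subset, and the rule that a locator value is the first token and must start with a digit is an assumption made for this example.

```rust
/// Tiny illustrative label table (an assumption for this sketch).
const LABELS: &[(&str, &str)] = &[
    ("pp.", "page"),
    ("p.", "page"),
    ("chap.", "chapter"),
    ("ch.", "chapter"),
    ("sec.", "section"),
];

/// Splits a suffix such as "pp. 21-23 and passim" into an optional
/// (locator term, locator value) pair and the remaining suffix text.
fn split_locator(suffix: &str) -> (Option<(&'static str, String)>, String) {
    let s = suffix.trim();
    for (label, term) in LABELS {
        if let Some(rest) = s.strip_prefix(*label) {
            let rest = rest.trim_start();
            // Assumption: the value is the first whitespace-delimited token
            // and must start with a digit.
            let end = rest.find(char::is_whitespace).unwrap_or(rest.len());
            let value = &rest[..end];
            if value.starts_with(|c: char| c.is_ascii_digit()) {
                let remainder = rest[end..].trim().to_string();
                return (Some((*term, value.to_string())), remainder);
            }
        }
    }
    // No recognised label: everything is ordinary suffix content.
    (None, s.to_string())
}
```

A suffix with no recognised label, such as "emphasis mine", comes back unchanged as plain suffix text, which is one simple answer to the question of how locators are distinguished from other suffix content.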

kmaasrud commented 1 year ago

[...] but it seems enough for most users.

Based on my personal experience of academic writing, I concur. The less complexity, the better; that'll keep it simpler for implementors.

gfarrell commented 7 months ago

For my own purposes, I started adding the citation format specified in this issue into my own djoths fork.

  1. Parsing is fine, rendering to HTML is fine (ish, one question below), but the bit I'm stuck at is: do you think the references have to be contained in the source text? I'm thinking of how, for example, you can have a LaTeX document with a separate BibTeX file. That engenders two pathways: djot implementations have to be able to specify an input map of references (or a references file) OR there has to be additional syntax for specifying references as part of the djot specification.

  2. On rendering to HTML (both for the bibliography and the inline citation itself), is it better to have a standardised output as described above (e.g. either "author-in-text" or "author-in-parentheses"), or would it be better to allow the user to specify a CSL stylesheet (perhaps with a default stylesheet) which, sadly, would mean another external input to the djot implementations?

I know it's quite possible that something will block this from making it into the djot spec any time soon, but I thought I'd ask given that I am implementing anyway, and maybe that implementation will make it into djoths when the spec gets updated, so I'd rather do this semi-informed than 0% informed.

jgm commented 7 months ago

For parsing, we just need to specify the syntax of citations and a corresponding AST element.

For rendering: that's a matter of what we do with the citations. Here djot itself could be neutral, but I think the most powerful thing to do would be what pandoc does: use a citeproc processor to create citations and bibliography using a CSL stylesheet and external references. (Here in a Haskell implementation you could simply use my citeproc library.)

jgm commented 7 months ago

Re providing a way to put citations inside the document itself: pandoc does allow this, in a references field in metadata. So this interacts with the metadata issue.

bdarcus commented 7 months ago

Random quick thoughts:

Here djot itself could be neutral, but I think the most powerful thing to do would be what pandoc does: use a citeproc processor to create citations and bibliography using a CSL stylesheet and external references.

The advantage of that is that, like djot, CSL is agnostic about output format. So it's a good match.

I guess the question is how closely and formally they are tied.

Someone that primarily targets LaTeX might want to bypass CSL and use bibtex/biblatex.

Also, I do have ambitions of finishing my CSLN project and hooking it up to djot, so hopefully there's room for that sort of alternative.

jgm commented 7 months ago

I don't think specifying a syntax for citations (and perhaps reference lists) requires tying djot to any particular mode of rendering citations. Pandoc's citations, for example, can be rendered using CSL or natbib or biblatex or org-cite, depending on command line options.

bdarcus commented 6 months ago

@gfarrell:

I know likely premature ATM, but since you've been working on it ...

I started adding the citation format specified in this issue into my own djoths fork.

Was looking at the test cases, and just wondering about one design question we hadn't settled.

From what I can tell, your implementation follows the pandoc way; no global affixes?

The concrete question that issue raises is what happens if you have a citation like this, and the citation processor is using a style that requires reordering the references within the citation by date issued?

[see @doe24; @doe20]

Without global affixes, you either end up with something like this (which is simply wrong; the author here is intending to list multiple references to "see"):

(Doe, 2020; see Doe 2024)

... or you require the user to track that order, AND adjust it if the citation style changes.

So in org-mode (which is an iteration of the pandoc model and syntax), for example, you would do:

[cite: see; @doe24; @doe20]

My argument has been it's a niche feature important in some fields (notably in the humanities and social sciences), but that adding it to djot is low-cost for users and developers alike.

Regardless, you probably want to include some prefixes in the test cases?

jgm commented 6 months ago

@bdarcus citeproc doesn't have a notion of global affixes, does it? (by the way, the way citeproc-hs handles this is just by blocking re-ordering around an affix; at least that prevents misleading things from appearing.)

bdarcus commented 6 months ago

@jgm:

citeproc doesn't have a notion of global affixes, does it?

You mean citeproc-js?

It does not.

It's an iteration that made it into org-cite. So it's supported in the org citeproc-el integration.

(by the way, the way citeproc-hs handles this is just by blocking re-ordering around an affix; at least that prevents misleading things from appearing.)

I hadn't thought of that, but that might be a reasonable alternative.
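
One way to read "blocking re-ordering around an affix" is sketched below in Rust. This is an interpretation for illustration, not the actual citeproc code: an item with a prefix stays where the author put it, and only runs of prefix-free items between such barriers are sorted. The types, the `year` sort key, and the function names are hypothetical, and suffixes are ignored for brevity.

```rust
#[derive(Debug, Clone, PartialEq)]
struct CiteItem {
    prefix: String,
    key: String,
    year: u32,
}

/// Small constructor to keep examples short.
fn item(prefix: &str, key: &str, year: u32) -> CiteItem {
    CiteItem {
        prefix: prefix.to_string(),
        key: key.to_string(),
        year,
    }
}

/// Sorts by year, but never moves an item across one that carries a prefix.
/// Prefixed items keep their position and act as barriers.
fn sort_between_affixes(items: &mut [CiteItem]) {
    let mut start = 0;
    for i in 0..=items.len() {
        // A prefixed item, or the end of the list, closes the current run.
        if i == items.len() || !items[i].prefix.is_empty() {
            items[start..i].sort_by_key(|it| it.year);
            start = i + 1;
        }
    }
}

/// Returns the keys in their current order, for inspection.
fn keys(items: &[CiteItem]) -> Vec<&str> {
    items.iter().map(|it| it.key.as_str()).collect()
}
```

For `[see @doe24; @doe20]` this leaves the order as written, so "see" stays at the front instead of ending up in the middle of the list.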