jgm / pandoc

Universal markup converter
https://pandoc.org

New Feature: internal links to tables and figures and headers #813

Open GeraldLoeffler opened 11 years ago

GeraldLoeffler commented 11 years ago

It's currently possible to include internal links to sections. I'd like to propose a similar feature for links to figures/images and tables.

It may make sense to provide this feature only if the figure/image or table that is being linked to has a caption. In that case Pandoc can today automatically generate a number for the figure or table and include it in the caption, e.g. "Figure 15".

At the most basic, the text of the link would be provided by the user, as is currently the case for links to sections.

Of course it would be very convenient if the automatically generated number for the figure or table would also be used for the text of the link, e.g. "as can be seen in Figure 15, blah", where "Figure 15" would be the internal link whose text is auto-generated from the figure it points to.

jgm commented 8 years ago

+++ blindmelon [Dec 04 15 08:36 ]:

@lierdakil Just weighing in quickly on the matter of references. Reference attributes would make it nice and easy to reference, e.g., the section a table is in rather than the table itself, or the controversial pageref. Best to make the deep changes now so that it is just a matter of making changes in the readers/writers later on.

A Div container already has an id and arbitrary key/value attributes. So, why isn't that enough?

The id tells you unambiguously which element you mean to refer to. Providing a way to get a section number or page number corresponding to that element is another matter -- but I don't see how a dedicated Figure type would make a difference for that.

jgm commented 8 years ago

Sorry, @blindmelon, I see that you were talking about a separate Reference element rather than a Figure element. I think it would be great to have the power of LaTeX's labels and references. But it would require some deeper changes -- for example, section numbering would have to be integral to the document model rather than (as now) just added by writers.

Anyway, it still strikes me that the issue of a dedicated Figure element and the issue of a dedicated Reference element are separate issues (either could be done independently of the other).

lierdakil commented 8 years ago

@jgm, not sure I follow. Why? pandoc-crossref makes do without that well enough. Sure, it could be better, but it's not a prerequisite, I think.

jgm commented 8 years ago

@lierdakil - sure, I suppose the filter (or whatever) can always reconstruct the section numbering. But one issue is this. If it's not part of the AST that the sections are numbered, then we might end up with references to numbered sections even though section numbers aren't printed. That's awkward, at least!
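To illustrate what "reconstructing the section numbering" in a filter amounts to, here is a minimal Python sketch. It assumes the filter has already collected the header levels in document order; the function name `number_sections` is hypothetical and this is not how pandoc or pandoc-crossref actually implement it.

```python
def number_sections(levels):
    """Return dotted section numbers for header levels given in document order."""
    counters = []  # counters[i] holds the current count for level i + 1
    numbers = []
    for level in levels:
        # Pad with zeros if a level is skipped (e.g. level 1 followed by level 3).
        while len(counters) < level:
            counters.append(0)
        # A new header at this level discards the counters of all deeper levels.
        del counters[level:]
        counters[level - 1] += 1
        numbers.append(".".join(str(n) for n in counters))
    return numbers
```

Note that nothing here knows whether the writer will actually print section numbers, which is exactly the mismatch described above.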

jgm commented 8 years ago

+++ aaren [Dec 04 15 08:25 ]:

Having Figure Attr [Block] [Block] does feel a bit redundant when we already have Div Attr [Block]. Why not just treat the first Para as the caption? I suppose the Figure caption can have completely arbitrary content (Figures all the way down!), rather than just Para.

One way around this would be to treat the first Para or BlockQuote as the caption. If you wanted multiple blocks, you could put it in a BlockQuote (and the surrounding blockquote would not be part of the caption).

Assuming something like the colon syntax for divs,

::::::: {.figure}

This is my caption.

It has two paragraphs.

The image - this now becomes regular alt text
Second image
:::::::
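A rough Python sketch of the "first Para or BlockQuote is the caption" rule, written against pandoc's JSON representation of blocks (dicts with `"t"` and `"c"` keys). The function name `split_caption` is made up for illustration, and the rule itself is only the proposal from this comment, not existing pandoc behaviour.

```python
def split_caption(blocks):
    """Split the blocks of a figure Div into (caption_blocks, body_blocks)."""
    if not blocks:
        return [], []
    first, rest = blocks[0], blocks[1:]
    if first["t"] == "Para":
        # A single leading paragraph is the whole caption.
        return [first], rest
    if first["t"] == "BlockQuote":
        # The quote only groups the caption blocks; it is not part of the caption.
        return first["c"], rest
    # Anything else: no caption, everything is figure content.
    return [], blocks
```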

aaren commented 8 years ago

@jgm isn't that (using numbering or not) the user's problem? If there is no numbering then either the user puts a link over [whatever text they want to use](#ref) or a filter does something clever and puts in 'the figure above' or similar.

lierdakil commented 8 years ago

@jgm, another option would be playing with html5 figure/caption convention:

:::::::{.figure}
:::::::::::::{.caption}
Arbitrary block elements
:::::::::::::
![Figures all the way](img.png)
![Etc](img2.jpg)
:::::::

This way, it doesn't matter where this caption is placed, before after or in the middle of things, it's still a caption.
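The position-independent variant could be sketched like this in Python, again on pandoc's JSON blocks, where a Div is `{"t": "Div", "c": [[id, classes, keyvals], blocks]}`. The helper name `extract_caption` is hypothetical, and treating a `.caption` class as special is the convention proposed here, not something pandoc does.

```python
def extract_caption(figure_blocks):
    """Pull the contents of any Div with class 'caption' out of a figure's blocks."""
    caption, body = [], []
    for block in figure_blocks:
        if block["t"] == "Div" and "caption" in block["c"][0][1]:
            # Caption Divs may appear before, after or between the images.
            caption.extend(block["c"][1])
        else:
            body.append(block)
    return caption, body
```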

ghost commented 8 years ago

@jgm I see what you mean. I am mostly working with odt and LaTeX output (markdown input, otherwise what's the point?), both of which can do the numbering side of things. My sections, for example, are numbered using styles in my reference odt. At the end of the day I think the writer needs to deal with inappropriate reference types in the input however is most suitable for the output format.

jgm commented 8 years ago

If we think about references, there are many kinds of numbered things one might want to refer to:

see Example (14)
see Equation 15
see footnote 6
see Table 3.1
see Figure 5
see Section 3.2.1
see Code Sample 14 on p. 13

Currently pandoc supports only one of these -- the first, through numbered example lists. Ideally, we could support all or most of them. The questions we need to ask are:

  1. What would be good Markdown syntax, both for the references and for the labels? Presumably it would make most sense for the labels to be identifiers, which can already be applied to most anything (either directly or through a Span or Div). But it's less clear what the references should look like. How do we mark whether a reference is to a figure, a table, an enclosing section, etc.? (We want to avoid using English words, and we want to avoid looking like Perl; it should look natural and readable in plain text.)
  2. How could the references be implemented in the underlying types?
  3. Won't we need "counters" or numbers in the underlying types? Currently figures get numbers in LaTeX/PDF output, because that's the default, but not in many other formats. And pandoc doesn't directly control the way they are numbered. So in assigning numbers we're shooting in the dark, unless we control the numbering ourselves. And that adds a lot of additional complexity (think of the various things that need numbers, and the various numbering schemes -- e.g. figure numbers that start with a section number).
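To make the complexity in point 3 concrete, here is a small Python sketch of counters that can either run through the whole document or restart in each section and carry the section number as a prefix. The class name `Counters` and its methods are invented for illustration; nothing like this exists in pandoc's types.

```python
from collections import defaultdict


class Counters:
    """Per-kind counters, optionally reset at every section."""

    def __init__(self, per_section=False):
        self.per_section = per_section
        self.section = 0
        self.counts = defaultdict(int)

    def new_section(self):
        self.section += 1
        if self.per_section:
            # "Figure 2.1" style: every kind restarts in a new section.
            self.counts.clear()

    def next(self, kind):
        self.counts[kind] += 1
        n = self.counts[kind]
        return f"{self.section}.{n}" if self.per_section else str(n)
```

Even this toy version needs a policy decision (reset or not) per kind of numbered thing, which hints at how much configuration a full solution would need.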

lierdakil commented 8 years ago

@jgm

  1. I mostly write in Russian. Russian layout lacks keys for braces {} and brackets []. So from my point of view, pseudo-english identifiers make about as much sense as everything else. With pandoc-crossref, I went with prefixes to identify item classes, e.g. fig: for figures, eq: for equations etc. Mind you, that's not required -- in general, you know what element is referenced, so you generally know what kind of element that is.
  2. Well, again, in pandoc-crossref, I leveraged existing citation elements to represent references. Pros being I could do something like as shown in [@fig:one_figure; @fig:other_figure; @fig:related_figure]. Cons being it could clash with citation identifiers. I would argue that separate type with similar semantics would be preferable, but citations actually do work fine.
  3. That's a tough question. Obvious answer is "just delegate this to writer", and it's a compelling one. For one, some output formats (LaTeX, Word and OpenDocument off the top of my head) have their own counters, which could be used.

At least that's my take on it.
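The prefix convention from points 1 and 2 can be sketched in a few lines of Python. This works on the raw Markdown text rather than on pandoc's citation elements, the function name `parse_reference_group` is hypothetical, and the `KINDS` table is an illustrative subset, so treat it as a sketch of the idea rather than pandoc-crossref's implementation.

```python
import re

# Illustrative prefix table (an assumption, not a complete list).
KINDS = {"fig": "figure", "eq": "equation", "tbl": "table", "sec": "section"}


def parse_reference_group(text):
    """Parse '[@fig:a; @tbl:b]' into a list of (kind, label) pairs."""
    pairs = []
    for label in re.findall(r"@([\w:-]+)", text):
        prefix = label.split(":", 1)[0] if ":" in label else None
        # Labels without a known prefix get kind None; they may be real citations.
        pairs.append((KINDS.get(prefix), label))
    return pairs
```

The `None` case is where the clash with ordinary citation identifiers shows up: the parser cannot tell `@smith2015` from an unprefixed label.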

ghost commented 8 years ago

we could support all or most of them. The questions we need to ask are:

What would be good Markdown syntax, both for the references and for the labels? Presumably it would make most sense for the labels to be identifiers, which can already be applied to most anything (either directly or through a Span or Div). But it's less clear what the references should look like. How do we mark whether a reference is to a figure, a table, an enclosing section, etc.? (We want to avoid using English words, and we want to avoid looking like Perl; it should look natural and readable in plain text.)

I currently deal with tables and figures by post processing the output odt (using sed, python and perl all glued together by zsh). The syntax I use in markdown, which I pinched from this thread I think, is {#identifier} either after the table caption or before the image alt (it is what fit the syntax best - although I'm not a fan of the image alt being the caption, tbh). When I want to reference the table or figure I use [#identifier]. I tend to prefix my table identifiers with "T" and figures with "F", which has the advantage of making it stand out if my post-processing goes wonky. It isn't really necessary.

I am strongly opposed to pandoc doing itself what the output format can already do. If the output format has a way of dealing with numbering, it should be used so that the document can be worked on later by non-pandoc users without torturing them. For that reason, I think most of the work naturally goes to the writer. For output formats that don't have numbering (plain-text-ish formats, for example), the writer can just grab a number from an iterator (if they are called that in Haskell) and possibly transform it according to convention/user requirements.

mb21 commented 8 years ago

One way around this would be to treat the first Para or BlockQuote as the caption. If you wanted multiple blocks, you could put it in a BlockQuote (and the surrounding blockquote would not be part of the caption).

I think that's an interesting idea. But it begs the question: isn't that similar to the hack we already employ for image figures?

tomduck commented 8 years ago

Reminder: there are filter-based solutions that can be used while this issue gets worked out. The following implement numbering and references using the syntax advocated by @scaramouche1:

They are Python-based and easy to use. Alternatives are provided above by @aaren and @lierdakil.

Note: pandoc-fignos has been updated to work with the new figure attributes syntax that will appear in pandoc 1.16.

beinvest commented 8 years ago

Now that pandoc 1.16 is out, is this a bug, or does it point to my misunderstanding of the new link_attributes extension?

Converting ![My caption](myfigure.png){#fig:myfigure} from Markdown to LaTeX, I would have expected

\begin{figure}[htbp]
\centering
\includegraphics{myfigure.png}
\caption{My caption}
\label{fig:myfigure}
\end{figure}

but instead the figure id/label is ignored.
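The expected mapping from image attributes to LaTeX can be written down as a small Python sketch. The function `latex_figure` is hypothetical and only mirrors the output quoted above; it does no escaping of LaTeX special characters and is not the LaTeX writer's code.

```python
def latex_figure(src, caption, identifier=None):
    """Render a figure environment, adding \\label only when an id is present."""
    lines = [
        r"\begin{figure}[htbp]",
        r"\centering",
        rf"\includegraphics{{{src}}}",
        rf"\caption{{{caption}}}",
    ]
    if identifier:
        # The identifier from {#fig:myfigure} becomes the LaTeX label.
        lines.append(rf"\label{{{identifier}}}")
    lines.append(r"\end{figure}")
    return "\n".join(lines)
```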

mb21 commented 8 years ago

@beinvest good point, must have overlooked that back in the day when I did the image sizes ;) fixed in https://github.com/jgm/pandoc/pull/2637

beinvest commented 8 years ago

@mb21 Thanks for the help and your work!!

lierdakil commented 8 years ago

So, I mused on this for a bit, and here are some questions and ideas, in no particular order

  1. Do we want to add attributes (i.e. identifiers) to all elements that can be referenced? Images and sections have those already. But what about tables/equations/etc? Could we use divs/spans? Should we?
  2. Related to (1), if we go with divs/spans, would it be a good idea to add "implicit spans" extension, that would wrap any (or some) inline element followed by attribute specification in a span? It's an easy change to Markdown parser. A shorthand for divs that wrap a single Block element could also be a good idea.
  3. After some thought, I think a dedicated reference element/syntax is a must. [#id] seems nice, and semantics should be basically the same as Citation. It should probably be possible to reuse code that parses citations for references, with little effort.
  4. In general, it should be possible to determine referenced element type based on heuristic. There should be a way to override it though. I suppose a key-value attribute should do the trick here (e.g. {#someid ref-type=figure}). Classes are a tougher sell IMO, although syntax is a little cleaner.
  5. For numbering, I think that should mostly be delegated to the writer. For formats that don't support that, a filter similar to pandoc-citeproc could be employed. That said, even HTML supports counters nowadays, so this would only be relevant for plain text formats or cases where native numbering is suboptimal for some reason (I shudder at the thought of pains I endured fighting MS Word counters, so from my point of view, this is definitely a use-case to consider). A filter could also be a good option for a transitional period, while writer code is catching up. Not to sound like shameless self-promotion, but pandoc-crossref could be easily repurposed for this.
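Point 4 (a heuristic with a key-value override) might look like this in Python, operating on pandoc's JSON blocks. The `ref-type` key comes from the example above; the heuristic table, the function names, and the assumption that Table carries its Attr first (true only for newer pandoc-types) are all illustrative.

```python
# Hypothetical default mapping from block type to reference type.
HEURISTIC = {"Header": "section", "Table": "table", "CodeBlock": "listing"}


def block_attr(block):
    """Return the [id, classes, keyvals] triple of a block, or an empty one."""
    tag, content = block["t"], block.get("c")
    if tag == "Header":
        return content[1]
    if tag in ("Div", "CodeBlock", "Table"):
        return content[0]
    return ["", [], []]


def reference_type(block):
    """Guess the kind of a labelled block, letting ref-type override the guess."""
    overrides = dict(block_attr(block)[2])
    if "ref-type" in overrides:
        return overrides["ref-type"]
    if block["t"] == "Div" and block["c"][1]:
        # A wrapper Div inherits the type of the first block it contains.
        return reference_type(block["c"][1][0])
    return HEURISTIC.get(block["t"])
```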

edusantana commented 8 years ago

Just a contribution for a workaround while this issue is open: http://tex.stackexchange.com/questions/139106/referencing-tables-in-pandoc

sjackman commented 7 years ago

@lierdakil pandoc-crossref works great! Thanks for this work!

jgm commented 7 years ago

I'm adding the pandoc-2.0 milestone so we at least think about whether to add some of these features to standard pandoc. (I'm using pandoc-crossref now and it works very well indeed.)

ibutra commented 7 years ago

It would be nice though if either pandoc or pandoc-crossref support the auto-identifiers.

lierdakil commented 7 years ago

@ibutra, not sure what you mean by 'auto-identifiers' exactly.

ibutra commented 7 years ago

Manual: The second entry, named auto_identifiers, is what I mean: basically the identifier pandoc assigns by default for referencing if none is given manually.

mangecoeur commented 7 years ago

@lierdakil I think @ibutra is referring to how Pandoc can auto generate section reference tags from the heading text (crossref already supports it for headings, with caveats).

I can see the appeal, for example I end up following the pattern:

![Plot text](../fig/plot_filename){#fig:plot_filename}

It could be an idea to generate a tag fig:plot_filename if one isn't explicitly given. Might be a bit unnecessary though (I just added an editor snippet to generate the pattern) but on the other hand, why not?

jgm commented 7 years ago

In headers, the identifiers are generated from the header text. The analogue in a figure or table would be to generate them from the caption text -- but this is likely to be too long and cumbersome. Still, it might not hurt to generate them; one always has the ability to specify an identifier manually.

ibutra commented 7 years ago

I specifically meant the headers though the same feature for figures and tables would be nice too.

What I didn't know @mangecoeur is that pandoc-crossref already supports this?

lierdakil commented 7 years ago

@ibutra, from https://github.com/lierdakil/pandoc-crossref#section-labels

You can also use autoSectionLabels variable to automatically prepend all section labels (automatically generated with pandoc included) with "sec:". Bear in mind that references can't contain periods, commas etc, so some auto-generated labels will still be unusable.

Generating labels for figures/tables/other has another drawback. Right now, the default behavior in pandoc-crossref is to ignore unlabelled elements (since this is least intrusive), so

![Caption](image) 

will be an unnumbered (or rather, unprocessed) figure.

This kind of behavior is useful for informal writing, when you don't need to number the figures you're not referencing. It is also useful for running pandoc-crossref on documents that don't need cross-referencing at all, e.g. from an automated script.

@jgm, for figures, a better (more concise) source of auto identifiers is probably not a title, but a filename (or rather, basename). Tables and listings are another matter, and I don't think it's feasible for math.
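A sketch of deriving a figure identifier from the file's basename, in Python. The function name `figure_identifier` and the choice to replace punctuation with underscores are assumptions made for illustration (the latter motivated by the note above that references can't contain periods, commas etc.); neither pandoc nor pandoc-crossref is claimed to do this.

```python
import posixpath
import re


def figure_identifier(src, prefix="fig:"):
    """Build an identifier such as 'fig:plot_filename' from an image path."""
    base = posixpath.basename(src)
    stem = posixpath.splitext(base)[0]  # drop the extension, if any
    # Replace runs of anything but word characters and hyphens with '_'.
    slug = re.sub(r"[^\w-]+", "_", stem).strip("_")
    return prefix + slug if slug else None
```

Two images with the same basename in different directories would collide, so a real implementation would still need a uniqueness pass like the one pandoc applies to header identifiers.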

LivInTheLookingGlass commented 7 years ago

For RST the syntax should be much easier. Just use the already-available name field:

.. figure:: image.png
    :name: example
    :alt: an image

    This is the caption

ghost commented 7 years ago

see Example (14) ... see Figure 5

FWIW, the Markdown should not include the caption type text (e.g., "Equation", "Table", "Figure") as that is presentation logic. That is, without changing the source, it should be possible to replace "Figure" with "Illustration" throughout the output document.

Here are a few others, which suggests that the solution should be caption type agnostic. The complete set of possible captions is fairly long and we probably shouldn't try to restrain the syntax to a particular subset as some could get missed, such as:

see Listing (14)
see Algorithm 5

Thus with the text, As seen in Figure @fig:force, the word "Figure" is redundant (the @fig already signifies the caption is a figure). With that particular syntax, As seen in @fig:force allows the rendering component (e.g., LaTeX, ConTeXt, etc.) to determine what caption type text to inject, if any.

sjackman commented 7 years ago

The above is also helpful when referencing multiple items, for example

As shown in @fig:a;@fig:b => As shown in Figures 1 and 2

and ranges

As shown in @fig:a;@fig:b;@fig:c => As shown in Figures 1–3
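The grouping behaviour described here (pluralised prefix, runs of three or more collapsed into a range) can be sketched in Python as follows. The function name `render_references` and its defaults are hypothetical; the prefix words are parameters precisely so that "Figure" can be swapped for "Illustration" without touching the source.

```python
def render_references(numbers, singular="Figure", plural="Figures"):
    """Render resolved numbers as 'Figure 1', 'Figures 1 and 2' or 'Figures 1–3'."""
    nums = sorted(set(numbers))
    if not nums:
        return ""
    # Group consecutive numbers into (start, end) runs.
    runs = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n != prev + 1:
            runs.append((start, prev))
            start = n
        prev = n
    runs.append((start, prev))
    parts = []
    for lo, hi in runs:
        if lo == hi:
            parts.append(str(lo))
        elif hi == lo + 1:
            parts.extend([str(lo), str(hi)])  # a pair is listed, not ranged
        else:
            parts.append(f"{lo}\u2013{hi}")   # en dash for ranges
    name = singular if len(nums) == 1 else plural
    joined = parts[0] if len(parts) == 1 else ", ".join(parts[:-1]) + " and " + parts[-1]
    return f"{name} {joined}"
```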

mangecoeur commented 7 years ago

Hopefully, if this is built into core pandoc the docx could gain the ability to output 'real' reference fields (using the office xml reference tags). This would allow you to post-process fields in Word, for example to generate tables of figures and tables of tables (Word can generate these when caption fields are used).

Hipomenes commented 7 years ago

Here it goes again... How does one cross-reference figures in Pandoc?

Thanks!

iandol commented 7 years ago

For the moment you should use filters, either pandoc-crossref (installs via homebrew if you use a Mac: brew install pandoc-crossref) or pandoc-fignos (you need a working python install). Personally I do all my writing in Scrivener, which has its own crossref system that outputs to Pandoc, so I don't use these myself.

petterreinholdtsen commented 6 years ago

It would be great if pandoc by default would support adding image/figure IDs and cross references when converting markdown to docbook. This would ensure the software needed is available in Debian.

I am currently typesetting a set of books using a Markdown->Docbook pipeline, and need a way to reference figures in the text.

ikcalB commented 5 years ago

@jgm is there any progress on this inside the main tree, or would you suggest using the filter pandoc-crossref?

mb21 commented 5 years ago

for now, use pandoc-crossref

esnahn commented 5 years ago

I wish there were an option to interpret links as numberings instead of hyperlinks, for (but not limited to) non-electronic media. Something like ![Figure fig#. Caption](/path/file.png) and [fig.](Figure fig#. Caption), stripping all automated stuff except the numbering.

tomduck commented 4 years ago

I am pleased to announce the 2.0.0 release of the pandoc-xnos filter suite:

The filters emerged from recommendations made by the community in this thread, and in particular this post by @scaramouche1.

despresc commented 3 years ago

For the label/ref problem, labelling itself is pretty simple: they're just opaque identifiers, though some document systems (like some LaTeX packages and the existing reference-providing filters) have prefixed labels like thm:thing. (Incidentally, my preference is for future Markdown syntax not to require any internal structure on labels, beyond, say, what citations already require).

Numbering things and rendering references to them, on the other hand, strongly resembles the process of generating citations and bibliographies, and the ways that can be done vary almost as widely. Typing of numbered things, choosing how to insert numbers in titles, reference prefixes, configuring numbering with counters, modifying counters in the text, and automatically generating identifiers can all be supported and configured.

So it will be hard to choose exactly how Pandoc will number things and render references, and what configuration will be allowed. It could be as complex as LaTeX, but I'm not sure if that complexity is welcome in pandoc itself (maybe it is?). The Markdown syntax for refs will also have to be chosen, though I imagine it will operate somewhat like the citation syntax does currently, judging from the discussion in the thread above.

Ideally, the intermediate representation would be modified so that in principle a filter could perform numbering and reference rendering like pandoc-citeproc does for citations, potentially more complexly than pandoc itself would. This can be done without settling the other issues. In the simplest design, labels (as identifiers) and numbers (if at all) can be stored in the Attr that we have now, requiring no IR change there. References should get their own element, and based on the current Citation type, the following could work:

-- Support for labelling more things can be added by adding Attr to more types.
data Inline
 = ...
 | Ref [Reference] [Inline]
 ...

-- Might want to record whether or not it's a page reference for
-- paginated formats like TeX.
data Reference = Reference
  { referenceId :: Text
  , referencePrefix :: [Inline]
  , referenceSuffix :: [Inline]
  , referenceMode :: ReferenceMode
  , referenceHash :: Int
  }

-- The main modifier of a reference at the reference site itself
-- is how to render a prefix, if at all. 
data ReferenceMode
  = UpperCasePrefix
  | LowerCasePrefix -- may not be needed?
  | SuppressPrefix
  | NormalReference

The intent is to support using Ref like Cite is right now in the readers, to store a sequence of references from a compound reference and the text of what was parsed.

Slightly off-topic, but I have no idea what the citationNoteNum in Citation does. I'm not sure if it's used at all in the core pandoc packages. What is it for?

despresc commented 3 years ago

If numbers (meaning the full rendered number, like "2.4.1") were stored in the Attr of the numbered thing (to expose them to other filters), it would be wise to agree on a particular key for them. Having it be number is the easiest, I guess.

jgm commented 3 years ago

We already use number in sections (after makeSections), so yes, I agree on that.

With Ref, I guess your idea is that the Ref elements will be postprocessed by a filter or built-in transformation, as Citations are now. The [Inline] part will be replaced by the rendered reference. That makes a lot of sense to me.

citationNoteNum -- I don't think it is used. In pandoc-citeproc the citeNoteNumber is taken from it, but since (as far as I can see) it's always 0 this never makes any difference. This type originates from citeproc-hs and probably needs some adjusting, especially as I go forward with the new citeproc processor. I can see why a field such as this would be needed. Some styles include back references like "Op. cit., n. 13" where you have to know the note number in which a particular citation occurs. In my current citeproc implementation, we get these numbers by assuming one note per citation -- but of course that breaks if you have a document containing both citations and footnotes, and you're using a footnote citation style. In that case we'd need some way for pandoc to tell citeproc, "This citation would be the Nth rendered note." I see no reason why we can't simply use the existing field for this -- it's probably what it was intended for.
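To show why footnotes and citations have to be counted together, here is a minimal Python sketch over a flat list of pandoc JSON inlines. It assumes a footnote citation style in which every Cite becomes exactly one rendered note; the function name `assign_note_numbers` is hypothetical and this is not citeproc's algorithm.

```python
def assign_note_numbers(inlines):
    """Set citationNoteNum on every citation, counting ordinary footnotes too."""
    note_number = 0
    for inline in inlines:
        if inline["t"] == "Note":
            # An ordinary footnote uses up a note number.
            note_number += 1
        elif inline["t"] == "Cite":
            note_number += 1
            citations, _rendered = inline["c"]
            for citation in citations:
                citation["citationNoteNum"] = note_number
    return note_number
```

A real implementation would need to walk nested blocks and inlines rather than a flat list.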

despresc commented 3 years ago

Yes, internal references are enough like citations that I thought the same sort of representation and handling would be good, since Cite seems to work well in practice.

I think for the writers that didn't support citations (all of them initially), the fallback would be exactly what the fallback for Cite is now: just attempt to render the [Inline] content if possible.
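Putting the pieces of this exchange together, a filter could read the agreed `number` key from header attributes and use it to render references, falling back to the parsed inlines when a label is unknown. The Python sketch below works on pandoc's JSON; since no Ref element exists, a reference is represented simply by a list of ids plus fallback inlines, and both function names are made up.

```python
def collect_numbers(blocks):
    """Map identifier -> 'number' attribute for every numbered header."""
    numbers = {}
    for block in blocks:
        if block["t"] == "Header":
            _level, (identifier, _classes, keyvals), _inlines = block["c"]
            number = dict(keyvals).get("number")
            if identifier and number:
                numbers[identifier] = number
    return numbers


def render_ref(ref_ids, fallback_inlines, numbers):
    """Replace a reference's inlines with numbers, or keep the fallback."""
    if ref_ids and all(ref_id in numbers for ref_id in ref_ids):
        text = ", ".join(numbers[ref_id] for ref_id in ref_ids)
        return [{"t": "Str", "c": text}]
    # Unresolved: behave like a writer without reference support.
    return fallback_inlines
```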

despresc commented 3 years ago

If citationNoteNum is intended for that purpose, then there probably won't be any need for the analogous referenceNoteNum. I'm not sure I've seen an ibid. used with a reference before.

tstenner commented 3 years ago

For some internal manuscripts I've put together a Lua filter that handles most cross references.

It currently assigns IDs to tables and equations based on attribute blocks at the end of the caption (i.e. : Caption for this table {#tab:example}) and surrounding spans for equations ([$$a^2+b^2=c^2$$]{#eq:pythagoras}). In the next step, citations starting with a prefix (fig:, tab: etc.) are replaced with a link to the element or natively counted references (LaTeX + docx).
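The caption convention described here (an attribute block at the very end of the caption) is easy to sketch with a regular expression. The Python below is only an illustration of the parsing step, with a hypothetical function name; the actual filter is written in Lua and works on the AST rather than on strings.

```python
import re

# An id-only attribute block at the very end of the caption, e.g. {#tab:example}.
TRAILING_ID = re.compile(r"\s*\{#([\w:.-]+)\}\s*$")


def split_caption_id(caption):
    """Return (caption_without_attribute, identifier_or_None)."""
    match = TRAILING_ID.search(caption)
    if not match:
        return caption, None
    return caption[:match.start()], match.group(1)
```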

It's not meant as serious competition to the excellent pandoc-xnos, but rather as testing ground for new features (i.e. table attributes) and pandoc-xnos compatible implementation for the most basic needs.

N0rbert commented 2 years ago

I tried to summarize current out-the-box pandoc LaTeX → docx experience in the question at StackOverflow. With test document and pandoc simple.tex --to docx --output simple.docx --table-of-contents --toc-depth 5 --number-sections --citeproc --verbose --csl ieee.csl command I obtained the following docx-rendering:

[image: screenshot of the docx rendering]

I see many strange things:

Hope you will provide official out-the-box pandoc solution for it without third-party filters and so on. Do we currently have a solution, which I probably missed?

jgm commented 2 years ago

@N0rbert one thing you're missing is the native_numbering extension. (The reason this isn't enabled by default is that it interferes with the popular filter pandoc-crossref.) If you do -t docx+native_numbering, then the situation improves a little bit: you get

Figure 1: [fig:image] Image

Table 1: [tab:table]

There's some low-hanging fruit here:

Of course, that still leaves us without good references to numbered equations (indeed, without numbered equations).

jjallaire commented 2 years ago

Yes, we could establish a protocol where filters set a specific metadata value to indicate that they have already handled numbering. Maybe for consistency w/ native_numbering we could set filter_numbering or filter-numbering (or filter_numbered, filter-numbered, etc.)
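Such a protocol could be as small as one boolean metadata field. The Python sketch below uses pandoc's JSON document layout (`meta` mapping keys to typed values such as `MetaBool`); the key name `filter_numbering` is just one of the candidates floated above, and no such convention is known to be honoured by pandoc.

```python
FLAG = "filter_numbering"  # hypothetical key; the name is not settled


def already_numbered(doc):
    """True if an earlier filter declared that it handled numbering."""
    value = doc.get("meta", {}).get(FLAG)
    return bool(value) and value.get("t") == "MetaBool" and value.get("c") is True


def mark_numbered(doc):
    """Record in the metadata that this filter has handled numbering."""
    doc.setdefault("meta", {})[FLAG] = {"t": "MetaBool", "c": True}
    return doc
```

A writer with native numbering enabled would then check `already_numbered` and skip its own numbering when the flag is set.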

N0rbert commented 2 years ago

Thank you for quick reply. With -t docx+native_numbering document looks better. I'll keep an eye on next releases to check the changes provided by last two mentioned commits. Thanks!

BishopWolf commented 2 years ago

I'm using -t docx+native_numbering but the rendered docx file still does not contain any reference when using the \autoref{whatever-(equations, figures, sections, etc)}

N0rbert commented 1 year ago

With the latest pandoc 3.1.2-1 on the upcoming Debian 12, only equations are not numbered: the resulting document has [eq:eq]. See the image below:

pandoc3@debian12

Thanks!