Built-in support for indices?

rauschma commented 4 years ago

I have written a Lua filter that supports indices for LaTeX, HTML, EPUB, and other formats.
- Warning: I have never written Lua before, so my code is probably, let’s say, unidiomatic.
I generally appreciate the principle of delegating work to filters, but this feels like a case where built-in support would make sense.
- Additional benefit: Pandoc does a lot of work under the hood when it calls LaTeX. As soon as you use makeindex, you can’t use Pandoc anymore and have to manage LaTeX yourself (e.g. via latexmk).

mb21 commented 4 years ago

Are you talking about LaTeX indices, or...? I'm not very familiar with those. What exactly is the feature request? A change to pandoc's latex output, or...?

rauschma commented 4 years ago

> Are you talking about LaTeX indices, or...?

Indices are a crucial feature of print books (and are very helpful for digital books, too). This is an example in HTML (produced via my Lua filter): https://exploringjs.com/impatient-js/ch_index.html

> What exactly is the feature request? A change to pandoc's latex output, or...?

Include the functionality in Pandoc that I have implemented via my filter:

- LaTeX: let LaTeX handle the index generation, and call makeindex as necessary when creating PDFs via LaTeX.
- HTML: generate an index.

One needs to support two “commands”:

Adding the term “myterm” to the index, pointing to the current location:
- My filter uses LaTeX syntax: \index{myterm}
- Pandoc might use: [myterm]{.index}
Displaying the index:
- My filter uses LaTeX syntax: \printindex
- Pandoc might use: ## Index {.append-index}

mb21 commented 4 years ago

Thanks for the explanations. Seems somewhat related to https://github.com/jgm/pandoc/issues/813... ?

rauschma commented 4 years ago

Very loosely. So far, I have always put index terms into the top level of a section, never inside tables, figures, or headers.

rauschma commented 4 years ago

A few more details – the idea of creating an index is as follows:

1. When you write about topic SomeTopic, you put \index{SomeTopic} next to it. You can think of it as a link target. It being next to the topic means that if the content ever moves, so does the link target.
2. My filter collects all link targets and creates an alphabetically sorted list of topics = the index.

Step 2 is crucial and shouldn’t have to be done manually (=error-prone tedious work).
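The two steps above can be sketched roughly as follows. This is a Python illustration of the idea only, not the Lua filter itself; the `(section_id, text)` input shape and the function name `collect_index` are assumptions made for the example, and nested braces inside `\index{...}` are not handled.

```python
import re
from collections import defaultdict

# Matches \index{term}; nested braces are not handled in this sketch.
INDEX_RE = re.compile(r"\\index\{([^{}]*)\}")


def collect_index(sections):
    """Map each index term to the ids of the sections that mention it.

    `sections` is a list of (section_id, text) pairs (a hypothetical input shape).
    """
    targets = defaultdict(list)
    for section_id, text in sections:
        for term in INDEX_RE.findall(text):
            # Record each section at most once per term.
            if section_id not in targets[term]:
                targets[term].append(section_id)
    # Step 2: an alphabetically sorted list of topics (case-insensitive).
    return sorted(targets.items(), key=lambda item: item[0].casefold())


sections = [
    ("ch1", r"Closures\index{closure} capture variables\index{variable}."),
    ("ch2", r"Arrays\index{Array} and closures\index{closure} again."),
]
index = collect_index(sections)
```

Because each target is derived from where the `\index{...}` marker sits in the text, moving a paragraph to another section changes the collected target without any manual bookkeeping.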

An index is similar to a table of contents in that it also provides quick access to content (but via topic, not via heading). This is especially important for print books where you don’t have full text search. Most non-fiction print books have indices. If they don’t, people complain on Amazon. 😀

bpj commented 1 year ago

@rauschma now that Pandoc's Lua API includes lpeg/re, it would make sense to use an re pattern to parse \index{...} strings. An alternative might be to (ab)use Pandoc's citation syntax, something like @idx:SORTKEY[ACTUAL-TEXT], to allow (Pandoc) formatting in the actual text, although I fully understand that you probably don't want to change your workflow at this point. Also, it would be great if "special" characters could be included in sort keys[^1]; I once wrote an implementation of Sort::ArbBiLex in MoonScript/Lua which "converts" sort keys to a string of hex numbers separated by dots to make it possible to use arbitrary sort orders with Lua's table.sort function. You can even sort non-Latin alphabets with it, although CJK not so much...

[^1]: Because e.g. in Swedish å, ä, ö sort at the end of the alphabet, unlike German where you can treat ä, ö, ü as a, o, u or ae, oe, ue. In the Swedish case you can (and I do) cheat by using ~a ~e ~o, but such hacks are not possible for all languages/sort orders.
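To illustrate the "hex numbers separated by dots" idea, here is a much simplified Python sketch. It is not Sort::ArbBiLex and not the MoonScript/Lua port mentioned above; the names `SWEDISH` and `make_sort_key` are made up for the example, and it assumes a single-level ordering where every letter of the alphabet is one character.

```python
# Swedish alphabet order: å, ä, ö come after z.
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"


def make_sort_key(alphabet):
    """Return a function that turns a term into a dotted hex sort key."""
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def sort_key(term):
        parts = []
        for ch in term.lower():
            # Characters outside the alphabet sort after it, by code point.
            n = rank.get(ch, len(alphabet) + ord(ch))
            # Fixed-width hex so plain string comparison matches numeric order.
            parts.append(f"{n:06x}")
        return ".".join(parts)

    return sort_key


swedish_key = make_sort_key(SWEDISH)
terms = ["öl", "apa", "ärta", "zebra", "år"]
ordered = sorted(terms, key=swedish_key)
# ordered == ["apa", "zebra", "år", "ärta", "öl"]
```

Since the keys are ordinary strings, any generic sort routine that compares strings can apply the custom order, which is the point of the encoding.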

nickovs commented 1 year ago

It would really be great if Pandoc supported indices, for all of the reasons outlined above. In pretty much any non-fiction work of non-trivial length where readers might want to look up a topic, it is useful to have an index.

bpj commented 1 year ago

It's a can of worms though, since different languages have different sorting rules. Hopefully there is a Haskell library similar to Unicode::Collate/Unicode::Collate::Locale or Sort::ArbBiLex. I have ported the latter to MoonScript/Lua myself but I'd be loth to ask for either to be ported to Haskell as I'm unable to do it myself.

Also how would the index work? In a PDF/ebook you would want to refer to page numbers and link to the pages. In (a) webpage(s) you would want to link to locations which might be in another file, and in that case what should the link text look like? Moreover with HTML output you would want the index to look different depending on whether you output a single web page, multiple web pages, PDF or ebook. In some cases you would probably want to reference sections or paragraphs, which probably would mean that you would want to have section/paragraph numbers already sorted out.[^parnum]

[^parnum]: Which in turn probably means that you want section/paragraph numbers to be wrapped in spans with a class to make it easier to pick them up.

That's a lot of configuration and at the end of the day you might be better off using some external tool to build the index, in conjunction with either a filter or something builtin which produces the input to the external tool, a bit like makeindex works even if you wouldn't use makeindex itself due to its limitations.
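As a rough sketch of the "produce input for an external tool" approach: LaTeX hands index data to makeindex as an .idx file containing one `\indexentry{term}{page}` line per occurrence. The Python below emits lines of that shape; the function name `to_idx_lines` and the `(term, locator)` input are assumptions for illustration, and it does not escape makeindex's special characters (`!`, `@`, `|`, `"`).

```python
def to_idx_lines(entries):
    """Format (term, locator) pairs as makeindex-style \\indexentry lines.

    `entries` is a hypothetical list of pairs; for PDF output the locator
    would be a page number, for other formats it could be a section number.
    """
    return [f"\\indexentry{{{term}}}{{{locator}}}" for term, locator in entries]


lines = to_idx_lines([("closure", 12), ("variable", 3), ("closure", 27)])
idx_text = "\n".join(lines) + "\n"
```

Sorting, merging of duplicate terms, and layout are then left entirely to the external tool.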

nickovs commented 1 year ago

It's certainly got plenty of cases that need to be considered, but it's not really a can of worms. Many of the problems to be solved are pretty orthogonal to each other. It seems like the key things that are needed are:

A syntax for identifying the location of terms that need to be indexed, optionally providing an index term that is different to the text in the location. @rauschma has already proposed such a syntax.
The ability to sort and collate index terms. Sorting Unicode text according to locale is a solved problem but it would be good to add a configuration switch to specify which locale to be used.
A system for resolving the target locations based on the formatted text. What this resolves to depends on the output format; for chunked HTML we need file names and anchor names while for LaTeX we can use references, but pretty much all the available output formats have some way to identify target locations and the format-specific ID will be generated by the particular formatter. The main complexity here is that for formats like ePub and PDF there is a danger that this might need two passes.
A renderer that turns resolved index target locations into neatly formatted index text. Parts of this are format-specific but there will be substantial commonality between formats, so it might be worth splitting it into a layout part and a renderer part.

Each of these chunks is distinct and their interfaces are pretty easy to define, so we should be able to take any can of worms, separate the worms, straighten them out and line them up neatly!
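One way to picture the separation between resolving, layout, and rendering is the Python sketch below. Everything in it is hypothetical: the `IndexTarget` fields, the function names, and the assumption that the writer has already assigned a chunk (output file) and anchor to each marked location. It is meant to show how small the interfaces between the parts could be, not how Pandoc would implement them.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexTarget:
    term: str    # index term as it should appear in the index
    chunk: str   # hypothetical: output file the location ends up in
    anchor: str  # hypothetical: id generated by the writer


def resolve(target, fmt):
    """Turn a target into a format-specific reference."""
    if fmt == "html":
        return f"{target.chunk}.html#{target.anchor}"
    if fmt == "latex":
        return f"\\pageref{{{target.anchor}}}"
    raise ValueError(f"unsupported format: {fmt}")


def layout(targets, fmt, sort_key=str.casefold):
    """Group resolved references by term and sort the terms (format-neutral)."""
    grouped = {}
    for t in targets:
        grouped.setdefault(t.term, []).append(resolve(t, fmt))
    return [(term, grouped[term]) for term in sorted(grouped, key=sort_key)]


def render_html(rows):
    """Render laid-out rows as a simple HTML list (format-specific)."""
    items = []
    for term, refs in rows:
        links = ", ".join(
            f'<a href="{html.escape(ref, quote=True)}">{i}</a>'
            for i, ref in enumerate(refs, 1)
        )
        items.append(f"<li>{html.escape(term)}: {links}</li>")
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

The `sort_key` parameter is where a locale-aware collation function would plug in, and only `resolve` and the renderer need to know about the output format.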

jgm commented 1 year ago

> Hopefully there is a Haskell library similar to [Unicode::Collate]

There is my unicode-collation, which we use for proper sorting in citeproc.

barriteau commented 9 months ago

Maybe it would make sense to implement this with three new Pandoc options? Something like:

--abbreviation-index: include an automatically generated abbreviations index in the output document. This index would include abbreviations created with the existing markup for abbreviations (https://pandoc.org/MANUAL.html#extension-abbreviations), the ones in a custom abbreviation file specified with --abbreviations=FILE and those created with the markup for abbreviations I'm suggesting in https://github.com/jgm/pandoc/issues/9227.

--definition-index: include an automatically generated definitions index in the output document. This index would include definitions created with the existing markup for definition lists (https://pandoc.org/MANUAL.html#definition-lists) and those created with the markup for definitions I'm suggesting in https://github.com/jgm/pandoc/issues/9227.

--full-index: include an automatically generated definitions and abbreviations index in the output document.

Reference

glossaries LaTeX package: https://ctan.org/pkg/glossaries

makeidx LaTeX package: https://ctan.org/pkg/makeidx

makeindex LaTeX package: https://ctan.org/pkg/makeindex

jgm / pandoc

Built-in support for indices? #6415
