jgm / pandoc

Universal markup converter
https://pandoc.org
Other

Built-in support for indices? #6415

Open rauschma opened 4 years ago

rauschma commented 4 years ago
mb21 commented 4 years ago

Are you talking about LaTeX indices, or...? I'm not very familiar with those. What exactly is the feature request? A change to pandoc's LaTeX output, or...?

rauschma commented 4 years ago

> Are you talking about LaTeX indices, or...?

Indices are a crucial feature of print books (and are very helpful for digital books, too). This is an example in HTML (produced via my Lua filter): https://exploringjs.com/impatient-js/ch_index.html

> What exactly is the feature request? A change to pandoc's latex output, or...?

Include the functionality in Pandoc that I have implemented via my filter:

One needs to support two “commands”:

mb21 commented 4 years ago

Thanks for the explanations. Seems somewhat related to https://github.com/jgm/pandoc/issues/813... ?

rauschma commented 4 years ago

Very loosely. So far, I have always put index terms into the top level of a section, never inside tables, figures, or headers.

rauschma commented 4 years ago

A few more details – the idea of creating an index is as follows:

  1. When you write about topic SomeTopic, you put \index{SomeTopic} next to it. You can think of it as a link target. It being next to the topic means that if the content ever moves, so does the link target.
  2. My filter collects all link targets and creates an alphabetically sorted list of topics = the index.

Step 2 is crucial and shouldn’t have to be done manually (=error-prone tedious work).
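The two steps above can be sketched roughly as follows. This is a minimal Python illustration of the idea, not the Lua filter mentioned in this thread; the input shape (a list of `(section_id, text)` pairs) and the function names `collect_index` and `build_index` are assumptions made for the example.

```python
import re
from collections import defaultdict

# Matches \index{Term}; nested braces are not handled in this sketch.
INDEX_RE = re.compile(r"\\index\{([^{}]*)\}")


def collect_index(sections):
    """Step 1: map each index term to the ids of the sections that mention it.

    `sections` is an iterable of (section_id, text) pairs, a hypothetical
    input shape chosen for this example.
    """
    targets = defaultdict(list)
    for section_id, text in sections:
        for match in INDEX_RE.finditer(text):
            term = match.group(1).strip()
            if term and section_id not in targets[term]:
                targets[term].append(section_id)
    return dict(targets)


def build_index(sections):
    """Step 2: return (term, [section ids]) pairs sorted case-insensitively."""
    targets = collect_index(sections)
    return sorted(targets.items(), key=lambda item: (item[0].casefold(), item[0]))
```

Because the markers travel with the text, rerunning `build_index` after content moves yields an updated index without manual bookkeeping.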

An index is similar to a table of contents in that it also provides quick access to content (but via topic, not via heading). This is especially important for print books where you don’t have full text search. Most non-fiction print books have indices. If they don’t, people complain on Amazon. 😀

bpj commented 1 year ago

@rauschma now that Pandoc's Lua API includes lpeg/re, it would make sense to use an re pattern to parse \index{...} strings. An alternative might be to (ab)use Pandoc's citation syntax, something like @idx:SORTKEY[ACTUAL-TEXT], to allow (Pandoc) formatting in the actual text, although I fully understand that you probably don't want to change your workflow at this point. Also, it would be great if "special" characters could be included in sort keys[^1]; I once wrote an implementation of Sort::ArbBiLex in MoonScript/Lua which "converts" sort keys to a string of hex numbers separated by dots to make it possible to use arbitrary sort orders with Lua's table.sort function. You can even sort non-Latin alphabets with it, although CJK not so much...

[^1]: Because e.g. in Swedish å, ä, ö sort at the end of the alphabet, unlike German where you can treat ä, ö, ü as a, o, u or ae, oe, ue. In the Swedish case you can (and I do) cheat by using ~a ~e ~o, but such hacks are not possible for all languages/sort orders.
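The "hex numbers separated by dots" idea can be illustrated with a small Python sketch. It is not a port of Sort::ArbBiLex or of the MoonScript/Lua implementation mentioned above; the function name `make_sort_key`, the `SWEDISH` alphabet string, and the fallback rule for unlisted characters are assumptions for the example. It also assumes precomposed characters (no Unicode normalization is applied).

```python
def make_sort_key(alphabet):
    """Build a key function that orders strings by a custom alphabet.

    `alphabet` lists characters in the desired order. Characters missing
    from it sort after all listed ones, by code point (an assumed fallback).
    """
    rank = {ch: i for i, ch in enumerate(alphabet)}
    base = len(alphabet)

    def key(term):
        parts = []
        for ch in term.casefold():
            n = rank.get(ch, base + ord(ch))
            # Fixed-width hex, so plain string comparison matches numeric order.
            parts.append(f"{n:06x}")
        return ".".join(parts)

    return key


# Swedish order: å, ä, ö come after z.
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
swedish_key = make_sort_key(SWEDISH)

words = ["öl", "zebra", "äpple", "apa", "ål"]
print(sorted(words, key=swedish_key))
```

Since each key is an ordinary string, any comparison-based sort can use it unchanged, which is the property the comment describes for Lua's table.sort.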

nickovs commented 1 year ago

It would really be great if Pandoc supported indices, for all of the reasons outlined above. In pretty much any non-fiction work of non-trivial length where readers might want to look up a topic, it is useful to have an index.

bpj commented 1 year ago

It's a can of worms though, since different languages have different sorting rules. Hopefully there is a Haskell library similar to Unicode::Collate/Unicode::Collate::Locale or Sort::ArbBiLex. I have ported the latter to MoonScript/Lua myself but I'd be loth to ask for either to be ported to Haskell as I'm unable to do it myself.

Also how would the index work? In a PDF/ebook you would want to refer to page numbers and link to the pages. In (a) webpage(s) you would want to link to locations which might be in another file, and in that case what should the link text look like? Moreover with HTML output you would want the index to look different depending on whether you output a single web page, multiple web pages, PDF or ebook. In some cases you would probably want to reference sections or paragraphs, which probably would mean that you would want to have section/paragraph numbers already sorted out.[^parnum]

[^parnum]: Which in turn probably means that you want section/paragraph numbers to be wrapped in spans with a class to make it easier to pick them up.

That's a lot of configuration and at the end of the day you might be better off using some external tool to build the index, in conjunction with either a filter or something builtin which produces the input to the external tool, a bit like makeindex works even if you wouldn't use makeindex itself due to its limitations.
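As a rough illustration of "something that produces the input to the external tool", the sketch below formats collected entries as `\indexentry{key}{page}` lines, the form makeindex reads from `.idx` files. The function name `to_idx_lines` and the `(term, locator)` input shape are hypothetical, and where the locators would come from in a Pandoc pipeline is left open here.

```python
def to_idx_lines(entries):
    """Format (term, locator) pairs as makeindex-style \\indexentry lines."""
    lines = []
    for term, locator in entries:
        # In makeindex's default style, ! @ | and " are special inside the
        # key and are quoted by prefixing them with a double quote.
        escaped = "".join('"' + ch if ch in '!@|"' else ch for ch in term)
        lines.append(f"\\indexentry{{{escaped}}}{{{locator}}}")
    return lines


# Example data invented for this sketch.
for line in to_idx_lines([("closure", 12), ("scope", 12), ("closure", 40)]):
    print(line)
```

Sorting, merging of locators, and layout would then be left to the external tool, which keeps the configuration questions raised above out of the filter.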

nickovs commented 1 year ago

It's certainly got plenty of cases that need to be considered, but it's not really a can of worms. Many of the problems to be solved are pretty orthogonal to each other. It seems like the key things that are needed are:

Each of these chunks is distinct and their interfaces are pretty easy to define, so we should be able to take any can of worms, separate the worms, straighten them out and line them up neatly!

jgm commented 1 year ago

> Hopefully there is a Haskell library similar to [Unicode::Collate]

There is my unicode-collation, which we use for proper sorting in citeproc.

barriteau commented 9 months ago

Maybe it would make sense to implement this with three new Pandoc options? Something like:

--abbreviation-index: include an automatically generated abbreviations index in the output document. This index would include abbreviations created with the existing markup for abbreviations (https://pandoc.org/MANUAL.html#extension-abbreviations), the ones in a custom abbreviation file specified with --abbreviations=FILE and those created with the markup for abbreviations I'm suggesting in https://github.com/jgm/pandoc/issues/9227.

--definition-index: include an automatically generated definitions index in the output document. This index would include definitions created with the existing markup for definition lists (https://pandoc.org/MANUAL.html#definition-lists) and those created with the markup for definitions I'm suggesting in https://github.com/jgm/pandoc/issues/9227.

--full-index: include an automatically generated definitions and abbreviations index in the output document.

Reference

glossaries LaTeX package: https://ctan.org/pkg/glossaries

makeidx LaTeX package: https://ctan.org/pkg/makeidx

makeindex LaTeX package: https://ctan.org/pkg/makeindex