Adding word index to documentation

lonetech commented 2 years ago

Suggested labels: documentation enhancement

I wanted an index of words while browsing the documentation, and in discussion #328 there was a suggestion to make the documentation accessible on the target computer as well. I've done some experiments with TeX makeindex. While it can neatly generate an index, it gets rather tricky to do all the necessary quoting for symbols while keeping the sorting correct. This is already an issue with names like ---modules---. Trying to define shorthand macros just exacerbated the issue, which admittedly may be because of my inexperience with TeX.

FWIW, these are the basic commands I applied to do TeX indexing:

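\usepackage{imakeidx}
% preamble: enable the index before any \index entry is recorded
\makeindex[intoc,columns=3]
% in the text, next to each word definition
\index{-{-}-modules-{-}-}\texttt{---modules---}
% where the index should be printed
\printindex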

The rough cut of that experiment is at https://github.com/lonetech/durexforth/tree/doc-index-tex

I ran into things like words starting with - being offset to the left, @ disappearing and 2@ showing as 2, " breaking the index, etc. (Both those can be escaped with a " in front.)
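The quoting rule itself can be applied mechanically. Here is a minimal sketch in Python, assuming only makeindex's four default special characters (!, @, | and "), each protected by a preceding ". The names makeindex_quote and index_entry are made up for illustration, and LaTeX's own special characters (such as # or \) are deliberately not handled:

```python
# Characters that makeindex treats specially in its default input style:
# "!" separates levels, "@" separates sort key from display text,
# "|" introduces a page encapsulator, and '"' is the quote character itself.
MAKEINDEX_SPECIALS = '!@|"'


def makeindex_quote(word: str) -> str:
    """Prefix each makeindex special character with the quote character."""
    return "".join('"' + ch if ch in MAKEINDEX_SPECIALS else ch for ch in word)


def index_entry(word: str) -> str:
    """Build an index command that typesets the word in a typewriter font.

    Hypothetical helper: the quoted word is used both as sort key and as
    display text. LaTeX-special characters are NOT escaped here.
    """
    quoted = makeindex_quote(word)
    return "\\index{" + quoted + "@\\texttt{" + quoted + "}}"


if __name__ == "__main__":
    for forth_word in ["dup", "2@", 's"']:
        print(index_entry(forth_word))
```

This only covers the makeindex layer. Sorting of names like ---modules--- would still need a hand-chosen sort key in front of the @.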

However, this is a challenge others have already faced.

Texinfo was designed from the start for online documentation combined with TeX typesetting for print, specifically for programs. It has the concept of code in indices, with only three special characters, @{}, all escaped with @. I did a rough bit of conversion (starting from pandoc --to texinfo) as a proof of concept, but want to open discussion before committing to converting the whole manual. pandoc makes a mess of some things, like putting # in anchors, but what's been automatically mangled can be automatically unmangled with a helpful macro in vim. I'll endeavour to document those; this is just a proof of concept.
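For comparison, escaping a Forth word for Texinfo is short enough to sketch in a few lines of Python. This assumes only the three special characters mentioned above; the function name texinfo_escape is hypothetical, and anything beyond character escaping (for example how dashes are typeset outside code markup) is out of scope:

```python
# Texinfo's three special characters and their escaped forms.
_TEXINFO_ESCAPES = str.maketrans({"@": "@@", "{": "@{", "}": "@}"})


def texinfo_escape(text: str) -> str:
    """Escape @, { and } in one pass so inserted @ signs are not re-escaped."""
    return text.translate(_TEXINFO_ESCAPES)


if __name__ == "__main__":
    for forth_word in ["2@", "---modules---", "{"]:
        print(texinfo_escape(forth_word))
```

Doing the replacement in a single pass matters: escaping { first and @ afterwards would double the @ that was just inserted.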

There are clear differences I've yet to address, like some red highlights in the word anatomy section. It's definitely possible to do those, but I haven't figured out how yet. Also, words defined with the @deffn command have to have a category, for which I simply reused the subsection. It shows up at the right. Particular differences show at case (four words in one description), pick (uses subscripts which looked so terrible in text output I made them PDF-only), value where the fetch isn't actually a word definition (looks okay in PDF, but bad in html or info - need to select a solution), and under Variables where I abused the C-like return type word to keep the metavariable foo in the definition.
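To make the @deffn shape concrete, here is a rough Python sketch that emits one definition block. It assumes the general form "@deffn category name arguments", with a multi-word category wrapped in braces so it counts as a single argument. The function name deffn_block and the sample category are hypothetical and not taken from the manual; how the stack effect ends up typeset in print is not addressed:

```python
def _esc(text: str) -> str:
    """Escape Texinfo's special characters @, { and } in a single pass."""
    return text.translate(str.maketrans({"@": "@@", "{": "@{", "}": "@}"}))


def deffn_block(category: str, name: str, args: str, description: str) -> str:
    """Return a @deffn ... @end deffn block for one word.

    Hypothetical helper: `category` is whatever the manual uses to group
    words (the subsection title in the proof of concept).
    """
    cat = _esc(category)
    # A category containing spaces must be braced to stay one argument.
    if " " in cat:
        cat = "{" + cat + "}"
    header = f"@deffn {cat} {_esc(name)} {_esc(args)}".rstrip()
    return "\n".join([header, _esc(description), "@end deffn"])


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(deffn_block("Stack manipulation", "dup", "( x -- x x )", "Duplicate x."))
```

The definition command adds the word to the function index by itself, which is the part that avoids writing each entry twice.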

https://github.com/lonetech/durexforth/tree/doc-index-texinfo

The texinfo version builds easily with texi2any --pdf manual.texi, no faffing about with multiple passes to get the index correct.

durexforth.pdf texinfo-durexforth-sample.pdf

By the way, that front page is pretty and all, but it takes up 70% of the manual file size. It looks like some vector graphics that could be far smaller with a bit of noise texture on. Perhaps we could check if some sort of TeX rendered noisy texture could be made lighter weight. For the texinfo question, however, suffice to say we can still include it as it is.

jkotlinski commented 2 years ago

Hallo! I understand your question as "may I switch the documentation system to Texinfo?". I think that is OK if there is a good enough reason. From what I read, I do not feel that an improved index syntax is reason enough. Quoting special characters seems like an annoyance, not a major problem.

Documentation systems are an interesting topic though. I suppose these days Markdown is the most popular thanks to its simplicity. I like LaTeX because the PDF output looks good, although I suspect people do not really prefer PDF these days.

lonetech commented 2 years ago

Yes, that is my suggestion. Part of it is because Texinfo print formatting is included with TeX Live, so it's still the same family of layout tools at some level.

It's not only the quoting. In LaTeX the index entries (as far as I've tried) are distinct tags from the text, so it involved writing the entries twice, in different formats. This could be worked around with custom commands, but that's precisely what texinfo is (and I'm not skilled enough to do the same). The PDF renderer for Texinfo is pdftex, and the first line \input texinfo makes the Texinfo version a valid TeX document - albeit with a different DSL. I wasn't sure how critical the styling is, so I wanted to present a sample for comparison. I don't have a great eye for e.g. typeface styles. Is the LaTeX vs TeX difference large?

There is some extra verbiage in the Texinfo format, the menus. They aid in hypertext navigation but don't show in the print version. HTML example from Texinfo manual.

I was also thinking, once we have word indices and ctags extracting words (universal ctags config), we could automatically detect undocumented words. I'm greatly enjoying vim's tag support already, though I'm not sure what key ^] should be in V on C64. Perhaps just make :ta have a default argument of the current word for starters.
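The undocumented-word check could be as small as a set difference. Here is a sketch in Python, assuming a standard ctags file (tab-separated, tag name in the first field, pseudo-tag lines starting with !_TAG_) and assuming the indexed words have already been collected into a list by some other step. The function names and the sample data are made up for illustration:

```python
def tag_names(ctags_lines):
    """Collect tag names from the lines of a ctags file.

    Pseudo-tags start with "!_TAG_"; a Forth word that merely starts
    with "!" is therefore still kept.
    """
    names = set()
    for line in ctags_lines:
        if not line.strip() or line.startswith("!_TAG_"):
            continue
        names.add(line.split("\t", 1)[0])
    return names


def undocumented(ctags_lines, indexed_words):
    """Return the sorted words that are tagged in source but not indexed."""
    return sorted(tag_names(ctags_lines) - set(indexed_words))


if __name__ == "__main__":
    # Made-up sample data; real input would come from the tags file
    # and from whatever index the documentation build produces.
    sample_tags = [
        "!_TAG_FILE_FORMAT\t2\t/extended format/",
        "dup\tcore.fs\t/^: dup/",
        "2@\tcore.fs\t/^: 2@/",
    ]
    print(undocumented(sample_tags, ["dup"]))
```

Whether word names should be compared case-sensitively is left open here.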

I looked some at Markdown (pandoc in particular) and Asciidoctor, but they didn't seem very helpful at index generation, which was what I wanted in the first place. Material for MkDocs boasts searchability but my second impression was an incredibly heavy web page.

On yet another hand, texi2pdf hints that LaTeX documents can also be formatted into HTML or info by HeVeA. I have more tools to look into. (Looks like that may be a red herring, those switches work on texi2any but not texi2pdf, afaict. Further look shows HeVeA on its own can produce info or html complete with index from LaTeX source, albeit atm with horrendous formatting.)

jkotlinski commented 2 years ago

All right! If you are convinced Texinfo is the logical and best solution, I see no reason to block progress :-) For typesetting, I noticed there are differences between the Texinfo sample and the original manual, but it might very well be a change for the better. Of course, details like red text are a nice touch, so I would prefer if that could be kept.

jkotlinski commented 2 years ago

About the cover image, I would not mind cleaning that up a bit to reduce file size. Maybe use PDF version and remove the white-on-blue text + logo.

lonetech commented 2 years ago

I'm not quite convinced. Right now I'm poking at HeVeA, and initial impressions are decent. It has the annoying property of putting commas after each index entry, but texinfo pretty much put colons after them all, so that's a bit of a wash (I have a workaround for PDF). Hevea accepts the latex files pretty neatly, including the colours, but doesn't understand multicols (used in mnemonics.tex); then again, I haven't started on that in texinfo either. Hevea also named the chapters "Chapter N", rather than using their titles.

The texinfo definition structure does have a few upsides, like automatic inclusion in the index, marking and distinguishing code and variables, and less risk of awkward line breaks for short descriptions. On the other hand, there's the noisy category mark. Oddly enough, while I expected the definitions would need more space, it seems that evened out with the smaller margins.

Hevea info index lacked the neat columns of texinfo output, but I might want to rework the file format anyway when building a C64 viewer. Text like "The editor (fully described in chapter 3*Note Chapter 3::)" looks a bit awkward and won't get any better in 40 columns.

The PDF cover image is already 1/10th the size of the JPG, and shrinks to half that size if saved as an optimized SVG with Inkscape.

Yet another potential concern down the road is optimizing for ereaders. There are a couple of tools for that starting from tex, not so much from texinfo. It may even be that using Plastex is a better idea than reformatting info files for generating C64-readable documentation.

I'm going to experiment a while yet.

jkotlinski commented 2 years ago

I suspect that in practice, the best is to have an online manual like Gforth. That one is Texinfo based, and seems like its glossary is auto-generated from source comments.

lonetech commented 2 years ago

Indeed, they use marked up sources, a variant of literate programming now popular with tools like doxygen. For durexForth we'd want to extend it with assembly versions of words too.

In a way, that method could be easier on the target system for the case of occasional lookups of specific words; we could just include the source as is, and look up words by their tags. A faster to load preprocessed version might be of use too.

Interestingly, the word definitions in Gforth's texinfo don't use the default deffn and such commands. They use a set of distinct indices (e.g. with or without stack effects, mixed with various other concepts and switches) and a format block, in a manner that would certainly be tedious if not tooled.

jkotlinski commented 2 years ago

Oddly enough, I just bumped into a GNU sticker while visiting a local museum. I view this as a positive omen, favoring the use of Texinfo.

On Wed, 27 Jul 2022 at 02:13, Yann Vernier @.***> wrote:

lonetech commented 2 years ago

And I think I found a modified makeinfo that generates amigaguide documents. Not portable C, but its existence is neat.

Sadly, I've been failing to replicate a couple of basic features. I still haven't got red, and I don't have flowing columns, despite them being used in the index. Manually formatted tables just aren't that fun for the mnemonics list, particularly when I'm targeting unknown formats like ereaders or web browsers. And hevea hardcodes commas onto our index entries.

plastex has trouble processing microtype, but that's specific to tex anyhow. It also didn't handle the multicol or red tags by default, and a bunch of link targets are missing when not splitting. Grandly, the website sample using "default settings" appears to use a theme which is not included.

tex4ht should probably not be discounted. make4ht had no problem with the red or columns. Adding a .mk4 file to run xindy made the index work, too. It could also make a usable epub (tex4ebook). Both were however using annoying formatting of the index (with commas and only one column); that should be fixable using a xindy style file. Experiments continue. One thing I found is that index links are apparently injected at ", [0-9]+" patterns when using make4ht. Which makes the annoying comma a requirement.
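To show why the comma ends up mattering under that observation, here is a tiny Python illustration of what a ", [0-9]+" pattern picks up in an index line. This is not make4ht's actual code, just the regular expression applied to made-up sample lines; the name page_refs is hypothetical:

```python
import re

# The pattern observed above: a comma, a space, then a run of digits.
PAGE_REF = re.compile(r", [0-9]+")


def page_refs(index_line: str):
    """Return every ', <number>' fragment found in one index line."""
    return PAGE_REF.findall(index_line)


if __name__ == "__main__":
    # Made-up index lines: one with commas, one without.
    print(page_refs("dup, 12, 34"))
    print(page_refs("dup 12 34"))
```

With the commas styled away, nothing matches, so there would be nowhere to attach the links.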

I'm also thinking info is close to what we want (AmigaGuide is a close relative of it), but not quite ideal. We'll want to keep some concepts, like chunk splits and node indices, but some markup could be helpfully shrunk and we could use PETSCII colour codes. That way we can simply define a coloured region as a link, colorForth style. Converting to info before doing that sort of markup loses some metadata, like where and how to word wrap.

Also, regarding asciidoc, the warning about index generation appeared limited to a particular toolchain. Index generation does work through docbook toolchains.

I apologize for the severe uncertainty. Regarding making an index, we do have the facility for the existing latex. I'm currently struggling with how to format the index in columns for HTML/ePub output.

jkotlinski commented 2 years ago

Maybe one column is acceptable for HTML output? After all, some HTML viewers have quite limited width (e.g. cell phones, C64)

lonetech commented 2 years ago

One column is the fallback, and indeed what I get in fbreader, koreader and okular. What I wanted to get rid of was Foliate showing corrupt columns (actually illegible). kobo amazingly got the inline format (all words starting with the same letter in one paragraph), probably a quirk of nesting p elements. kobo was also very slow to follow the links, but it does work. Foliate correctly previews the line with the index tag, but just jumps to the section (e.g. words.tex) when following the links. Several readers show the Index heading twice in their table of contents, but that's better than not having it listed.

Anyhow, I think I have tex4ht building html and epub tolerably now. The html version does have columns in the index, automatically adapting to the display size (down to 9em, I think). The title page isn't as pretty currently (actually an issue in the epub, which would do better with a good thumbnail). Some cleanups (no need to put hevea specific files in), some proofreading, and I think we might form it into a pull request.

I have made no attempts to read the manual using a C64 browser such as Singular or Hyperlink as yet.

jkotlinski / durexforth

Adding word index to documentation #448