jgm / pandoc

Universal markup converter
https://pandoc.org

Feature Request: Add support for chunked (multiple file) HTML and HTMLHelp. #6122

Closed dm413 closed 1 year ago

dm413 commented 4 years ago

It would be useful if Pandoc could produce multiple output files by splitting the output based on sections (header) levels. The output files should maintain links across files, and the table of contents should link to all files.

You could also consider adding these as input formats. For chunked HTML, the issues seem to be what order to read the files, and making sure the links are correctly handled. For HTMLHelp (on Windows), the HTMLHelp reader can split a HTMLHelp (chm) file into the original discrete files for further processing in the same way as chunked HTML.

Note that the already supported epub format is another version of a chunked html format.

This issue has been raised in the pandoc-discuss mailing list. Various ideas have been proposed, including:

bpj commented 4 years ago

This is not really just HTML. You may want to chunk up a large Markdown file into smaller Markdown files too.

jgm commented 4 years ago

Hm, maybe the first step would be writing a format-independent function

splitIntoChunks :: FilePath -> Int -> Pandoc -> [(FilePath, Pandoc)]

where the Int parameter is the heading level to split at, and the FilePath is a file path template to be used (e.g. chapter-{{ number }}.html, where {{ number }} will be replaced by the chunk number, or {{ heading }}.html, where {{ heading }} will be replaced by the full heading text (stringified), or {{ identifier }}.html, where {{ identifier }} will be replaced by the identifier on the heading).

This function would split up the document into sections and rewrite any internal links so that they point to the correct paths. Not a hard thing to write.
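A rough illustration of that idea, written in Python rather than Haskell and over a toy document model instead of the real Pandoc AST. Everything here is an assumption made for the sketch: the `Block` record, the name `split_into_chunks`, the `{number}` and `{identifier}` placeholders (Python `str.format` style, not the `{{ ... }}` syntax above), and the choice to start a new chunk at every heading at the given level or shallower.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str          # "heading", "para" or "link"
    text: str = ""
    level: int = 0     # heading level (headings only)
    ident: str = ""    # heading identifier (headings only)
    target: str = ""   # link target such as "#intro" (links only)


def split_into_chunks(template, level, blocks):
    """Split blocks into (path, blocks) chunks and rewrite internal links in place."""
    chunks = []
    for b in blocks:
        starts_chunk = b.kind == "heading" and b.level <= level
        if starts_chunk or not chunks:
            number = len(chunks) + 1
            path = template.format(number=number,
                                   identifier=b.ident or f"chunk-{number}")
            chunks.append((path, []))
        chunks[-1][1].append(b)

    # Record which output file each heading identifier ended up in.
    home = {b.ident: path
            for path, bs in chunks
            for b in bs
            if b.kind == "heading" and b.ident}

    # Point "#ident" links at the right file when the target lives elsewhere.
    for path, bs in chunks:
        for b in bs:
            if b.kind == "link" and b.target.startswith("#"):
                dest = home.get(b.target[1:])
                if dest is not None and dest != path:
                    b.target = dest + b.target
    return chunks
```

For example, `split_into_chunks("chapter-{number}.html", 1, doc)` on a two-chapter `doc` would name the files `chapter-1.html` and `chapter-2.html`, and a link to `#intro` from the second chapter would become `chapter-1.html#intro`.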

Perhaps there should also be an option for adding "next," "previous," and "up" links to each chunk, as in the HTML output produced by texinfo? We could use arrows instead of the words "Next", "Previous", and "Top" to avoid English-centrism?

Just adding this to Shared would be helpful. Then we'd need to think about how to integrate it onto the command line. Perhaps the simplest approach would be this: if the output file is FILE.zip, then pandoc will create a zip file with chunked output in the specified output format (template FILE-#.FORMAT). So -t rst -o my.zip would produce a zip of chunked RST files, for example. A separate command line option could be provided to set the level for splitting, like the current --epub-chapter-level but more general. (Indeed, --epub-chapter-level could then be deprecated and replaced with this.)
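To make the zip idea concrete, here is a minimal Python sketch of packing already-rendered chunks into a zip whose members follow the FILE-#.FORMAT naming. The function name `zip_chunks` and its arguments are hypothetical; this is not how pandoc does or would do it.

```python
import io
import zipfile


def zip_chunks(stem, fmt, chunks):
    """Pack rendered chunks into an in-memory zip, naming members STEM-N.FORMAT."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for number, text in enumerate(chunks, start=1):
            zf.writestr(f"{stem}-{number}.{fmt}", text)
    return buf.getvalue()


# Two RST chunks end up as my-1.rst and my-2.rst inside the archive.
data = zip_chunks("my", "rst", ["Chapter one\n", "Chapter two\n"])
```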

dm413 commented 4 years ago

This would be very useful.

Outputting a zip file is simple, but the first thing any makefile or batch file is going to have to do is unzip it in order to further process it. How about just specifying a folder name instead of a file name? If the folder doesn't exist already perhaps you could add a trailing slash or backslash to indicate that it's a folder.

Or maybe just an option that means output a folder of files instead of one file (that is, chunked). This is more verbose, but it's clearer what you are doing.

An option for next, previous, and up links (using arrows) would be nice.

For HtmlHelp, we also need to create the project (.hhp), content (.hhc), and index (.hhk) files. Perhaps HtmlHelp is a separate issue, and if you want me to create a new issue specifically for it I can do so. But any HtmlHelp writer will need to make use of the chunked html output option, so it's good to think about how to integrate both of these into the command line. For that matter, epub output is related as well.

jgm commented 4 years ago

How about just specifying a folder name instead of a file name? If the folder doesn't exist already perhaps you could add a trailing slash or backslash to indicate that it's a folder.

That's a possibility. I like the idea of keeping the simple invariant that pandoc produces one file, but I can see this would be more convenient.

bpj commented 4 years ago

An option for next, previous, and up links (using arrows) would be nice.

I think this is a job for the template, assuming each file would be run through the template separately. Pandoc could add metadata fields this-file: NAME, prev-file: NAME, next-file: NAME so that people can include and design those links if and as they want them in the template.
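As a small sketch of what those fields could look like, the following Python computes one metadata dictionary per output file. The field names come from the suggestion above; the function name `nav_metadata` and the choice to omit a field at either end of the sequence are assumptions.

```python
def nav_metadata(files):
    """Return one metadata dict per file with this-file, prev-file and next-file."""
    result = []
    for i, name in enumerate(files):
        meta = {"this-file": name}
        if i > 0:
            meta["prev-file"] = files[i - 1]
        if i + 1 < len(files):
            meta["next-file"] = files[i + 1]
        result.append(meta)
    return result
```

A template could then test for the presence of `prev-file` or `next-file` before rendering the corresponding link.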

jgm commented 4 years ago

I think this is a job for the template

That makes sense to me!

hakan-geijer commented 4 years ago

In order to facilitate building static sites (or dumping to templates used in static sites such as jekyll or hugo), it would be useful to be able to specify a pattern for the output.

For example, I might want to run the command like:

pandoc -f markdown -t html5 \
  --chunks chapters --chunk-dest ~/projects/some-site/templates/my-book/ \
  {first,last,second}-chapter.md

And the output would be:

templates/
├── first-chapter.html
├── last-chapter.html
└── second-chapter.html

Or, there might be a way to specify other patterns so that someone could use config like

chunk-name: '{{ section[0]["name"] }}/{{ section[1]["name"] }}{{ ext }}'

To get output like first-chapter/first-section.html, first-chapter/second-section, etc.
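A minimal Python sketch of that kind of naming, assuming the path is built from slugified titles of the enclosing sections. The helper names `slug` and `chunk_path` are hypothetical, and the slug rule is a simplification rather than pandoc's identifier algorithm.

```python
import re


def slug(name):
    """Lower-case a section title and join its alphanumeric words with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", name.lower()))


def chunk_path(section_names, ext=".html"):
    """Build a nested output path from the titles of the enclosing sections."""
    return "/".join(slug(n) for n in section_names) + ext
```

With this, `chunk_path(["First Chapter", "First Section"])` gives `first-chapter/first-section.html`.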

rauschma commented 4 years ago

It may make sense for Pandoc to work with trees of files instead of single streams:

zspitz commented 3 years ago

This functionality might also be useful in filters.

jtbayly commented 3 years ago

I'm very excited for this possibility. Does being in "next release" mean that it is actually decided to implement it?

jgm commented 3 years ago

I'm afraid the "next release" tag has been aspirational so far... I would like to implement this, but it's going to take some thought.

jtbayly commented 3 years ago

Understood. That’s why I asked. Thanks so much for all your wonderful work.

ricopicone commented 3 years ago

It seems like bookdown somehow does this even though it's using Pandoc: pandoc in bookdown docs

The HTML output is split into different files and crossreferences work.

I guess this tells me there's some way of doing this now ... any ideas how?

ricopicone commented 3 years ago

It apparently happens here.

I don't know R and it's 1100 lines ... there's a lot going on here.

ricopicone commented 3 years ago

Fwiw, somebody made a pretty comprehensive filter-based version of multiple-output html files that fixes crossreference urls ... I haven't tested: https://groups.google.com/g/pandoc-discuss/c/bKhBB_uFW4o/m/uuLV7hMYCwAJ

barriteau commented 3 years ago

It seems like bookdown somehow does this even though it's using Pandoc: pandoc in bookdown docs

The HTML output is split into different files and crossreferences work.

I guess this tells me there's some way of doing this now ... any ideas how?

I'm pretty sure this isn't the best path, but epub files are made of multiple chunks of .xhtml. For a while now I've been doing this by generating .epub files with Pandoc and then using a task runner to automate the unzipping > extracting > parsing > processing > moving > fixing > renaming of the xhtml files as needed. That's an ugly hack I made a couple of years ago to solve this need for a very specific case; maybe something like that could work for you in the meantime.

ricopicone commented 3 years ago

Thanks @barriteau -- do you have your code for that? As may be the case for others, I'm making large html docs and having performance issues. There's only so much improvement I can get out of lazy loading images and the like ... mostly it's MathJax. But there's no significant reason for it to be one-file other than Pandoc. A stop-gap solution until this feature is implemented would be most welcome :)

barriteau commented 3 years ago

Yup, but I'm afraid that in its current state it's of no use to you: it's an old Grunt task with a lot of extra, specific routines for other things. I'll take a look at it to see whether it's worth cleaning up for sharing and reuse, and I'll let you know :)

jtbayly commented 3 years ago

I've looked at how Bookdown does it before. Part of the reason it is so complicated is because it supports a fair number of Pandoc options, which changes the output that it then has to process. In fact, I use Bookdown currently. One of the things that makes me hopeful about Pandoc making this change is that it might fix a couple of problems I've got with Bookdown related to its splitting process.

jgm commented 3 years ago

I still think my Feb. 6 comment above gives a good route forward on this. Most of the technical issues have already been solved, since we already have to chunk things for EPUB. It would be good to have code that could simply be reused by the EPUB writer. I think the issue about "Next/Previous/Up" links could be solved simply by populating template variables; using a custom template, you could get whatever kind of navigation links you like.

So, rough plan would be

rauschma commented 3 years ago

@jgm Still a somewhat vague idea of mine – do you think it’s possible to make your ideas more general? For example:

jtbayly commented 3 years ago

Something else to consider:

In bookdown you can specify to split the HTML up by chapter, by section, or by file. I like that flexibility, fwiw, especially the split by file option. Split by chapter sometimes gives me way too long of webpages. Split by section sometimes leaves me with nothing but a chapter title on one webpage, and then you've got to go to the next webpage to get to the next section. Split by file lets me decide.

jgm commented 3 years ago

@rauschma we already have a MediaBag to contain assets used by the document. These get passed through the plumbing in PandocMonad, so we shouldn't need to represent them explicitly. But I take the core of your idea to be that we might want to support "trees" (directories containing multiple documents) in both input and output (my proposal above is output only). This would require, at least, the change noted in https://groups.google.com/g/pandoc-discuss/c/M_UPUFs1G6o/m/hKGN-V8YBwAJ.

jgm commented 3 years ago

@jtbayly - I don't know what "split by file" would really mean, when you're splitting up a Pandoc document. (It doesn't come chunked into files.)

jtbayly commented 3 years ago

But it accepts multiple files as input, doesn't it?

dm413 commented 3 years ago

jgm, I think you are referring to your Feb 6 comment, not Feb 20. <rant>I detest github's "relative" dates. When I see "commented 22 days ago", I have no idea when that was without looking at a calendar. And "2 months ago" is meaningless.</rant>

In terms of planning, how would the TOC be done, and could that be templated as well? I'm thinking formats such as epub and htmlhelp need a TOC file in one form or another, and it would be nice if the output zip file (or directory) contained the TOC information in a form that could be turned into the required file. Even if you only intend to use the chunked html as a static web-site, you probably want to generate a TOC someplace in your site, perhaps a banner or column on every page. This file should respect the --toc-depth option.

Another question I have is how would I create an index. Here I am referring to an alphabetical index like you might see at the end of a book, not a TOC. Epub, HtmlHelp, and pdf all support such a concept. AFAIK Pandoc does not support an index natively. This may be a separate issue, and off-topic here, but I'd be interested in any thoughts you have about how to do this, even if it involves a filter and/or post-processing the output zip file/directory.

jgm commented 3 years ago

@jtbayly Yes, you can specify multiple files as input; however, everything is concatenated before parsing, and the parser doesn't even know which parts come from which files (this could be improved by https://groups.google.com/g/pandoc-discuss/c/M_UPUFs1G6o/m/hKGN-V8YBwAJ); moreover, the AST doesn't contain slots to represent source positions. A 'Pandoc' is an abstract representation of a document; you can get the same 'Pandoc' from multiple files or from one.

jgm commented 3 years ago

@dm413 Yes, we need to figure out how to deal with the TOC. I think the simplest option is to generate a TOC for the whole document (tree) and put it in one of the generated files. But this may not be the best approach if you want the TOC in a side banner.

As for an index, that's a separate issue in a way, since you could want an index even with non-chunked output. Currently there's no built-in way to construct one, but it's certainly possible to use a filter to define an indexing system. One difficulty with building in a general index system is that the requirements tend to be format-dependent. If you want, you can create a separate issue for indexes on this tracker (if there isn't one already).

dm413 commented 3 years ago

I did a quick search, there is issue #6415 Built-in support for indices?

jtbayly commented 3 years ago

the parser doesn't even know which parts come from which files (this could be improved by https://groups.google.com/g/pandoc-discuss/c/M_UPUFs1G6o/m/hKGN-V8YBwAJ);

Interesting proposal.

I took a look at the bookdown code, since I wondered how they did it, given what you said about how Pandoc works. Apparently they add an HTML comment everywhere a split needs to happen before sending it to Pandoc, then they parse it afterwards using those comments to figure out where to split.
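The post-processing half of that approach can be sketched in a few lines of Python. The marker string below is hypothetical (it is not the comment bookdown actually inserts), and real output would still need its cross-file links rewritten afterwards.

```python
MARKER = "<!--split-here-->"  # hypothetical marker, not bookdown's actual comment


def split_on_markers(html):
    """Split rendered HTML at each marker comment and drop empty pieces."""
    pieces = (piece.strip() for piece in html.split(MARKER))
    return [piece for piece in pieces if piece]
```

Because pandoc passes raw HTML comments through to HTML output, a marker placed in the source survives conversion and can be found again in the result.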

ricopicone commented 3 years ago

@jtbayly I wonder if some user-entered "split-here" command would be most flexible, in addition to chapter- and section-splits. It might be harder to implement than chapter- or section-splits, but it wouldn't depend on introducing a source file structure abstraction. What do you think?

jtbayly commented 3 years ago

I can’t think of any downsides, personally. I guess it’ll depend on the project owner/programmer whether something like that is actually within scope of Pandoc. I’d personally be in favor, though.

asmaier commented 2 years ago

It seems like bookdown somehow does this even though it's using Pandoc: pandoc in bookdown docs

The HTML output is split into different files and crossreferences work.

I guess this tells me there's some way of doing this now ... any ideas how?

For what it's worth, LaTeXML is also capable of splitting the output into several HTML files:

For larger documents, it is often desirable to break the result into several interlinked pages. This split, carried out before scanning, is requested by

--splitat=level where level is one of chapter, section, subsection, or subsubsection. For example, section would split the document into chapters (if any) and sections, along with separate bibliography, index and any appendices.

see https://math.nist.gov/~BMiller/LaTeXML/manual/usage/splitting/

They even support a more complex scenario:

A more complicated situation combines several TeX sources into a single interlinked site consisting of multiple pages and a composite index and bibliography.

see https://math.nist.gov/~BMiller/LaTeXML/manual/usage/site/

jgm commented 2 years ago

I guess this tells me there's some way of doing this now ... any ideas how?

See my Feb. 6 comment. We already do similar splitting in the EPUB writer. There are no big mysteries about how to do it. It's a matter of making decisions about the architecture and then actually implementing it.

rauschma commented 2 years ago

For chunking, I’d prefer:

Doing templating well is difficult and may be better done via an external general-purpose programming language (vs. via configuring Pandoc declaratively).

Output:

Input – two options:

Open question:

jgm commented 1 year ago

I've written an experimental Chunks module for generic chunking (issue6122 branch). Next step is to try to use this in the EPUB writer and iron out the kinks. Then a chunked HTML writer should be in easy reach.

jgm commented 1 year ago

I'm working on this feature now in the chunkedhtml branch.

jgm commented 1 year ago

Please see https://pandoc.org/chunkedhtml-demo/ for a demo of the current code. Comments welcome.

dm413 commented 1 year ago

Thanks for your work on this.

The demo output looks great. The cross-page links work. There are navigation links at the top. It's quite usable.

How does this work in practice? Is any of this template driven? What command line options exist? For example,

  1. Can we disable or change the navigation links at the top of each page?
  2. Can we suppress the TOC at the beginning if we have a different TOC structure?

One issue for me -- Is the TOC available in a format that can be massaged into other formats. For example:

  1. For HTMLHelp, we need to create project (.hhp), content (.hhc), and index (.hhk) files. These need the names and path of all files, and the hierarchical structure.

  2. For a chunked html web site, we might want to have a navigation panel on the side that allows you to move through the document. How could we produce that?

  3. For epub, we similarly need to have navigation files.

Can any of this be done with templates?

You may not have gotten to this stuff yet -- which is fine. Just want to see where we are and how this might develop. Thanks!

tarleb commented 1 year ago

We have the open PR #8485: it needs adjustments if branch chunkedhtml gets merged, but it would make it possible to do what you need with a small amount of Lua code.

jgm commented 1 year ago

How does this work in practice? Is any of this template driven? What command line options exist?

So far, the section splitting level is determined by --epub-chapter-level (which might need a more generic new name). The option --number-sections also has an effect. A TOC is generated currently whether or not --toc is specified, but I'll probably change that.

For this demo I used

pandoc MANUAL.txt -t chunkedhtml -o my --epub-chapter-level=2 --template data/templates/default.chunkedhtml --toc-depth=3 --number-sections

Can we disable or change the navigation links at the top of each page?

Yes, all the link rendering is done in the template, so you can remove them or change them.

<nav id="sitenav">
<div class="sitenav">
<span class="navlink">
$if(up.url)$
Up: <a href="$up.url$" accesskey="u" rel="up">$up.title$</a>
$endif$
</span>
<span class="navlink">
$if(top)$
Top: <a href="$top.url$" accesskey="t" rel="top">$if(toc-title)$$toc-title$$else$Contents$endif$</a>
$endif$
</span>
</div>
<div class="sitenav">
<span class="navlink">
$if(next.url)$
Next: <a href="$next.url$" accesskey="n" rel="next">$next.title$</a>
$endif$
</span>
<span class="navlink">
$if(previous.url)$
Previous: <a href="$previous.url$" accesskey="p" rel="previous">$previous.title$</a>
$endif$
</span>
</div>
</nav>

Can we suppress the TOC at the beginning if we have a different TOC structure?

I should implement sensitivity to --toc.

One issue for me -- Is the TOC available in a format that can be massaged into other formats. For example: For HTMLHelp, we need to create project (.hhp), content (.hhc), and index (.hhk) files. These need the names and path of all files, and the hierarchical structure.

We do have a data structure with all of this. I could maybe provide it in JSON form as a template variable? Not sure what would be the best way to make it available. Perhaps having it accessible from Lua is best.

For a chunked html web site, we might want to have a navigation panel on the side that allows you to move through the document. How could we produce that?

I plan to modify the templates to make it possible to include the TOC on every page if you want. (This would be the full TOC, though, not, say, just a section. I think that's what is most useful, no?)

dm413 commented 1 year ago

all the link rendering is done in the template, so you can remove them or change them.

awesome.

We do have a data structure with all of this. I could maybe provide it in JSON form as a template variable?

I don't know enough about Pandoc templating to know whether we could use that to generate the files. If so, that would be a nice solution.

Not sure what would be the best way to make it available. Perhaps having it accessible from Lua is best.

I'm not sure what's the best method either. I haven't done much with templates in Pandoc, so I'm probably not the best person to consult about this. It would be great if we could directly implement the HTMLHelp project and content files using templates, though I'm guessing we'd have to run Pandoc three times, once to generate the content (html files), and twice more to generate the project and content files (each time with a different template). I don't know if the template can do that -- the content file is kind-of xml, but the project file looks more like an ini file.

Lua is also an option, and could presumably generate the content and project files in a single pass with the html.

I plan to modify the templates to make it possible to include the TOC on every page if you want. (This would be the full TOC, though, not, say, just a section. I think that's what is most useful, no?)

I agree the full TOC is what most people would want. I'm not so sure about including it on every page though. Isn't that usually done by generating a separate navigation file, and referencing that in an iframe or something? I haven't done this sort of thing in many years, and html and css have changed quite a bit in the meantime, so maybe it's done differently now.

jgm commented 1 year ago

I'm setting it up to produce a JSON hierarchical sitemap in the same directory. You can consume this with a program.
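As an illustration of consuming such a sitemap, here is a short Python sketch that walks a nested structure and prints an indented outline, which could then be turned into an .hhc file or a side navigation panel. The field names used (`section`, `subsections`, `title`, `path`) and the sample data are assumptions for the sketch; the thread does not specify the schema, so check the actual file before relying on them.

```python
import json


def flatten_sitemap(node, depth=0):
    """Yield (depth, title, path) for a node and its descendants in document order."""
    section = node.get("section") or {}
    yield depth, section.get("title", ""), section.get("path", "")
    for child in node.get("subsections", []):
        yield from flatten_sitemap(child, depth + 1)


# Made-up sample data in the assumed shape.
example = json.loads("""
{"section": {"title": "Manual", "path": "index.html"},
 "subsections": [
   {"section": {"title": "Synopsis", "path": "1-synopsis.html"},
    "subsections": []},
   {"section": {"title": "Options", "path": "2-options.html"},
    "subsections": [
      {"section": {"title": "General options",
                   "path": "2.1-general-options.html"},
       "subsections": []}
    ]}
 ]}
""")

for depth, title, path in flatten_sitemap(example):
    print("  " * depth + f"{title} -> {path}")
```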

jgm commented 1 year ago

I'm calling this issue closed. Please test using the nightly at https://github.com/jgm/pandoc/actions/runs/3928284770

jgm commented 1 year ago

Reopening to explore this idea: I had introduced --split-level to replace --epub-chapter-level and also affect chunked HTML. This determines the header level at which documents are split into separate files. As implemented, it is currently independent of --toc-depth.

It occurred to me that we might be able to simplify this. Suppose we said that the splitting of chunked HTML output was determined by --toc-depth. That would support the natural assumption that each entry in the TOC would take you to an independent chunk (and not a fragment in a chunk). Then we could remove --split-level and un-deprecate --epub-chapter-level.

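So the question is whether there's any reason to allow chunking that is less fine-grained or more-fine-grained than the TOC-depth. For example,

  • more fine-grained: chunks are split on level-2 sections, but the TOC only contains entries for level-1 sections
  • less fine-grained: chunks are split on level-2 sections, but the TOC includes entries for level-3 sections (these link to a fragment on one of the chunks)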

jtbayly commented 1 year ago

I can say with certainty that the flexibility would be beneficial to me. In particular, it would look like the second scenario you outlined. Or it would look like something I described above, where I just want to be able to manually control where the splits happen. Sometimes in the same book a second level header is followed immediately by a third level header, without any paragraphs in between and other times it has what amounts to its own chapter. There’s no good way to assume when the 2nd level header should be broken out separately to its own page or be bundled in with the following 3rd level header and its content.

Manual is how I want to be able to do it.

dm413 commented 1 year ago

Most of the time I want the TOC to match the chunking level (--split-level = --toc-depth).

But occasionally I have lower level sections that I want in the TOC but don't want to make into a separate chunk (--split-level < --toc-depth, your second scenario). Basically I want these sections to be viewed within the surrounding context of the page they are shown on, but I want them to appear in the TOC.

So I would prefer to keep both these command line options. If it would be possible to make --split-level default to --toc-depth when --split-level is not specified, that would be convenient because most of the time they are the same. (But not always.)

Note that none of these options allow you to split at different levels in different parts of the document, or have different toc depths in different parts of the document. Which seems to be what @jtbayly is looking for? This has never been possible in pandoc; the --toc-depth and --epub-chapter-level (now --split-level) are fixed for the entire document. I admit that there have been times when I've wanted to be able to change the --toc-depth in a document, but I've always been able to work around it and I think trying to provide this would add unnecessary complexity.

I haven't had a chance to try out the nightly build yet. I'll do that in the next day or so. Thanks for your work on this.

ricopicone commented 1 year ago

So the question is whether there's any reason to allow chunking that is less fine-grained or more-fine-grained than the TOC-depth. For example,

  • more fine-grained: chunks are split on level-2 sections, but the TOC only contains entries for level-1 sections
  • less fine-grained: chunks are split on level-2 sections, but the TOC includes entries for level-3 sections (these link to a fragment on one of the chunks)

I prefer the flexibility of being able to do chunking level independently of TOC level. I can imagine wanting more fine-grained and less fine-grained chunking than the TOC, depending on my goals with the TOC. (I'll try to test the nightly build soon—thanks for your work on this @jgm!)
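The two scenarios can be made concrete with a small Python sketch that computes where each TOC entry would point when the split level and the TOC depth are independent. The function `toc_entries`, the `IDENT.html` file naming, and the tuple layout are all assumptions for illustration, not pandoc's behavior.

```python
def toc_entries(headings, split_level, toc_depth):
    """Return (title, href) pairs for headings no deeper than toc_depth.

    headings is a list of (level, identifier, title) in document order.
    A heading at split_level or shallower starts a new chunk file.
    """
    entries = []
    current_file = None
    for level, ident, title in headings:
        starts_chunk = level <= split_level or current_file is None
        if starts_chunk:
            current_file = f"{ident}.html"
        if level <= toc_depth:
            # Deeper headings link to a fragment inside the enclosing chunk.
            href = current_file if starts_chunk else f"{current_file}#{ident}"
            entries.append((title, href))
    return entries


headings = [
    (1, "intro", "Intro"),
    (2, "setup", "Setup"),
    (3, "linux", "Linux"),
    (1, "usage", "Usage"),
]
```

With `split_level=2` and `toc_depth=3`, the "Linux" entry links to the fragment `setup.html#linux`; with `toc_depth=1`, the `setup.html` chunk still exists but has no TOC entry of its own.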

jtbayly commented 1 year ago

To clarify, I don't need variable or manually modifiable TOC depth. I just want to be able to specify which chunks go together on a single page, regardless of the TOC settings. This seems to be the same as you, @dm413, if I'm reading you correctly.

jgm commented 1 year ago

Manual chunking isn't available -- I'm not even sure how that would be indicated. But it's true that without it you can get awkward pages that just have a title heading (when the next thing is a section that goes in another chunk).
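One way to at least spot those awkward pages ahead of time is sketched below in Python. It assumes a simplified outline where each heading is a `(level, title, has_body)` tuple; the function name `title_only_pages` and that representation are hypothetical.

```python
def title_only_pages(sections, split_level):
    """Return titles of chunks that would contain a heading and nothing else.

    sections is a list of (level, title, has_body) in document order, where
    has_body says whether any content follows the heading before the next one.
    """
    lonely = []
    for i, (level, title, has_body) in enumerate(sections):
        if level > split_level or has_body:
            continue
        nxt = sections[i + 1] if i + 1 < len(sections) else None
        # Nothing else lands in this chunk if the next heading opens its own.
        if nxt is None or nxt[0] <= split_level:
            lonely.append(title)
    return lonely


outline = [
    (1, "Part I", False),
    (2, "Chapter 1", True),
    (2, "Chapter 2", False),
    (3, "Details", True),
]
```

Splitting this outline at level 2 leaves "Part I" alone on its page, while "Chapter 2" keeps its level-3 subsection; splitting at level 3 strands both.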