Closed dm413 closed 1 year ago
This is not really just HTML. You may want to chunk up a large Markdown file into smaller Markdown files too.
Hm, maybe the first step would be writing a format-independent function
splitIntoChunks :: FilePath -> Int -> Pandoc -> [(FilePath, Pandoc)]
where the Int parameter is the heading level to split at, and the FilePath is a file path template to be used (e.g. chapter-{{ number }}.html, where {{ number }} will be replaced by the chunk number, or {{ heading }}.html, where {{ heading }} will be replaced by the full heading text (stringified), or {{ identifier }}.html, where {{ identifier }} will be replaced by the identifier on the heading).
This function would split up the document into sections and rewrite any internal links so that they point to the correct paths. Not a hard thing to write.
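To make the idea concrete, here is a minimal Python sketch of the proposed splitting, not pandoc's implementation. It assumes a toy document model (a flat list of dicts with "t" set to "Header", "Para", or "Link"), treats links as top-level blocks for simplicity, and starts a new chunk at every heading at or above the split level. The function name split_into_chunks and the dict keys are illustrative assumptions; only the {{ number }}, {{ heading }}, and {{ identifier }} placeholders come from the proposal above.

```python
import re


def split_into_chunks(template, level, blocks):
    """Split a flat block list at headings of `level` or above.

    Returns a list of (path, blocks) pairs. Blocks are copied, so the
    input list is left unchanged.
    """
    chunks = []
    for block in blocks:
        starts_chunk = block["t"] == "Header" and block["level"] <= level
        if starts_chunk or not chunks:
            chunks.append([])
        chunks[-1].append(dict(block))

    named = []
    for number, chunk in enumerate(chunks, start=1):
        # Content before the first heading has no identifier or heading text.
        head = chunk[0] if chunk[0]["t"] == "Header" else {"id": "", "text": ""}
        fields = {
            "number": str(number),
            "identifier": head["id"],
            "heading": head["text"],
        }
        # Fill {{ name }} placeholders; unknown names are left as they are.
        path = re.sub(
            r"\{\{\s*(\w+)\s*\}\}",
            lambda m: fields.get(m.group(1), m.group(0)),
            template,
        )
        named.append((path, chunk))

    # Map every heading identifier to the file that now contains it.
    home = {
        b["id"]: path
        for path, chunk in named
        for b in chunk
        if b["t"] == "Header"
    }

    # Rewrite internal links (#id) whose target ended up in another file.
    for path, chunk in named:
        for b in chunk:
            if b["t"] == "Link" and b["target"].startswith("#"):
                target_file = home.get(b["target"][1:])
                if target_file and target_file != path:
                    b["target"] = target_file + b["target"]
    return named
```

With the template chapter-{{ number }}.html and split level 1, a document with two level-1 headings yields chapter-1.html and chapter-2.html, and a link to #usage in the first chunk becomes chapter-2.html#usage if that heading lives in the second chunk.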
Perhaps there should also be an option for adding "next," "previous," and "up" links to each chunk, as in the HTML output produced by texinfo? We could use arrows instead of the words "Next", "Previous", and "Top" to avoid English-centrism?
Just adding this to Shared would be helpful. Then we'd need to think about how to integrate it into the command line. Perhaps the simplest approach would be this: if the output file is FILE.zip, then pandoc will create a zip file with chunked output in the specified output format (template FILE-#.FORMAT). So -t rst -o my.zip would produce a zip of chunked RST files, for example. A separate command line option could be provided to set the level for splitting, like the current --epub-chapter-level but more general. (Indeed, --epub-chapter-level could then be deprecated and replaced with this.)
This would be very useful.
Outputting a zip file is simple, but the first thing any makefile or batch file is going to have to do is unzip it in order to further process it. How about just specifying a folder name instead of a file name? If the folder doesn't exist already perhaps you could add a trailing slash or backslash to indicate that it's a folder.
Or maybe just an option that means output a folder of files instead of one file (that is, chunked). This is more verbose, but it's clearer what you are doing.
An option for next, previous, and up links (using arrows) would be nice.
For HtmlHelp, we also need to create the project (.hhp), content (.hhc), and index (.hhk) files. Perhaps HtmlHelp is a separate issue, and if you want me to create a new issue specifically for it I can do so. But any HtmlHelp writer will need to make use of the chunked html output option, so it's good to think about how to integrate both of these into the command line. For that matter, epub output is related as well.
How about just specifying a folder name instead of a file name? If the folder doesn't exist already perhaps you could add a trailing slash or backslash to indicate that it's a folder.
That's a possibility. I like the idea of keeping the simple invariant that pandoc produces one file, but I can see this would be more convenient.
An option for next, previous, and up links (using arrows) would be nice.
I think this is a job for the template, assuming each file would be run through the template separately. Pandoc could add metadata fields this-file: NAME, prev-file: NAME, next-file: NAME so that people can include and design those links if and as they want them in the template.
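As a rough illustration of how those fields could be populated, the Python sketch below computes one metadata mapping per chunk from the ordered list of output file names. The function name navigation_metadata is hypothetical; the field names this-file, prev-file, and next-file are the ones suggested above. The first and last chunks simply omit the missing neighbour, so a template conditional on the field would skip that link.

```python
def navigation_metadata(paths):
    """Return one metadata dict per chunk, in document order.

    `paths` is the ordered list of output file names. The first chunk
    gets no "prev-file" and the last chunk gets no "next-file".
    """
    result = []
    for i, path in enumerate(paths):
        meta = {"this-file": path}
        if i > 0:
            meta["prev-file"] = paths[i - 1]
        if i + 1 < len(paths):
            meta["next-file"] = paths[i + 1]
        result.append(meta)
    return result
```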
I think this is a job for the template
That makes sense to me!
In order to facilitate building static sites (or dumping to templates used in static sites such as jekyll or hugo), it would be useful to be able to specify a pattern for the output.
For example, I might want to run the command like:
pandoc -f markdown -t html5 \
--chunks chapters --chunk-dest ~/projects/some-site/templates/my-book/ \
{first,last,second}-chapter.md
And the output would be:
templates/
├── first-chapter.html
├── last-chapter.html
└── second-chapter.html
Or, there might be a way to specify other patterns so that someone could use config like
chunk-name: '{{ section[0]["name"] }}/{{ section[1]["name"] }}{{ ext }}'
To get output like first-chapter/first-section.html, first-chapter/second-section, etc.
It may make sense for Pandoc to work with trees of files instead of single streams:
This functionality might also be useful in filters.
I'm very excited for this possibility. Does being in "next release" mean that it is actually decided to implement it?
I'm afraid the "next release" tag has been aspirational so far... I would like to implement this, but it's going to take some thought.
Understood. That’s why I asked. Thanks so much for all your wonderful work.
It seems like bookdown somehow does this even though it's using Pandoc: pandoc in bookdown docs
The HTML output is split into different files and crossreferences work.
I guess this tells me there's some way of doing this now ... any ideas how?
It apparently happens here.
I don't know R and it's 1100 lines ... there's a lot going on here.
Fwiw, somebody made a pretty comprehensive filter-based version of multiple-output html files that fixes crossreference urls ... I haven't tested: https://groups.google.com/g/pandoc-discuss/c/bKhBB_uFW4o/m/uuLV7hMYCwAJ
It seems like bookdown somehow does this even though it's using Pandoc: pandoc in bookdown docs
The HTML output is split into different files and crossreferences work.
I guess this tells me there's some way of doing this now ... any ideas how?
I'm pretty sure this isn't the best path, but epub files are made of multiple chunks of .xhtml. For a while I've been doing this by generating .epub files with Pandoc and then using a task runner to automate the unzipping > extracting > parsing > processing > moving > fixing > renaming of the xhtml files as needed. That's an ugly hack I made a couple of years ago to solve this need for a very specific case; maybe something like that could work for you in the meantime.
Thanks @barriteau -- do you have your code for that? As may be the case for others, I'm making large html docs and having performance issues. There's only so much improvement I can get out of lazy loading images and the like ... mostly it's MathJax. But there's no significant reason for it to be one-file other than Pandoc. A stop-gap solution until this feature is implemented would be most welcome :)
Yup, but I'm afraid that in its current state it's of no use to you; it's an old Grunt task with a lot of extra and specific routines for other, unrelated stuff. I'll take a look at it to see if it's worth cleaning up for sharing and reuse. I'll let you know :)
I've looked at how Bookdown does it before. Part of the reason it is so complicated is because it supports a fair number of Pandoc options, which changes the output that it then has to process. In fact, I use Bookdown currently. One of the things that makes me hopeful about Pandoc making this change is that it might fix a couple of problems I've got with Bookdown related to its splitting process.
I still think my Feb. 6 comment above gives a good route forward on this. Most of the technical issues have already been solved, since we already have to chunk things for EPUB. It would be good to have code that could simply be reused by the EPUB writer. I think the issue about "Next/Previous/Up" links could be solved simply by populating template variables; using a custom template, you could get whatever kind of navigation links you like.
So, rough plan would be:
- Write splitIntoChunks :: FilePathTemplate -> Level -> Pandoc -> [(FilePath, Pandoc)]. Look at the EPUB writer's splitting code in implementing this.
- Support a .zip container for any output format, as follows: use splitIntoChunks to split the document into chunks, …
@jgm Still a somewhat vague idea of mine – do you think it’s possible to make your ideas more general? For example:
- … [(FilePath, Chunk)].
- Chunk is either:
  - Pandoc
  - FileData. Not sure what exactly that type would look like. Sometimes data in RAM, sometimes a reference to a file on a hard drive?
- transformationFunction :: [(FilePath, Chunk)] -> [(FilePath, Chunk)]
- … FileData to Pandoc and back.
Something else to consider:
In bookdown you can specify to split the HTML up by chapter, by section, or by file. I like that flexibility, fwiw, especially the split by file option. Split by chapter sometimes gives me way too long of webpages. Split by section sometimes leaves me with nothing but a chapter title on one webpage, and then you've got to go to the next webpage to get to the next section. Split by file lets me decide.
@rauschma we already have a MediaBag to contain assets used by the document. These get passed through the plumbing in PandocMonad, so we shouldn't need to represent them explicitly. But I take the core of your idea to be that we might want to support "trees" (directories containing multiple documents) in both input and output (my proposal above is output only). This would require, at least, the change noted in https://groups.google.com/g/pandoc-discuss/c/M_UPUFs1G6o/m/hKGN-V8YBwAJ.
@jtbayly - I don't know what "split by file" would really mean, when you're splitting up a Pandoc document. (It doesn't come chunked into files.)
But it accepts multiple files as input, doesn't it?
jgm, I think you are referring to your Feb 6 comment, not Feb 20. <rant>I detest github's "relative" dates. When I see "commented 22 days ago", I have no idea when that was without looking at a calendar. And "2 months ago" is meaningless.</rant>
In terms of planning, how would the TOC be done, and could that be templated as well? I'm thinking formats such as epub and htmlhelp need a TOC file in one form or another, and it would be nice if the output zip file (or directory) contained the TOC information in a form that could be turned into the required file. Even if you only intend to use the chunked html as a static web-site, you probably want to generate a TOC someplace in your site, perhaps a banner or column on every page. This file should respect the --toc-depth option.
Another question I have is how would I create an index. Here I am referring to an alphabetical index like you might see at the end of a book, not a TOC. Epub, HtmlHelp, and pdf all support such a concept. AFAIK Pandoc does not support an index natively. This may be a separate issue, and off-topic here, but I'd be interested in any thoughts you have about how to do this, even if it involves a filter and/or post-processing the output zip file/directory.
@jtbayly Yes, you can specify multiple files as input; however, everything is concatenated before parsing, and the parser doesn't even know which parts come from which files (this could be improved by https://groups.google.com/g/pandoc-discuss/c/M_UPUFs1G6o/m/hKGN-V8YBwAJ); moreover, the AST doesn't contain slots to represent source positions. A 'Pandoc' is an abstract representation of a document; you can get the same 'Pandoc' from multiple files or from one.
@dm413 Yes, we need to figure out how to deal with the TOC. I think the simplest option is to generate a TOC for the whole document (tree) and put it in one of the generated files. But this may not be the best approach if you want the TOC in a side banner.
As for an index, that's a separate issue in a way, since you could want an index even with non-chunked output. Currently there's no built-in way to construct one, but it's certainly possible to use a filter to define an indexing system. One difficulty with building in a general index system is that the requirements tend to be format-dependent. If you want, you can create a separate issue for indexes on this tracker (if there isn't one already).
I did a quick search, there is issue #6415 Built-in support for indices?
the parser doesn't even know which parts come from which files (this could be improved by https://groups.google.com/g/pandoc-discuss/c/M_UPUFs1G6o/m/hKGN-V8YBwAJ);
Interesting proposal.
I took a look at the bookdown code, since I wondered how they did it, given what you said about how Pandoc works. Apparently they add an HTML comment everywhere a split needs to happen before sending it to Pandoc, then they parse it afterwards using those comments to figure out where to split.
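The marker approach described here can be sketched in a few lines of Python. This is not bookdown's code: the marker text and both function names are made up for illustration, and the sketch assumes the converter passes HTML comments through to the rendered output unchanged.

```python
# Hypothetical marker text; bookdown's actual marker may differ.
SPLIT_MARKER = "<!--split-here-->"


def mark_sources(sources):
    """Join source documents, putting a marker comment between them."""
    return ("\n\n" + SPLIT_MARKER + "\n\n").join(sources)


def split_rendered(html):
    """Split rendered HTML at every marker, dropping empty pieces."""
    return [part.strip() for part in html.split(SPLIT_MARKER) if part.strip()]
```

The joined text from mark_sources would be converted in one run, so cross-references resolve against the whole document, and split_rendered then cuts the result back into per-file pieces.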
@jtbayly I wonder if some user-entered "split-here" command would be most flexible, in addition to chapter- and section-splits. It might be harder to implement than chapter- or section-splits, but it wouldn't depend on introducing a source file structure abstraction. What do you think?
I can’t think of any downsides, personally. I guess it’ll depend on the project owner/programmer whether something like that is actually within scope of Pandoc. I’d personally be in favor, though.
It seems like bookdown somehow does this even though it's using Pandoc: pandoc in bookdown docs
The HTML output is split into different files and crossreferences work.
I guess this tells me there's some way of doing this now ... any ideas how?
For what it's worth also LaTeXML is capable of splitting the output into several html files:
For larger documents, it is often desirable to break the result into several interlinked pages. This split, carried out before scanning, is requested by --splitat=level where level is one of chapter, section, subsection, or subsubsection. For example, section would split the document into chapters (if any) and sections, along with separate bibliography, index and any appendices.
see https://math.nist.gov/~BMiller/LaTeXML/manual/usage/splitting/
They even support a more complex scenario:
A more complicated situation combines several TeX sources into a single interlinked site consisting of multiple pages and a composite index and bibliography.
see https://math.nist.gov/~BMiller/LaTeXML/manual/usage/site/
I guess this tells me there's some way of doing this now ... any ideas how?
See my Feb. 6 comment. We already do similar splitting in the EPUB writer. There are no big mysteries about how to do it. It's a matter of making decisions about the architecture and then actually implementing it.
For chunking, I’d prefer:
- … <body>).
- … <!DOCTYPE html> to <body>).
Doing templating well is difficult and maybe better done via an external general-purpose programming language (vs. via configuring Pandoc declaratively).
Output:
- … <title>).
- [](#id) links become cross-file links.
Input – two options:
Open question:
I've written an experimental Chunks module for generic chunk-ing (issue6122 branch). Next step is to try to use this in the EPUB writer and iron out the kinks. Then a chunked HTML writer should be in easy reach.
I'm working on this feature now in the chunkedhtml branch.
Please see https://pandoc.org/chunkedhtml-demo/ for a demo of the current code. Comments welcome.
Thanks for your work on this.
The demo output looks great. The cross-page links work. There are navigation links at the top. It's quite usable.
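How does this work in practice? Is any of this template driven? What command line options exist? For example,
- Can we disable or change the navigation links at the top of each page?
- Can we suppress the TOC at the beginning if we have a different TOC structure?
One issue for me -- is the TOC available in a format that can be massaged into other formats? For example: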
For HTMLHelp, we need to create project (.hhp), content (.hhc), and index (.hhk) files. These need the names and path of all files, and the hierarchical structure.
For a chunked html web site, we might want to have a navigation panel on the side that allows you to move through the document. How could we produce that?
For epub, we similarly need to have navigation files.
Can any of this be done with templates?
You may not have gotten to this stuff yet -- which is fine. Just want to see where we are and how this might develop. Thanks!
We have the open PR #8485: it needs adjustments if branch chunkedhtml gets merged, but it would make it possible to do what you need with a small amount of Lua code.
How does this work in practice? Is any of this template driven? What command line options exist?
So far, the section splitting level is determined by --epub-chapter-level (which might need a more generic new name). The option --number-sections also has an effect. A TOC is generated currently whether or not --toc is specified, but I'll probably change that.
For this demo I used
pandoc MANUAL.txt -t chunkedhtml -o my --epub-chapter-level=2 --template data/templates/default.chunkedhtml --toc-depth=3 --number-sections
Can we disable or change the navigation links at the top of each page?
Yes, all the link rendering is done in the template, so you can remove them or change them.
<nav id="sitenav">
<div class="sitenav">
<span class="navlink">
$if(up.url)$
Up: <a href="$up.url$" accesskey="u" rel="up">$up.title$</a>
$endif$
</span>
<span class="navlink">
$if(top)$
Top: <a href="$top.url$" accesskey="t" rel="top">$if(toc-title)$$toc-title$$else$Contents$endif$</a>
$endif$
</span>
</div>
<div class="sitenav">
<span class="navlink">
$if(next.url)$
Next: <a href="$next.url$" accesskey="n" rel="next">$next.title$</a>
$endif$
</span>
<span class="navlink">
$if(previous.url)$
Previous: <a href="$previous.url$" accesskey="p" rel="previous">$previous.title$</a>
$endif$
</span>
</div>
</nav>
Can we suppress the TOC at the beginning if we have a different TOC structure?
I should implement sensitivity to --toc.
One issue for me -- Is the TOC available in a format that can be massaged into other formats. For example: For HTMLHelp, we need to create project (.hhp), content (.hhc), and index (.hhk) files. These need the names and path of all files, and the hierarchical structure.
We do have a data structure with all of this. I could maybe provide it in JSON form as a template variable? Not sure what would be the best way to make it available. Perhaps having it accessible from Lua is best.
For a chunked html web site, we might want to have a navigation panel on the side that allows you to move through the document. How could we produce that?
I plan to modify the templates to make it possible to include the TOC on every page if you want. (This would be the full TOC, though, not, say, just a section. I think that's what is most useful, no?)
all the link rendering is done in the template, so you can remove them or change them.
awesome.
We do have a data structure with all of this. I could maybe provide it in JSON form as a template variable?
I don't know enough about Pandoc templating to know whether we could use that to generate the files. If so, that would be a nice solution.
Not sure what would be the best way to make it available. Perhaps having it accessible from Lua is best.
I'm not sure what's the best method either. I haven't done much with templates in Pandoc, so I'm probably not the best person to consult about this. It would be great if we could directly implement the HTMLHelp project and content files using templates, though I'm guessing we'd have to run Pandoc three times, once to generate the content (html files), and twice more to generate the project and content files (each time with a different template). I don't know if the template can do that -- the content file is kind-of xml, but the project file looks more like an ini file.
Lua is also an option, and could presumably generate the content and project files in a single pass with the html.
I plan to modify the templates to make it possible to include the TOC on every page if you want. (This would be the full TOC, though, not, say, just a section. I think that's what is most useful, no?)
I agree the full TOC is what most people would want. I'm not so sure about including it on every page though. Isn't that usually done by generating a separate navigation file, and referencing that in an iframe or something? I haven't done this sort of thing in many years, and html and css have changed quite a bit in the meantime, so maybe it's done differently now.
I'm setting it up to produce a json hierarchical sitemap in the same directory. You can consume this with a program.
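For anyone wanting to consume such a sitemap, here is a small Python sketch that flattens a hierarchical JSON sitemap into (depth, title, path) rows, which could then be turned into a side panel or an HTMLHelp content file. The key names "section", "title", "path", and "subsections" are assumptions made for illustration, as are both function names; check the generated file for the actual schema before relying on them.

```python
import json


def load_sitemap(filename):
    """Read a JSON sitemap from disk."""
    with open(filename, encoding="utf-8") as handle:
        return json.load(handle)


def flatten_sitemap(node, depth=0):
    """Walk the tree depth-first and return (depth, title, path) rows.

    Assumed node shape (hypothetical):
    {"section": {"title": ..., "path": ...}, "subsections": [...]}
    """
    section = node["section"]
    rows = [(depth, section["title"], section["path"])]
    for child in node.get("subsections", []):
        rows.extend(flatten_sitemap(child, depth + 1))
    return rows
```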
I'm calling this issue closed. Please test using the nightly at https://github.com/jgm/pandoc/actions/runs/3928284770
Reopening to explore this idea: I had introduced --split-level to replace --epub-chapter-level and also affect chunked HTML. This determines the header level at which documents are split into separate files. As implemented, it is currently independent of --toc-depth.
It occurred to me that we might be able to simplify this. Suppose we said that the splitting of chunked HTML output was determined by --toc-depth. That would support the natural assumption that each entry in the TOC would take you to an independent chunk (and not a fragment in a chunk). Then we could remove --split-level and un-deprecate --epub-chapter-level.
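So the question is whether there's any reason to allow chunking that is less fine-grained or more-fine-grained than the TOC-depth. For example,
- more fine-grained: chunks are split on level-2 sections, but the TOC only contains entries for level-1 sections
- less fine-grained: chunks are split on level-2 sections, but the TOC includes entries for level-3 sections (these link to a fragment on one of the chunks)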
I can say with certainty that the flexibility would be beneficial to me. In particular, it would look like the second scenario you outlined. Or it would look like something I described above, where I just want to be able to manually control where the splits happen. Sometimes in the same book a second level header is followed immediately by a third level header, without any paragraphs in between and other times it has what amounts to its own chapter. There’s no good way to assume when the 2nd level header should be broken out separately to its own page or be bundled in with the following 3rd level header and its content.
Manual is how I want to be able to do it.
Most of the time I want the TOC to match the chunking level (--split-level = --toc-depth).
But occasionally I have lower level sections that I want in the TOC but don't want to make into a separate chunk (--split-level < --toc-depth, your second scenario). Basically I want these sections to be viewed within the surrounding context of the page they are shown on, but I want them to appear in the TOC.
So I would prefer to keep both these command line options. If it would be possible to make --split-level default to --toc-depth when --split-level is not specified, that would be convenient because most of the time they are the same. (But not always.)
Note that none of these options allow you to split at different levels in different parts of the document, or have different toc depths in different parts of the document. Which seems to be what @jtbayly is looking for? This has never been possible in pandoc; the --toc-depth and --epub-chapter-level (now --split-level) are fixed for the entire document. I admit that there have been times when I've wanted to be able to change the --toc-depth in a document, but I've always been able to work around it and I think trying to provide this would add unnecessary complexity.
I haven't had a chance to try out the nightly build yet. I'll do that in the next day or so. Thanks for your work on this.
So the question is whether there's any reason to allow chunking that is less fine-grained or more-fine-grained than the TOC-depth. For example,
- more fine-grained: chunks are split on level-2 sections, but the TOC only contains entries for level-1 sections
- less fine-grained: chunks are split on level-2 sections, but the TOC includes entries for level-3 sections (these link to a fragment on one of the chunks)
I prefer the flexibility of being able to do chunking level independently of TOC level. I can imagine wanting more fine-grained and less fine-grained chunking than the TOC, depending on my goals with the TOC. (I'll try to test the nightly build soon—thanks for your work on this @jgm!)
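To see how the two settings interact, here is a small Python sketch that computes TOC targets from a list of headings. It is an illustration under stated assumptions, not pandoc's behaviour: the function name toc_links, the file naming (identifier plus .html), and the index.html fallback for content before the first chunk are all made up. When split_level is omitted it falls back to toc_depth, as suggested earlier in the thread.

```python
def toc_links(headings, toc_depth, split_level=None):
    """Compute (identifier, href) TOC entries.

    `headings` is a list of (level, identifier) pairs in document order.
    Headings at or above `split_level` start a new file; deeper headings
    that still appear in the TOC link to a fragment inside the current file.
    """
    if split_level is None:
        split_level = toc_depth  # default suggested in the thread
    links = []
    current_file = "index.html"  # hypothetical name for leading content
    for level, ident in headings:
        starts_chunk = level <= split_level
        if starts_chunk:
            current_file = ident + ".html"
        if level <= toc_depth:
            href = current_file if starts_chunk else current_file + "#" + ident
            links.append((ident, href))
    return links
```

With headings intro (level 1), setup (level 2), linux (level 3), and usage (level 1), splitting at level 2 with a TOC depth of 3 gives a fragment link setup.html#linux, while a TOC depth of 1 lists only intro.html and usage.html even though setup still gets its own file.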
To clarify, I don't need variable or manually modifiable TOC depth. I just want to be able to specify which chunks go together on a single page, regardless of the TOC settings. This seems to be the same as you, @dm413, if I'm reading you correctly.
Manual chunking isn't available -- I'm not even sure how that would be indicated. But it's true that without it you can get awkward pages that just have a title heading (when the next thing is a section that goes in another chunk).
It would be useful if Pandoc could produce multiple output files by splitting the output based on section (header) levels. The output files should maintain links across files, and the table of contents should link to all files.
Chunked HTML output would produce a set or folder of HTML files. This is useful for generating static websites (for example).
HTMLHelp output is a compressed version of chunked HTML specific to Windows. The way this is done by other tools (such as doxygen) is to generate a folder of chunked HTML along with an HTMLHelp project file and content file. And perhaps an index file, but I don't think Pandoc has a built-in concept of index terms, so I would skip this for now. These files are then run through HTMLHelp Workshop, a Microsoft tool that is used to generate the HTMLHelp file.
HTMLHelp has its own pane for the TOC, generated from the content file. The content file should respect the pandoc toc-depth setting. Since there is a separate TOC pane, the normal TOC at the top of the file should be suppressed by default.
You could also consider adding these as input formats. For chunked HTML, the issues seem to be what order to read the files, and making sure the links are correctly handled. For HTMLHelp (on Windows), the HTMLHelp reader can split a HTMLHelp (chm) file into the original discrete files for further processing in the same way as chunked HTML.
Note that the already supported epub format is another version of a chunked html format.
This issue has been raised in the pandoc-discuss mailing list. Various ideas have been proposed, including:
Add "Next" and "Previous" links to each HTML output page. This probably needs to be an optional feature.
Extend the idea of chunked output to formats other than HTML. For example, individual chapters sent to separate ODT or DOCX files (or RST, markdown, etc.).