standardebooks / tools

The Standard Ebooks toolset for producing our ebook files.

Split file #414

Closed vr8hub closed 3 years ago

vr8hub commented 3 years ago

The current method of splitting body into chapters has a couple of opportunities for improvement:

1. The step-by-step guide uses a shell command to insert the places to split instead of the se command, and
2. If the book has any frontmatter, it throws off the numbering of all the chapters, unless it is handled manually, in which case that becomes point 3,
3. Keeping frontmatter from interfering with the chapter numbering requires manual intervention.

I've long had an idea of addressing those, but spurred on by a recent post to the list and having three books in a row with frontmatter, I finally decided to tackle it. I have a working PR that uses two new parameters.

-g --header-tag takes the header tag you want to split on, i.e. h2, h3, etc.

-m --frontmatter-ids takes a comma-delimited list of frontmatter ids. These are exhausted before starting with chapter numbers, e.g. -m "foreword,preface" will use foreword as the id/filename of the first file, preface as the second, then start with chapter-X, etc., on the third.
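As a rough illustration of the naming order described above (frontmatter ids first, then chapter numbers), here is a minimal sketch. It is not the code from the PR; the function name `split_names` is hypothetical, and the only behavior taken from the description is that a comma-delimited list is exhausted before `chapter-X` names begin.

```python
from itertools import count, islice


def split_names(frontmatter_ids: str):
    """Yield ids/filenames for each split, in order.

    Hypothetical helper: frontmatter ids from the comma-delimited
    string come first, then chapter-1, chapter-2, ... without end.
    """
    for frontmatter_id in (part.strip() for part in frontmatter_ids.split(",")):
        if frontmatter_id:  # an empty string means no frontmatter at all
            yield frontmatter_id
    for number in count(1):
        yield f"chapter-{number}"


# Names for the first four splits when -m "foreword,preface" is given.
print(list(islice(split_names("foreword,preface"), 4)))
# ['foreword', 'preface', 'chapter-1', 'chapter-2']
```

With an empty string, the generator starts at `chapter-1` right away, which matches the "defaults do what they do today" intent.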

How these are used depends on whether we continue to support manually entering the split tags. I can't imagine that anyone actually does that, but, hey, different strokes.

If we want to continue to allow for that, then the parameters would work like this.

With neither of the above parameters (both optional, both defaulting to empty), split file works exactly as it does today: it splits based on the tags, which would be manually inserted. No tags, no split.
But to split on <h2>, -g h2 would be specified instead of doing the extra shell step. I haven't investigated whether there's a way for an arg to be optional (defaulting to empty) but take a default value when specified without one. I would guess there's a way to do that, and if this is what we want to do, I'll look into it; if so, a bare -g would default to h2, so se split-file -g would split on h2.
And, if there's frontmatter, then specify the id(s) in -m.
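On the open question of an optional arg that takes a default when given without a value: if the command parses its arguments with Python's `argparse`, `nargs="?"` together with `const` does this. The sketch below assumes that, and the parser name and help strings are made up for illustration; it is not the toolset's actual parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser with the two proposed flags."""
    parser = argparse.ArgumentParser(prog="split-file-sketch")
    parser.add_argument(
        "-g", "--header-tag",
        nargs="?",     # the value after -g is optional
        const="h2",    # used when -g is given with no value
        default="",    # used when -g is absent: split on manual markers only
        help="header tag to split on, e.g. h2 or h3",
    )
    parser.add_argument(
        "-m", "--frontmatter-ids",
        default="",
        help="comma-delimited frontmatter ids, e.g. foreword,preface",
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).header_tag)            # empty string
print(parser.parse_args(["-g"]).header_tag)        # h2
print(parser.parse_args(["-g", "h3"]).header_tag)  # h3
```

One caveat with this approach: if a positional filename follows directly after a bare `-g`, argparse will likely take the filename as the flag's value, so the bare form would need to come last or be separated from the filename by another option.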

If we don't think manually adding the tags is needed, then the parameters are the same, but in that case -g would default to h2, i.e. se split-file would by default split on h2. And -m would work the same.

Thoughts?

[Edit: I changed the first parameter to -g last night because of conflict with help, but I forgot by this morning when I wrote this; corrected.]

acabal commented 3 years ago

I think the ultimate problem is that PG texts vary so widely that manual review and probably hand editing of the source html is always going to be required.

We have to at least open a text editor to decide which h# tag (if any!) represents the section divisions. While we're there we may as well use the editor's find and replace to add the split markers, so that we can review the results and make sure we're splitting at the right place. If h# aren't used for headers, which is not uncommon with PG texts (especially older ones), then we have to hand edit anyway.

Re. frontmatter, again we're in an editor to decide what IDs we're going to split on. But, PG often uses IDs like pg4324 to denote sections, not foreword or preface, so the script can't really know what template to wrap around those sections, or what to name them. What if there are no IDs at all? Or a missing ID on the first chapter, so the script doesn't know where to stop? But, since we're in an editor already, it's easy enough to just cut those sections out and put them in a temp file somewhere while we process the actual body text.

So I'm not convinced there's going to be a huge gain here, and at the same time we're adding a bunch of flags people have to learn, and toolset maintenance burden.

Ultimately, the haphazard way PG books are formatted is why we need a flexible approach like splitting at hand-inserted markers, instead of being able to formulaically specify a few command line params. Ironically, that's something we could do with a Standard Ebook!

acabal commented 3 years ago

If anything, maybe what would be most helpful is to edit the step by step guide to remind people that using sed to add those markers isn't a strict requirement, just a suggested approach. Using a text editor is just fine and probably how most people are going to do it anyway.

vr8hub commented 3 years ago

I am under no illusions that the way I do things is the way everyone does them. But you shouldn’t be, either. :)

I’ve done somewhere in the neighborhood of 25-30 books that have been everywhere from the first 100 or so at PG to fairly recent ones. I have manually added the split tags on exactly zero of those books, or even considered doing so. So while I'm more than aware that PG formatting is haphazard, I have yet to encounter one that doesn't have h# tags at the chapter breaks. And even if there is one, handling that book would not be affected at all by these changes, because the defaults would still do exactly what they do today. But that hypothetical book is an aberration, not the norm.

We obviously have to open body.xhtml in a text editor in order to clean it up for the first commit. But there’s no reason to force every user to do something that a command can do a lot easier—the user shouldn’t have to remember to do it, they shouldn’t have to take the extra step to do it, etc. We have to do an se clean, so if it handles where to break, too, then that is a clear gain for the user.

The frontmatter changes are using the ids to put on the files, not looking for them within the file. IOW, if the ids parm was “preface”, then the first h2 it encounters (regardless of what’s in the h2 tag) gets the id and filename of "preface". The contents of the h2 tag stay, just like they do today. We don’t use a different template today, and we don’t need to do so tomorrow. Everything stays exactly the same. But instead of having to 1) manually move the frontmatter out of the way, and then 2) manually edit those frontmatter files to put the template information in, all of that is handled automatically by something we're already doing. Those two manual steps are in fact the steps I’ve done for all my books until now, and while “it’s easy enough” is arguable, it’s still a pain, because it’s two manual steps in a process that has no reason to be manual. This change eliminates those two manual steps.
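To make the mechanics above concrete, here is a small sketch of splitting on a header tag and naming the pieces by position only, leaving the heading contents untouched. It is an approximation for discussion, not the PR's code: the function names are hypothetical, and a regex stands in for whatever parsing the real command does.

```python
import re


def split_on_header(body: str, header_tag: str = "h2") -> list[str]:
    """Split body text just before each opening header tag.

    A zero-width lookahead keeps each heading with the section that
    follows it. Text before the first heading becomes its own piece.
    """
    pattern = re.compile(rf"(?=<{re.escape(header_tag)}\b)")
    return [part for part in pattern.split(body) if part.strip()]


def name_sections(sections: list[str], frontmatter_ids: list[str]) -> list[tuple[str, str]]:
    """Pair each section with a filename, by position only.

    The first len(frontmatter_ids) sections take the given ids; the
    rest are numbered chapter-1, chapter-2, and so on.
    """
    named = []
    chapter_number = 0
    for index, section in enumerate(sections):
        if index < len(frontmatter_ids):
            name = frontmatter_ids[index]
        else:
            chapter_number += 1
            name = f"chapter-{chapter_number}"
        named.append((f"{name}.xhtml", section))
    return named


body = "<h2>To the Reader</h2><p>a</p><h2>I</h2><p>b</p><h2>II</h2><p>c</p>"
for filename, section in name_sections(split_on_header(body), ["preface"]):
    print(filename, section)
```

Here the first section is named preface.xhtml even though its heading says "To the Reader", because the id is applied to the split rather than looked up in the source.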

A “huge” gain? No. But it is a gain, for 99% of books and for all maintainers, including, especially, new ones. (There are two gains here: the first one, not having to add the split tags, is a gain on 99.9% of books. The second one, not having to manually deal with frontmatter, is a gain on what, maybe 50% of books? Whatever the % is, it saves two manual steps on every one of them.)

No one has to learn the flags. The default will continue to do exactly what it does now. If you’d rather handle things manually, it will continue to work that way. But if you don't, you can save yourself three or four manual steps. And I don't think there's much “maintenance burden” for split-file; the code hasn’t been touched since May 2020. :)

Like join-words from several months ago, I’m going to use this regardless, so this isn't a big deal to me one way or the other. (Pretty much everything I've said here applies to join-words as well.) I'm pushing back because I think you’re making every maintainer deal with manual steps that there’s no reason for them to have to worry about 99.9% of the time. Especially for new maintainers, reducing the steps they have to deal with manually helps them be more successful, and more likely to produce more books.

But, you're the boss. Close the issue if you still feel the same, I'll continue to use my changes, and we'll both go away happy and content. :) No harm, no foul.

acabal commented 3 years ago

In your suggested steps we're hand-editing the source anyway to add IDs to frontmatter, so since we're already in there doing hand editing it's just as easy to cut and paste the whole thing somewhere else while we manipulate the rest of the file. Especially since each frontmatter section will require a different template.

Likewise I don't think having a flag for h# is significantly easier than just a find-and-replace in the file we already have open anyway, and with the file open we have the benefit of correcting mistakes or inconsistencies in the source. After all, if we're comparing typing different things in to a terminal, well, that's what sed is--only just slightly more wordy than specifying flags and arguments, and not our support burden.

Additionally, we're introducing a conflicting way of viewing how files are split. In the existing way, splits occur at the inserted comments and affect the previous section and the following section (prev becomes chapter 1, following becomes chapter 2). In the proposed way for frontmatter, they would occur at inserted IDs and only affect the following section, not the preceding section. (id="preface" takes everything that follows as being the preface... but what happens to the preceding matter?) That's more cognitive overhead--now the tool has two different ways of modeling the same event, which is splitting a file.

Lastly the maintenance burden for the proposed flags would not be insignificant, which is really my main concern besides the added overhead of having more flags for the user to inspect even if they don't end up using them. So I'm going to decline this for now but that doesn't mean there might not be different approaches to the task of splitting files that we could explore later.

vr8hub commented 3 years ago

> In your suggested steps we're hand-editing the source anyway to add IDs to frontmatter, so since we're already in there doing hand editing it's just as easy to cut and paste the whole thing somewhere else while we manipulate the rest of the file. Especially since each frontmatter section will require a different template.

No. Again, we’re not touching the source at all for this. The preface has an h2 just like the chapters do. (In all the PG books I’ve done.)

> Likewise I don't think having a flag for h# is significantly easier than just a find-and-replace in the file we already have open anyway, and with the file open we have the benefit of correcting mistakes or inconsistencies in the source. After all, if we're comparing typing different things in to a terminal, well, that's what sed is--only just slightly more wordy than specifying flags and arguments, and not our support burden.

Well, speaking as a maintainer, it is easier. :) And we’re not comparing anything. The h2’s are already in the file!

> Additionally, we're introducing a conflicting way of viewing how files are split. In the existing way, splits occur at the inserted comments and affect the previous section and the following section (prev becomes chapter 1, following becomes chapter 2). In the proposed way for frontmatter, they would occur at inserted IDs and only affect the following section, not the preceding section. (id="preface" takes everything that follows as being the preface... but what happens to the preceding matter?) That's more cognitive overhead--now the tool has two different ways of modeling the same event, which is splitting a file.

No. There is no difference in the split. The only difference is how the split is named. Today, every split is named chapter-X.xhtml. The first split is thus named chapter-1.xhtml, the second chapter-2.xhtml, etc. With the id specified, the first split is named preface.xhtml, and the second chapter-1.xhtml. The naming (of the file and the id) is the only thing that’s different.

> Lastly the maintenance burden for the proposed flags would not be insignificant, which is really my main concern besides the added overhead of having more flags for the user to inspect even if they don't end up using them. So I'm going to decline this for now but that doesn't mean there might not be different approaches to the task of splitting files that we could explore later.

Again, hasn’t been touched in a year. :) So yes, I believe the maintenance would be insignificant.

I’m not arguing here for inclusion; as I said, I’m fine that you declined. But you’re misrepresenting or misunderstanding what it’s doing (which is understandable, you haven’t seen the code), so I’m clarifying your points that are off-base from what’s actually happening.