standardebooks / tools

The Standard Ebooks toolset for producing our ebook files.
Other
1.42k stars 125 forks source link

Subtitles for volumes and parts missing from toc #150

Closed michael-77 closed 5 years ago

michael-77 commented 5 years ago

When I run print-toc on Les Miserables it is omitting the subtitles for volumes and parts (books in the parlance of Les Mis).

Additionally the word Volume and Book are wrapped onto the next line or the file.

<li>
    <a href="text/halftitle.xhtml">Les Misérables</a>
    <ol>
        <li>
            <a href="text/volume-1.xhtml">
Volume <span epub:type="z3998:roman">I</span></a>
            <ol>
                <li>
                    <a href="text/book-1-1.xhtml">
        Book <span epub:type="z3998:roman">I</span></a>
                    <ol>
                        <li>
                            <a href="text/chapter-1-1-1.xhtml"><span epub:type="z3998:roman">I</span>: <abbr>M.</abbr> Myriel</a>
                        </li>

This should be

<li>
    <a href="text/halftitle.xhtml">Les Misérables</a>
    <ol>
        <li>
            <a href="text/volume-1.xhtml">Volume <span epub:type="z3998:roman">I</span>: Fantine</a>
            <ol>
                <li>
                    <a href="text/book-1-1.xhtml">Book <span epub:type="z3998:roman">I</span>: A Just Man</a>
                    <ol>
                        <li>
                            <a href="text/chapter-1-1-1.xhtml"><span epub:type="z3998:roman">I</span>: <abbr>M.</abbr> Myriel</a>
                        </li>

To reproduce clone the repository at https://github.com/michael-77/victor-hugo_les-miserables_isabel-f-hapgood, run se print-toc . -i and compare the changes.

Oh, how I wish I'd had this when writing the toc for this monster - its a fabulous tool!

PS: I added the <abbr> tags back in by hand for now as I know that's in the pipeline.

michael-77 commented 5 years ago

@drgrigg - sorry forgot to tag you in this issue.

acabal commented 5 years ago

They're wrapping to the next line because you probably forgot the first child span. See the semantics manual for section headers with subtitles.

I think I told David that we don't want subtitles for book/parts elements, but now that I review the lint code I may have been mistaken in saying that. Let me look at this a little more.

acabal commented 5 years ago

Yeah I think I told @drgrigg might have been in error. In my notes for the draft unified manual, I have written Don't include subtitle of sections that are not numbered. For example, OK: "II: Some stuff happens" WRONG: "Epilogue: The end"; OK: "Epilogue"

So with that in mind, we should include subtitles for numbered books and parts as well as chapters. But we would not include subtitles if the section is not numbered. Lint checks for this successfully, and now that I run it on some of the new ToCs it's outputting an error.

I think this was my mistake David, how tough would it be to correct this behavior?

acabal commented 5 years ago

@drgrigg see se_epub_lint.py line 550 for a detailed comment on the algorithm lint uses to check the ToC. You can also see how lint does its parsing and checking, and you might find that helpful too.

drgrigg commented 5 years ago

Easy peasy, but it will have to wait until Monday as I'm out of town for a few days.

drgrigg commented 5 years ago

Well, now that I look at that lint comment, not quite so easy-peasy, but still no problem. But it will have to wait a day or two.

michael-77 commented 5 years ago

Ok, I can see what you'll mean about the missing span (although from another ebook which had recently had the TOC updated rather than the semantics pages of the website).

Thanks for the pointer.

drgrigg commented 5 years ago

I've adopted your logic in lint but not the actual code; the reason being that I think it's a good thing for lint and print-toc to use different methods to calculate what should be in the ToC. That way we'll test one method against the other and spot any unexpected behaviour.

I'm now running print-toc against the entire corpus, and of course it's highlighting a lot of interesting issues, which I'm slowly working through.

One such occurs in A Christmas Carol: https://github.com/standardebooks/charles-dickens_a-christmas-carol.git .

This has "Staves", and the existing ToC has:

<a href="text/halftitle.xhtml">A Christmas Carol</a>
                    <ol>
                        <li>
!                           <a href="text/chapter-1.xhtml">Stave <span epub:type="z3998:roman">I</span>: Marley’s Ghost</a>
                        </li>
                        <li>
!                           <a href="text/chapter-2.xhtml">Stave <span epub:type="z3998:roman">II</span>: The First of the Three Spirits</a>
                        </li>
                        <li>
!                           <a href="text/chapter-3.xhtml">Stave <span epub:type="z3998:roman">III</span>: The Second of the Three Spirits</a>
                        </li>
                        <li>
!                           <a href="text/chapter-4.xhtml">Stave <span epub:type="z3998:roman">IV</span>: The Last of the Spirits</a>
                        </li>
                        <li>
!                           <a href="text/chapter-5.xhtml">Stave <span epub:type="z3998:roman">V</span>: The End of It</a>
                        </li>
                    </ol>
                </li>

Whereas my current code ignores the subtitles because it's not a Part or Division:

                    <a href="text/halftitle.xhtml">A Christmas Carol</a>
                    <ol>
                        <li>
!                           <a href="text/chapter-1.xhtml">Stave <span epub:type="z3998:roman">I</span></a>
                        </li>
                        <li>
!                           <a href="text/chapter-2.xhtml">Stave <span epub:type="z3998:roman">II</span></a>
                        </li>
                        <li>
!                           <a href="text/chapter-3.xhtml">Stave <span epub:type="z3998:roman">III</span></a>
                        </li>
                        <li>
!                           <a href="text/chapter-4.xhtml">Stave <span epub:type="z3998:roman">IV</span></a>
                        </li>
                        <li>
!                           <a href="text/chapter-5.xhtml">Stave <span epub:type="z3998:roman">V</span></a>
                        </li>
                    </ol>
                </li>

Which is correct? (Exclamation marks above are from the bash 'diff' command showing where there are differences).

drgrigg commented 5 years ago

Please ignore the last post. It's pretty clear from a strict application of your rules that the "Staves" shouldn't include subtitles in the ToC, though perhaps we can allow an exception for this one work?

drgrigg commented 5 years ago

Hmmm, I'm finding that generally in the corpus I'm now getting a LOT of mismatches where I say there should be a subtitle for a part or division and the existing ToC doesn't have them. However, I'm comforted by the fact that in the couple of cases I've spot-checked, lint also complains about the lack of subtitles for these items in the ToC.

Eg, https://github.com/standardebooks/jules-verne_the-mysterious-island.git.

Guess once we're confident about print-toc doing the right thing I can go back and substitute my own generated ToC in such cases.

acabal commented 5 years ago

Don't worry too much about the existing corpus; a lot of those entries are currently incorrect because I ran the tool over them before I realized the mistake.

Now that I look at the comments in lint, I see it doesn't exactly implement my note above. For now that's OK. Let's follow what lint does for now and we can tweak things later if need be.

For the staves, that is one of those unusual exceptions where we would simply edit it by hand and ignore lint warnings. They're essentially chapters, but purposely titled "staves" to keep with the musical theme of the work. Therefore we want to keep them and treat them as chapters, and thus keep them in the ToC, but that doesn't fit into the strict rules/lint template, and that's OK.

drgrigg commented 5 years ago

Sure. Print-toc is always going to need some manual tweaking for particular cases. Poetry is another example where the ToC will generally have to be adjusted by hand.

I'm having fun doing all this, by the way. Really enjoying this process, and I keep on finding ways to improve the code in various ways, or find out from your own code bits of Python or Beautiful Soup which I hadn't twigged to before : soup.get("attr"), for example! Doh!

acabal commented 5 years ago

Fixed in #152