jgm / pandoc

Universal markup converter
https://pandoc.org

Problem with missing TOC for <h1> preceded by <div></div> #8996

Open andrzejQ opened 1 year ago

andrzejQ commented 1 year ago

Problem with missing TOC for <h1> preceded by <div></div>.

When <div></div> is removed, the TOC is created, although the TOC link for <h2>-- h2 --</h2> does not work properly.

Example - inp.html:

<!doctype html><html lang="en">
<head>  <meta charset="utf-8"><title>HTML5 Template</title></head>
<body>

<div>
          <div></div> <!-- TOC is missing -->
<div>
  <h1>-- h1 --</h1>
    <p>abcd</p>
    <h2>-- h2 --</h2>
      <p>efgh</p>
</div>

</div>

<div>
  <h1>-- last h1 --</h1>
    <h2>-- last h2 --</h2>
      <p>ijkl</p>
</div>

</body>
</html>

Exact command line (pandoc 3.1.6.1, Windows 11 cmd): pandoc -f html -t epub3 --toc -o book.epub inp.html

andrzejQ commented 1 year ago

The problem started with the pandoc v3 version.

Canuck317 commented 1 year ago

This may be similar to an issue that I've been having with converting from a list of Markdown files to epub3 in pandoc v3 and up. The ebook is generated as before, but none of the links in the TOC work. Digging into the generated epubs, there are two differences:

First, prior to v3 each markdown file had its own xhtml file in the epub. After v3, all the text is in a single ch001.xhtml file.

Second, prior to v3 the name of the xhtml file was part of the link in the TOC. With v3 it appears the filename is missing from the link, which is why the link doesn't work. Manually inserting 'ch001.xhtml' before the section tag fixes the link.

v3+

TOC link fails: <a href="text/#c__traks__ebooks__tossed__tossed_2.md__antares-or-bust">2 - Antares or Bust</a>

TOC link fixed: <a href="text/ch001.xhtml#c__traks__ebooks__tossed__tossed_2.md__antares-or-bust">2 - Antares or Bust</a>

Section ID: <div id="c__traks__ebooks__tossed__tossed_2.md"><section id="c__traks__ebooks__tossed__tossed_2.md__antares-or-bust" class="level1">

v2.19 (works)

TOC link: <a href="text/ch003.xhtml#antares-or-bust">2 - Antares or Bust</a>

Section ID: <section id="antares-or-bust" class="level1" data-number="3">
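The manual fix described above (inserting the chapter filename before the fragment) could be automated roughly as follows. This is only a sketch of a workaround, not anything pandoc provides: the function name `add_chapter_file` is hypothetical, and it assumes, as reported in this comment, that all content ended up in a single ch001.xhtml file.

```python
import re


def add_chapter_file(nav_html: str, chapter: str = "ch001.xhtml") -> str:
    """Insert a chapter filename into TOC links of the form href="text/#...".

    Assumption: every broken link targets the same chapter file, which is
    what the report above describes for pandoc v3 output.
    """
    # Only links where the directory is followed directly by '#' are touched;
    # links that already name a file (text/ch003.xhtml#...) are left alone.
    return re.sub(r'href="text/#', f'href="text/{chapter}#', nav_html)


broken = ('<a href="text/#c__traks__ebooks__tossed__tossed_2.md__antares-or-bust">'
          '2 - Antares or Bust</a>')
print(add_chapter_file(broken))
```

The function would be applied to the navigation document inside the unpacked EPUB before re-zipping it; where that file lives is not stated in this thread.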


jgm commented 1 year ago

@Canuck317

First, prior to v3 each markdown file had its own xhtml file in the epub. After v3, all the text is in a single ch001.xhtml file.

This is probably the same issue: In pandoc 3.0 we introduced a new module Text.Pandoc.Chunks with functions splitIntoChunks and toTOCTree, which we now use in both EPUB and chunked HTML generation. This may have changed behavior in some cases with divs.

Second, prior to v3 the name of the xhtml file was part of the link in the TOC. With v3 it appears the filename is missing from the link, which is why the link doesn't work. Manually inserting 'ch001.xhtml' before the section tag fixes the link.

This sounds like a separate problem; please report it in a separate issue with a small reproducible example. No matter how the document is segmented into sections or chapters, the links in the TOC should point to the right place.

jgm commented 1 year ago

Here is the intermediate structure produced by splitIntoChunks from the input above:

ChunkedDoc
  { chunkedMeta =
      Meta
        { unMeta =
            fromList
              [ ( "lang" , MetaInlines [ Str "en" ] )
              , ( "title"
                , MetaInlines [ Str "HTML5" , Space , Str "Template" ]
                )
              ]
        }
  , chunkedTOC =
      Node
        { rootLabel =
            SecInfo
              { secTitle = []
              , secNumber = Nothing
              , secId = ""
              , secPath = "#"
              , secLevel = 0
              }
        , subForest =
            [ Node
                { rootLabel =
                    SecInfo
                      { secTitle = [ Str "HTML5" , Space , Str "Template" ]
                      , secNumber = Nothing
                      , secId = "html5-template"
                      , secPath = "ch001.xhtml"
                      , secLevel = 1
                      }
                , subForest = []
                }
            , Node
                { rootLabel =
                    SecInfo
                      { secTitle =
                          [ Str "--"
                          , Space
                          , Str "last"
                          , Space
                          , Str "h1"
                          , Space
                          , Str "--"
                          ]
                      , secNumber = Nothing
                      , secId = "last-h1---"
                      , secPath = "ch002.xhtml"
                      , secLevel = 1
                      }
                , subForest =
                    [ Node
                        { rootLabel =
                            SecInfo
                              { secTitle =
                                  [ Str "--"
                                  , Space
                                  , Str "last"
                                  , Space
                                  , Str "h2"
                                  , Space
                                  , Str "--"
                                  ]
                              , secNumber = Nothing
                              , secId = "last-h2---"
                              , secPath = "ch002.xhtml#last-h2---"
                              , secLevel = 2
                              }
                        , subForest = []
                        }
                    ]
                }
            ]
        }
  , chunkedChunks =
      [ Chunk
          { chunkHeading = [ Str "HTML5" , Space , Str "Template" ]
          , chunkId = "html5-template"
          , chunkLevel = 1
          , chunkNumber = 1
          , chunkSectionNumber = Nothing
          , chunkPath = "ch001.xhtml"
          , chunkUp = Nothing
          , chunkPrev = Nothing
          , chunkNext =
              Just
                Chunk
                  { chunkHeading =
                      [ Str "--"
                      , Space
                      , Str "last"
                      , Space
                      , Str "h1"
                      , Space
                      , Str "--"
                      ]
                  , chunkId = "last-h1---"
                  , chunkLevel = 1
                  , chunkNumber = 2
                  , chunkSectionNumber = Nothing
                  , chunkPath = "ch002.xhtml"
                  , chunkUp = Nothing
                  , chunkPrev =
                      Just
                        Chunk
                          { chunkHeading = [ Str "HTML5" , Space , Str "Template" ]
                          , chunkId = "html5-template"
                          , chunkLevel = 1
                          , chunkNumber = 1
                          , chunkSectionNumber = Nothing
                          , chunkPath = "ch001.xhtml"
                          , chunkUp = Nothing
                          , chunkPrev = Nothing
                          , chunkNext = Nothing
                          , chunkUnlisted = False
                          , chunkContents =
                              [ Div
                                  ( "html5-template" , [ "section" , "unnumbered" ] , [] )
                                  [ Header
                                      1
                                      ( "" , [ "unnumbered" ] , [] )
                                      [ Str "HTML5" , Space , Str "Template" ]
                                  , Div
                                      ( "" , [] , [] )
                                      [ Div ( "" , [] , [] ) []
                                      , Div
                                          ( "h1---" , [ "section" ] , [] )
                                          [ Header
                                              1
                                              ( "" , [] , [] )
                                              [ Str "--" , Space , Str "h1" , Space , Str "--" ]
                                          , Para [ Str "abcd" ]
                                          , Div
                                              ( "h2---" , [ "section" ] , [] )
                                              [ Header
                                                  2
                                                  ( "" , [] , [] )
                                                  [ Str "--" , Space , Str "h2" , Space , Str "--" ]
                                              , Para [ Str "efgh" ]
                                              ]
                                          ]
                                      ]
                                  ]
                              ]
                          }
                  , chunkNext = Nothing
                  , chunkUnlisted = False
                  , chunkContents =
                      [ Div
                          ( "last-h1---" , [ "section" ] , [] )
                          [ Header
                              1
                              ( "" , [] , [] )
                              [ Str "--"
                              , Space
                              , Str "last"
                              , Space
                              , Str "h1"
                              , Space
                              , Str "--"
                              ]
                          , Div
                              ( "last-h2---" , [ "section" ] , [] )
                              [ Header
                                  2
                                  ( "" , [] , [] )
                                  [ Str "--"
                                  , Space
                                  , Str "last"
                                  , Space
                                  , Str "h2"
                                  , Space
                                  , Str "--"
                                  ]
                              , Para [ Str "ijkl" ]
                              ]
                          ]
                      ]
                  }
          , chunkUnlisted = False
          , chunkContents =
              [ Div
                  ( "html5-template" , [ "section" , "unnumbered" ] , [] )
                  [ Header
                      1
                      ( "" , [ "unnumbered" ] , [] )
                      [ Str "HTML5" , Space , Str "Template" ]
                  , Div
                      ( "" , [] , [] )
                      [ Div ( "" , [] , [] ) []
                      , Div
                          ( "h1---" , [ "section" ] , [] )
                          [ Header
                              1
                              ( "" , [] , [] )
                              [ Str "--" , Space , Str "h1" , Space , Str "--" ]
                          , Para [ Str "abcd" ]
                          , Div
                              ( "h2---" , [ "section" ] , [] )
                              [ Header
                                  2
                                  ( "" , [] , [] )
                                  [ Str "--" , Space , Str "h2" , Space , Str "--" ]
                              , Para [ Str "efgh" ]
                              ]
                          ]
                      ]
                  ]
              ]
          }
      , Chunk
          { chunkHeading =
              [ Str "--"
              , Space
              , Str "last"
              , Space
              , Str "h1"
              , Space
              , Str "--"
              ]
          , chunkId = "last-h1---"
          , chunkLevel = 1
          , chunkNumber = 2
          , chunkSectionNumber = Nothing
          , chunkPath = "ch002.xhtml"
          , chunkUp = Nothing
          , chunkPrev =
              Just
                Chunk
                  { chunkHeading = [ Str "HTML5" , Space , Str "Template" ]
                  , chunkId = "html5-template"
                  , chunkLevel = 1
                  , chunkNumber = 1
                  , chunkSectionNumber = Nothing
                  , chunkPath = "ch001.xhtml"
                  , chunkUp = Nothing
                  , chunkPrev = Nothing
                  , chunkNext = Nothing
                  , chunkUnlisted = False
                  , chunkContents =
                      [ Div
                          ( "html5-template" , [ "section" , "unnumbered" ] , [] )
                          [ Header
                              1
                              ( "" , [ "unnumbered" ] , [] )
                              [ Str "HTML5" , Space , Str "Template" ]
                          , Div
                              ( "" , [] , [] )
                              [ Div ( "" , [] , [] ) []
                              , Div
                                  ( "h1---" , [ "section" ] , [] )
                                  [ Header
                                      1
                                      ( "" , [] , [] )
                                      [ Str "--" , Space , Str "h1" , Space , Str "--" ]
                                  , Para [ Str "abcd" ]
                                  , Div
                                      ( "h2---" , [ "section" ] , [] )
                                      [ Header
                                          2
                                          ( "" , [] , [] )
                                          [ Str "--" , Space , Str "h2" , Space , Str "--" ]
                                      , Para [ Str "efgh" ]
                                      ]
                                  ]
                              ]
                          ]
                      ]
                  }
          , chunkNext = Nothing
          , chunkUnlisted = False
          , chunkContents =
              [ Div
                  ( "last-h1---" , [ "section" ] , [] )
                  [ Header
                      1
                      ( "" , [] , [] )
                      [ Str "--"
                      , Space
                      , Str "last"
                      , Space
                      , Str "h1"
                      , Space
                      , Str "--"
                      ]
                  , Div
                      ( "last-h2---" , [ "section" ] , [] )
                      [ Header
                          2
                          ( "" , [] , [] )
                          [ Str "--"
                          , Space
                          , Str "last"
                          , Space
                          , Str "h2"
                          , Space
                          , Str "--"
                          ]
                      , Para [ Str "ijkl" ]
                      ]
                  ]
              ]
          }
      ]
  }
jgm commented 1 year ago

In general, when pandoc tries to split a document into sections, it treats divs as opaque blobs. (Often this is just what you want, e.g. if the div is a callout.) The exception is when you have the structure

Div
  Header

You have this structure for "-- last h1 --" but not for "-- h1 --", because the initial empty div interferes with it. Hope that explains the behavior.
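The rule described in this comment can be illustrated with a toy model. The sketch below is not pandoc's implementation (pandoc is written in Haskell and its real logic lives in Text.Pandoc.Chunks); the classes `Header`, `Para`, `Div` and the function `toc_entries` are hypothetical stand-ins that only mimic the stated heuristic: a div is looked into only when its first child is a header, otherwise it is treated as an opaque blob.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Header:
    level: int
    text: str


@dataclass
class Para:
    text: str


@dataclass
class Div:
    children: List["Block"] = field(default_factory=list)


Block = Union[Header, Para, Div]


def toc_entries(blocks: List[Block]) -> List[Tuple[int, str]]:
    """Collect (level, title) pairs, treating most divs as opaque."""
    entries = []
    for block in blocks:
        if isinstance(block, Header):
            entries.append((block.level, block.text))
        elif isinstance(block, Div):
            # Only a div that *begins* with a header counts as a section.
            if block.children and isinstance(block.children[0], Header):
                entries.extend(toc_entries(block.children))
            # Any other div is skipped entirely, headers inside included.
    return entries


# The body of inp.html from the report, reduced to this toy model.
doc = [
    Div([
        Div([]),  # the empty <div></div>
        Div([Header(1, "-- h1 --"), Para("abcd"),
             Header(2, "-- h2 --"), Para("efgh")]),
    ]),
    Div([Header(1, "-- last h1 --"), Header(2, "-- last h2 --"), Para("ijkl")]),
]

print(toc_entries(doc))
# -> [(1, '-- last h1 --'), (2, '-- last h2 --')]
```

In this model the first outer div starts with another div rather than a header, so nothing inside it reaches the TOC, which matches the chunkedTOC shown earlier apart from the entry generated from the document title.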

jgm commented 1 year ago

I don't think this is a bug. It may well be a behavior change from versions < 3, but I don't see anything that should be changed. (There are changes that would restore your expected output for this input, but they would likely break other things.)

andrzejQ commented 1 year ago

Is version 3 meant only for converting HTML sources specially prepared for pandoc, rather than arbitrary HTML from the Internet, as version 2 could?

jgm commented 1 year ago

Pandoc can be used to convert any HTML source. But it needs to use some heuristic in splitting documents into sections.

andrzejQ commented 1 year ago

The HTML example in the input above is a real HTML page reduced to a minimum. <div></div> stands for the content above the title <h1>; it should be ignored in the TOC.

And once <h2>-- h2 --</h2> appears in the TOC, it should link to <h2> (or <div><h2>) rather than to the section above <h1>.

That's why I made this report.

jgm commented 1 year ago

Yes -- I do understand that it's not working in the way you'd like in this case!

And there are possible changes in pandoc that would make this case work the way you expect -- at the expense of making other cases break. Heuristics are never 100% accurate.

For all pandoc knows, maybe the first div is meant to be a sidebar or something else that is not in the main structure?

andrzejQ commented 1 year ago

I guess the same problem can be observed for pandoc -f html -t epub3 --toc -o book.epub https://github.com/jgm/pandoc with v.2.19.2 versus v.3.1.6.1

andrzejQ commented 1 year ago

Is it possible to make the parts of <div></div> that do not contain <hN> recursively be ignored before the TOC evaluation starts?

jgm commented 1 year ago

@andrzejQ yes, I think that should be possible. I'll look into it.

The main problem isn't the presence of hN elements inside the inner div, but the fact that the outer div doesn't begin with an hN. (Any other element, such as a <p>, between the <div> and the header would be equally bad, because it means that we can't regard this div as a single section -- sections must begin with a title.)

andrzejQ commented 1 year ago

I wonder whether it would be worth setting a task to improve pandoc v3 so that the EPUB TOC for https://pandoc.org/MANUAL.html is similar to the one produced by version 2.

pandoc -f html -t epub3 --toc -o man.epub https://pandoc.org/MANUAL.html

Currently the result is:

man.2.19.2.epub TOC:

Pandoc - Pandoc User’s Guide
Pandoc User’s Guide
Synopsis
Description
   Using pandoc
   Specifying formats
   Character encoding
   Creating a PDF
...

man.3.1.6.2.epub TOC:

Pandoc - Pandoc User’s Guide
jgm commented 1 year ago

The structure there is a more common one:

Div container
  Div main
    Header 1 etc.
  Div sidebar
    etc.
  Div search results

I agree, it would be nice to make this work better.

One thought: in general, it seems to me that in this kind of conversion you really only want the contents of <main> -- everything else is a distraction that doesn't belong in the ebook. So, offering an option to focus parsing on <main> might be a good improvement.
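Until such an option exists, the same effect can be approximated by preprocessing the HTML before it reaches pandoc. The following is a minimal sketch using only Python's standard `html.parser`; the class `MainExtractor` and the function `extract_main` are hypothetical names, and the code assumes reasonably well-formed markup with explicit closing tags.

```python
from html.parser import HTMLParser


class MainExtractor(HTMLParser):
    """Re-emit only the markup found inside <main> ... </main>."""

    def __init__(self):
        # Keep entity and character references as written in the source.
        super().__init__(convert_charrefs=False)
        self.depth = 0  # how many <main> elements we are currently inside
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.out.append(self.get_starttag_text())
        if tag == "main":
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        if self.depth > 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "main" and self.depth > 0:
            self.depth -= 1
        if self.depth > 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth > 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth > 0:
            self.out.append(f"&#{name};")


def extract_main(html: str) -> str:
    parser = MainExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = "<body><nav>x</nav><main><h1>T</h1><p>b</p></main><footer>f</footer></body>"
print(extract_main(page))
# -> <h1>T</h1><p>b</p>
```

The extracted fragment can then be saved to a file and converted with the same pandoc command line used elsewhere in this thread.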

andrzejQ commented 1 year ago

an option to focus parsing on <main> might be a good improvement.

Yes, that's true, although a TOC is not really needed for a single website.

Let's take this scenario:

We want to transfer a weekly or monthly web magazine with several dozen large articles to a Kindle. Using a web browser plug-in such as WebScrapBook (which also works without problems after a more or less standard login to the magazine's site), we download all the articles. (Such a plug-in lets us limit the content to <main>.) We merge the articles into one HTML file using a script or the plug-in itself. We convert the result to EPUB using pandoc. Pandoc v2 performs very well in this case.

It's probably worth keeping <div> from disturbing the TOC in pandoc v3.
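As a stopgap on the user side, the wrapper divs can be removed from the merged HTML before conversion, so that headers are no longer hidden inside divs that do not begin with a header. The sketch below is an assumption-laden workaround, not a pandoc feature: `DivStripper` and `strip_divs` are hypothetical names, only the <div> tags themselves are dropped (their contents are kept), and any div that was meant as a callout or sidebar loses its grouping.

```python
from html.parser import HTMLParser


class DivStripper(HTMLParser):
    """Re-emit the document unchanged except for <div> and </div> tags."""

    def __init__(self):
        # Keep entity and character references as written in the source.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "div":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "div":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_divs(html: str) -> str:
    parser = DivStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = "<div><div></div><div><h1>-- h1 --</h1><p>abcd</p></div></div>"
print(strip_divs(sample))
# -> <h1>-- h1 --</h1><p>abcd</p>
```

Whether the resulting TOC matches the v2 output for a given magazine would still need to be checked case by case.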