jgm / pandoc

Universal markup converter
https://pandoc.org

html to fb2 omits <h2> elements #8123

Closed phil294 closed 2 years ago

phil294 commented 2 years ago

Explain the problem. Try some Wikipedia article:

pandoc -o rock.fb2 https://en.wikipedia.org/wiki/Rock_castle

and view the fb2 file: The subheaders (e.g. "Rock-hewn castles") are missing.

Pandoc version? 2.17.1.1 on Manjaro (Arch) Linux

Side notes:

Another problem: image captions are missing, but since they don't follow semantic standards and are plain <div> text elements, Wikipedia is to blame here.

Finally, it would be nice if the <img> width and height attributes were respected, or maybe this could be done via CSS somehow?

Locally, I solved 1. and 2. via regex hacking and 3. with imagemagick cmd line tools before converting.

jgm commented 2 years ago

@astanin - can you see what is happening here?

astanin commented 2 years ago

@jgm I'm not using this feature anymore but I suppose that the problem is how Wikipedia HTML is parsed.

FB2 does not have an equivalent of the h2 tag. It provides only a <title> within a <section>, but it allows nested sections. I think this becomes clearer if we look at the body of an FB2 document (source):

  <body>
    <section>
      <p>Frontispiece, with the caption: "He examined with his glass the word
        upon the wall, going over every letter of it with the most minute
        exactness." (<emphasis>Page</emphasis> 23.)</p>
    </section>
    <section>
      <title><p>PART I.</p></title>
      <section>
        <p>(<emphasis>Being a reprint from the reminiscences of</emphasis> JOHN
          H. WATSON, M.D.,<emphasis> late of the Army Medical
            Department.</emphasis>) <a xlink:href="#N2" type="note">2</a></p>
      </section>
      <section>
        <title><p>CHAPTER I. MR. SHERLOCK HOLMES.</p></title>
        <p>IN the year 1878 I took my degree of Doctor of Medicine of the
          University of London, and proceeded to Netley to go through the
          course prescribed for surgeons in the army. Having completed my
          studies there, I was duly attached to the Fifth Northumberland
          Fusiliers as Assistant Surgeon. The regiment was stationed in India
          at the time, and before I could join it, the second Afghan war had
          broken out. On landing at Bombay, I learned that my corps had
          advanced through the passes, and was already deep in the enemy's
          country. I followed, however, with many other officers who were in
          the same situation as myself, and succeeded in reaching Candahar in
          safety, where I found my regiment, and at once entered upon my new
          duties.</p>
        <p>The campaign brought honours and promotion to many, but for me it
          had nothing but misfortune and disaster. I was removed from my
          brigade and attached to the Berkshires, with whom I served at the
          fatal battle of Maiwand. There I was struck on the shoulder by a
          Jezail bullet, which shattered the bone and grazed the subclavian
          artery. I should have fallen into the hands of the murderous Ghazis
          had it not been for the devotion and courage shown by Murray, my
          orderly, who threw me across a pack-horse, and succeeded in bringing
          me safely to the British lines.</p>
      </section>
    </section>
  </body>

The FB2 writer apparently assumes that a section is represented by a Div block with a "section" class and a Header as its first nested block. As long as the HTML can be parsed into this structure, the FB2 converter should work fine.
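
To make the assumed shape concrete, here is a minimal Python sketch of it. This is not pandoc's Haskell code: the `Header`, `Para` and `Div` classes and the `render` function are hypothetical stand-ins that only model "a Div with a section class whose first block is a Header becomes a <section> with a <title>".

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Header:
    level: int
    text: str


@dataclass
class Para:
    text: str


@dataclass
class Div:
    classes: list
    blocks: list


def render(block):
    """Render a block to FB2-like XML (hypothetical, heavily simplified)."""
    if isinstance(block, Para):
        return f"<p>{escape(block.text)}</p>"
    if isinstance(block, Div) and "section" in block.classes:
        body = block.blocks
        title = ""
        # The leading Header, if present, becomes the section title.
        if body and isinstance(body[0], Header):
            title = f"<title><p>{escape(body[0].text)}</p></title>"
            body = body[1:]
        return "<section>" + title + "".join(render(b) for b in body) + "</section>"
    # A Header outside this shape has no FB2 element to map to in this model.
    raise TypeError(f"unsupported block: {block!r}")


part = Div(["section"], [
    Header(1, "PART I."),
    Div(["section"], [Header(2, "CHAPTER I."), Para("Text.")]),
])
print(render(part))
```

In this model a Header that is not the first block of a section Div simply has nowhere to go, which is the situation described for the Wikipedia input.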

From what I can see from the native format output, Wikipedia HTML is parsed differently. Looking at the beginning of one of the sections:

    ,Div ("",["thumb","tright"],[])
     [Div ("",["thumbinner"],[("style","width:222px;")])
      [Plain [Link ("",["image"],[]) [Image ("",["thumbimage"],[("width","220"),("height","165"),("srcset","//upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Burg_Rotenhan.jpg/330px-Burg_Rotenhan.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Burg_Rotenhan.jpg/440px-Burg_Rotenhan.jpg 2x")]) [] ("//upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Burg_Rotenhan.jpg/220px-Burg_Rotenhan.jpg","")] ("/wiki/File:Burg_Rotenhan.jpg","")]
      ,Div ("",["thumbcaption"],[])
       [Div ("",["magnify"],[])
        [Plain [Link ("",["internal"],[]) [] ("/wiki/File:Burg_Rotenhan.jpg","Enlarge")]]
       ,Plain [Str "The",Space,Str "gateway",Space,Str "to",Space,Link ("",[],[]) [Str "Rotenhan",Space,Str "Castle"] ("/wiki/Rotenhan_Castle","Rotenhan Castle"),Str ",",Space,Str "which",Space,Str "was",Space,Str "entirely",Space,Str "hewn",Space,Str "out",Space,Str "of",Space,Str "the",Space,Str "sandstone"]]]]
    ,Header 2 ("rock-hewn-castlesedit",[],[]) [Span ("Rock-hewn_castles",["mw-headline"],[]) [Str "Rock-hewn",Space,Str "castles"],Span ("",["mw-editsection"],[]) [Span ("",["mw-editsection-bracket"],[]) [Str "["],Link ("",[],[]) [Str "edit"] ("/w/index.php?title=Rock_castle&action=edit&section=2","Edit section: Rock-hewn castles"),Span ("",["mw-editsection-bracket"],[]) [Str "]"]]]
    ,Para [Str "Castle",Space,Str "researcher",Space,Link ("",[],[]) [Str "Otto",Space,Str "Piper"] ("/wiki/Otto_Piper","Otto Piper"),Space,Str "used",Space,Str "the",Space,Str "German",Space,Str "phrase",Space,Emph [Str "ausgehauene",Space,Str "Burg"],Space,Str "(literally:",Space,Str "\"carved-out",Space,Str "castle\")",Space,Str "for",Space,Str "castles",Space,Str "that",Space,Str "had",Space,Str "rooms",Space,Str "artificially",Space,Str "hewn",Space,Str "out",Space,Str "of",Space,Str "the",Space,Str "rock",Space,Str "on",Space,Str "which",Space,Str "the",Space,Str "castle",Space,Str "stood.",Superscript [Link ("",[],[]) [Str "[1]"] ("#cite_note-1","")],Space,Str "His",Space,Str "examples",Space,Str "of",Space,Str "such",Space,Str "rock-hewn",Space,Str "castles",Space,Str "include",Space,Link ("",["mw-redirect"],[]) [Str "Fleckenstein"] ("/wiki/Fleckenstein_Castle","Fleckenstein Castle"),Str ",",Space,Link ("",[],[]) [Str "Trifels"] ("/wiki/Trifels_Castle","Trifels Castle"),Space,Str "and",Space,Link ("",["mw-redirect"],[]) [Str "Altwindstein"] ("/wiki/Altwindstein_Castle","Altwindstein Castle"),Str ".",Space,Str "From",Space,Str "a",Space,Str "constructional",Space,Str "point",Space,Str "of",Space,Str "view",Space,Str "there",Space,Str "is",Space,Str "a",Space,Str "close",Space,Str "relationship",Space,Str "with",Space,Link ("",[],[]) [Str "cave",Space,Str "castles"] ("/wiki/Cave_castle","Cave castle"),Str ",",Space,Str "which",Space,Str "are",Space,Str "also",Space,Str "often",Space,Str "enhanced",Space,Str "with",Space,Str "rooms",Space,Str "artificially",Space,Str "cut",Space,Str "out",Space,Str "of",Space,Str "the",Space,Str "rock."]

So it appears that the Header is not even inside a Div with the "section" class.

As a workaround, I would suggest clicking the page's Edit link in Wikipedia, copying the MediaWiki markup to a file, and converting that file instead of the rendered HTML. As a more permanent solution, the assumption about how a section is represented may have to be revised.
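
A rough Python sketch of that workaround follows. It swaps the manual copy step for MediaWiki's `action=raw` URL, which is an assumption on my part and not something suggested in this thread. The helper names `raw_wikitext_url` and `pandoc_command` are hypothetical; the sketch only builds the URL and the pandoc argument list and does not download or convert anything.

```python
from urllib.parse import urlencode


def raw_wikitext_url(title, host="en.wikipedia.org"):
    """Build the URL that serves a page's MediaWiki markup (assumes action=raw)."""
    query = urlencode({"title": title, "action": "raw"})
    return f"https://{host}/w/index.php?{query}"


def pandoc_command(wiki_file, out_file):
    """Argument list for converting saved MediaWiki markup to FB2 with pandoc."""
    return ["pandoc", "-f", "mediawiki", "-t", "fb2", "-o", out_file, wiki_file]


print(raw_wikitext_url("Rock_castle"))
print(" ".join(pandoc_command("rock.wiki", "rock.fb2")))
```

The markup would be saved to `rock.wiki` (a file name chosen here for illustration) before running the printed command.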

jgm commented 2 years ago

@astanin I don't think that's the heart of it. The FB2 writer has never expected the AST to be structured into sections. It starts by applying a function renderSections that converts a regular AST into this section Div structure.
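
For readers unfamiliar with this kind of transformation, here is a minimal Python sketch of turning a flat block list into nested section Divs. It is a simplified model under my own assumptions, not pandoc's implementation: `nest_sections` and the block classes are hypothetical, and the sketch does not look inside existing Divs.

```python
from dataclasses import dataclass


@dataclass
class Header:
    level: int
    text: str


@dataclass
class Para:
    text: str


@dataclass
class Div:
    classes: list
    blocks: list


def nest_sections(blocks):
    """Wrap each Header and the blocks that follow it in a section Div.

    A section runs until the next Header of the same or a higher level
    (that is, a numerically smaller or equal level).
    """
    out = []
    i = 0
    while i < len(blocks):
        block = blocks[i]
        if isinstance(block, Header):
            j = i + 1
            while j < len(blocks) and not (
                isinstance(blocks[j], Header) and blocks[j].level <= block.level
            ):
                j += 1
            # Recurse so that deeper headers become nested sections.
            out.append(Div(["section"], [block] + nest_sections(blocks[i + 1:j])))
            i = j
        else:
            out.append(block)
            i += 1
    return out


flat = [Para("intro"), Header(1, "A"), Para("a"), Header(2, "A.1"), Para("a1")]
print(nest_sections(flat))
```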

tarleb commented 2 years ago

It seems that this happens whenever the content is wrapped in a div and doesn't start with a header. Example:

::: wrapper
hello

# MISSING

section one
:::

Output of pandoc -t fb2 for this Markdown, tidied up for readability. Note that the first-level heading is missing.

<?xml version="1.0" encoding="utf-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0"
xmlns:l="http://www.w3.org/1999/xlink">
  <description>
    <title-info>
      <genre>unrecognised</genre>
    </title-info>
    <document-info>
      <program-used>pandoc</program-used>
    </document-info>
  </description>
  <body>
    <title>
      <p />
    </title>
    <section>
      <p>hello</p>
      <p>section one</p>
    </section>
  </body>
</FictionBook>

jgm commented 2 years ago

I think what's going on is that makeSections just leaves this Div alone, so the expectation of the fb2 writer (which is that after makeSections every Div will have the structure described above) is incorrect?
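
The effect can be reproduced with a small Python model. Everything here is a hypothetical sketch: the classes, `render_block`, `render_section` and the `sections_anywhere` flag are illustrative names, and I am assuming (not asserting) that the real writer behaves analogously. With the flag off, a section Div is only recognised directly under another section, so a section Div inside a plain wrapper Div is flattened and its Header is dropped. With the flag on, section Divs are recognised wherever they occur.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Header:
    level: int
    text: str


@dataclass
class Para:
    text: str


@dataclass
class Div:
    classes: list
    blocks: list
    ident: str = ""


def is_section(block):
    return isinstance(block, Div) and "section" in block.classes


def render_block(block, sections_anywhere):
    if isinstance(block, Para):
        return f"<p>{escape(block.text)}</p>"
    if isinstance(block, Header):
        return ""  # a stray header has no element of its own in this model
    if isinstance(block, Div):
        if sections_anywhere and is_section(block):
            return render_section(block, sections_anywhere)
        # Plain Div: just render its children in place.
        return "".join(render_block(b, sections_anywhere) for b in block.blocks)
    raise TypeError(f"unsupported block: {block!r}")


def render_section(div, sections_anywhere):
    blocks = list(div.blocks)
    title = ""
    if blocks and isinstance(blocks[0], Header):
        head, blocks = blocks[0], blocks[1:]
        if head.text:
            title = f"<title><p>{escape(head.text)}</p></title>"
    parts = [
        render_section(b, sections_anywhere) if is_section(b)
        else render_block(b, sections_anywhere)
        for b in blocks
    ]
    ident = f' id="{escape(div.ident)}"' if div.ident else ""
    return f"<section{ident}>{title}{''.join(parts)}</section>"


# Assumed structure for the wrapper example: a top-level section whose only
# child is the wrapper Div, which in turn holds a paragraph and a section Div.
tree = Div(["section"], [
    Header(1, ""),
    Div(["wrapper"], [
        Para("hello"),
        Div(["section"], [Header(1, "MISSING"), Para("section one")], ident="missing"),
    ]),
])
print(render_section(tree, sections_anywhere=False))
print(render_section(tree, sections_anywhere=True))
```

With the flag off, the model prints a single section containing only the two paragraphs, which matches the symptom reported above; with it on, the "MISSING" title survives inside a nested section.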

Here's the result of putting a trace on the block structure produced by makeSections in the FB2 writer:

[ Div
    ( "" , [ "section" ] , [] )
    [ Header 1 ( "" , [] , [] ) []
    , Div
        ( "" , [ "wrapper" ] , [] )
        [ Para [ Str "hello" ]
        , Div
            ( "missing" , [ "section" ] , [] )
            [ Header 1 ( "" , [] , [] ) [ Str "MISSING" ]
            , Para [ Str "section" , Space , Str "one" ]
            ]
        ]
    ]
]

jgm commented 2 years ago

Pushed a potential fix, but I don't know enough about FB2 to know if this is right. Output for @tarleb's snippet would be

<?xml version="1.0" encoding="UTF-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0" xmlns:l="http://www.w3.org/1999/xlink"><description><title-info><genre>unrecognised</genre></title-info><document-info><program-used>pandoc</program-used></document-info></description><body><title><p /></title><section><p>hello</p><section id="missing"><title><p>MISSING</p></title><p>section one</p></section></section></body></FictionBook>

jgm commented 2 years ago

Is FB2 okay with a section element without a title element?

tarleb commented 2 years ago

The FB2 schema that I found says minOccurs="0" for titles in sections, so it seems that title-less sections are ok.
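
If someone wants to see which sections in a generated file end up without a title, a small Python sketch like the following could help. It uses only the standard library; `untitled_sections` is a hypothetical helper name, and the sample string is a trimmed version of the output posted above. It lists title-less sections and does not validate against the FB2 schema.

```python
import xml.etree.ElementTree as ET

FB2_NS = "http://www.gribuser.ru/xml/fictionbook/2.0"


def untitled_sections(fb2_xml):
    """Return all <section> elements that have no direct <title> child."""
    root = ET.fromstring(fb2_xml)

    def tag(name):
        return f"{{{FB2_NS}}}{name}"

    return [s for s in root.iter(tag("section")) if s.find(tag("title")) is None]


sample = (
    '<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">'
    "<body><title><p /></title><section><p>hello</p>"
    '<section id="missing"><title><p>MISSING</p></title><p>section one</p></section>'
    "</section></body></FictionBook>"
)
for section in untitled_sections(sample):
    print("section without title, id =", section.get("id"))
```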

jgm commented 2 years ago

Closing this, then.