phil294 closed this issue 2 years ago
@astanin - can you see what is happening here?
@jgm I'm not using this feature anymore but I suppose that the problem is how Wikipedia HTML is parsed.
FB2 does not have an equivalent of the h2 tag. It provides only <title> within a <section>, but it allows nested sections. I think this becomes clearer if we look at the body of an FB2 document (source):
<body>
<section>
<p>Frontispiece, with the caption: "He examined with his glass the word
upon the wall, going over every letter of it with the most minute
exactness." (<emphasis>Page</emphasis> 23.)</p>
</section>
<section>
<title><p>PART I.</p></title>
<section>
<p>(<emphasis>Being a reprint from the reminiscences of</emphasis> JOHN
H. WATSON, M.D.,<emphasis> late of the Army Medical
Department.</emphasis>) <a xlink:href="#N2" type="note">2</a></p>
</section>
<section>
<title><p>CHAPTER I. MR. SHERLOCK HOLMES.</p></title>
<p>IN the year 1878 I took my degree of Doctor of Medicine of the
University of London, and proceeded to Netley to go through the
course prescribed for surgeons in the army. Having completed my
studies there, I was duly attached to the Fifth Northumberland
Fusiliers as Assistant Surgeon. The regiment was stationed in India
at the time, and before I could join it, the second Afghan war had
broken out. On landing at Bombay, I learned that my corps had
advanced through the passes, and was already deep in the enemy's
country. I followed, however, with many other officers who were in
the same situation as myself, and succeeded in reaching Candahar in
safety, where I found my regiment, and at once entered upon my new
duties.</p>
<p>The campaign brought honours and promotion to many, but for me it
had nothing but misfortune and disaster. I was removed from my
brigade and attached to the Berkshires, with whom I served at the
fatal battle of Maiwand. There I was struck on the shoulder by a
Jezail bullet, which shattered the bone and grazed the subclavian
artery. I should have fallen into the hands of the murderous Ghazis
had it not been for the devotion and courage shown by Murray, my
orderly, who threw me across a pack-horse, and succeeded in bringing
me safely to the British lines.</p>
</section>
</section>
</body>
The FB2 writer apparently assumes that a section is represented by a Div Block with a section attribute and a Header as the first nested Block. As long as the HTML can be parsed to this structure, the FB2 converter should work fine.
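The assumed shape can be written down as a small predicate. This is only a sketch on a toy model of the AST: the Header, Para and Div classes and the is_section_div helper are hypothetical stand-ins for illustration, not pandoc's own types or functions.

```python
from dataclasses import dataclass, field


# Toy stand-ins for pandoc's Block constructors (hypothetical, for illustration).
@dataclass
class Header:
    level: int
    text: str


@dataclass
class Para:
    text: str


@dataclass
class Div:
    classes: list
    blocks: list = field(default_factory=list)


def is_section_div(block):
    """True if block is a Div marked "section" whose first child is a Header."""
    return (
        isinstance(block, Div)
        and "section" in block.classes
        and len(block.blocks) > 0
        and isinstance(block.blocks[0], Header)
    )


# A Div shaped the way the writer is assumed to expect it.
chapter = Div(["section"], [Header(2, "CHAPTER I."), Para("IN the year 1878")])
# A Div like the Wikipedia thumbnail wrapper: no "section" mark, no leading Header.
thumb = Div(["thumb", "tright"], [Para("caption")])
```

Under this model, chapter satisfies the predicate and thumb does not, which is the distinction the paragraph above describes.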
From what I can see from the native format output, Wikipedia HTML is parsed differently. Looking at the beginning of one of the sections:
,Div ("",["thumb","tright"],[])
[Div ("",["thumbinner"],[("style","width:222px;")])
[Plain [Link ("",["image"],[]) [Image ("",["thumbimage"],[("width","220"),("height","165"),("srcset","//upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Burg_Rotenhan.jpg/330px-Burg_Rotenhan.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Burg_Rotenhan.jpg/440px-Burg_Rotenhan.jpg 2x")]) [] ("//upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Burg_Rotenhan.jpg/220px-Burg_Rotenhan.jpg","")] ("/wiki/File:Burg_Rotenhan.jpg","")]
,Div ("",["thumbcaption"],[])
[Div ("",["magnify"],[])
[Plain [Link ("",["internal"],[]) [] ("/wiki/File:Burg_Rotenhan.jpg","Enlarge")]]
,Plain [Str "The",Space,Str "gateway",Space,Str "to",Space,Link ("",[],[]) [Str "Rotenhan",Space,Str "Castle"] ("/wiki/Rotenhan_Castle","Rotenhan Castle"),Str ",",Space,Str "which",Space,Str "was",Space,Str "entirely",Space,Str "hewn",Space,Str "out",Space,Str "of",Space,Str "the",Space,Str "sandstone"]]]]
,Header 2 ("rock-hewn-castlesedit",[],[]) [Span ("Rock-hewn_castles",["mw-headline"],[]) [Str "Rock-hewn",Space,Str "castles"],Span ("",["mw-editsection"],[]) [Span ("",["mw-editsection-bracket"],[]) [Str "["],Link ("",[],[]) [Str "edit"] ("/w/index.php?title=Rock_castle&action=edit&section=2","Edit section: Rock-hewn castles"),Span ("",["mw-editsection-bracket"],[]) [Str "]"]]]
,Para [Str "Castle",Space,Str "researcher",Space,Link ("",[],[]) [Str "Otto",Space,Str "Piper"] ("/wiki/Otto_Piper","Otto Piper"),Space,Str "used",Space,Str "the",Space,Str "German",Space,Str "phrase",Space,Emph [Str "ausgehauene",Space,Str "Burg"],Space,Str "(literally:",Space,Str "\"carved-out",Space,Str "castle\")",Space,Str "for",Space,Str "castles",Space,Str "that",Space,Str "had",Space,Str "rooms",Space,Str "artificially",Space,Str "hewn",Space,Str "out",Space,Str "of",Space,Str "the",Space,Str "rock",Space,Str "on",Space,Str "which",Space,Str "the",Space,Str "castle",Space,Str "stood.",Superscript [Link ("",[],[]) [Str "[1]"] ("#cite_note-1","")],Space,Str "His",Space,Str "examples",Space,Str "of",Space,Str "such",Space,Str "rock-hewn",Space,Str "castles",Space,Str "include",Space,Link ("",["mw-redirect"],[]) [Str "Fleckenstein"] ("/wiki/Fleckenstein_Castle","Fleckenstein Castle"),Str ",",Space,Link ("",[],[]) [Str "Trifels"] ("/wiki/Trifels_Castle","Trifels Castle"),Space,Str "and",Space,Link ("",["mw-redirect"],[]) [Str "Altwindstein"] ("/wiki/Altwindstein_Castle","Altwindstein Castle"),Str ".",Space,Str "From",Space,Str "a",Space,Str "constructional",Space,Str "point",Space,Str "of",Space,Str "view",Space,Str "there",Space,Str "is",Space,Str "a",Space,Str "close",Space,Str "relationship",Space,Str "with",Space,Link ("",[],[]) [Str "cave",Space,Str "castles"] ("/wiki/Cave_castle","Cave castle"),Str ",",Space,Str "which",Space,Str "are",Space,Str "also",Space,Str "often",Space,Str "enhanced",Space,Str "with",Space,Str "rooms",Space,Str "artificially",Space,Str "cut",Space,Str "out",Space,Str "of",Space,Str "the",Space,Str "rock."]
So it appears that the Header is not even inside a Div with the "section" attribute.
As a workaround, I would suggest clicking the page's Edit link in Wikipedia, copying the MediaWiki markup to a file, and converting that file instead of the rendered HTML. As a more permanent solution, the assumption about how a section is represented may have to be revised.
@astanin I don't think that's the heart of it. The FB2 writer has never expected the AST to be structured into sections. It starts by applying a function renderSections that converts a regular AST into this section Div structure.
It seems that this happens whenever the content is wrapped in a div and doesn't start with a header. Example:
::: wrapper
hello

# MISSING

section one
:::
Output of pandoc -t fb2 for this Markdown, tidy-ed up for readability. Note that the first-level heading is missing.
<?xml version="1.0" encoding="utf-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0"
xmlns:l="http://www.w3.org/1999/xlink">
<description>
<title-info>
<genre>unrecognised</genre>
</title-info>
<document-info>
<program-used>pandoc</program-used>
</document-info>
</description>
<body>
<title>
<p />
</title>
<section>
<p>hello</p>
<p>section one</p>
</section>
</body>
</FictionBook>
I think what's going on is that makeSections just leaves this Div alone, so the expectation of the fb2 writer (which is that after makeSections every Div will have the structure described above) is incorrect?
Here's the result of putting a trace on the block structure produced by makeSections in the FB2 writer:
[ Div
( "" , [ "section" ] , [] )
[ Header 1 ( "" , [] , [] ) []
, Div
( "" , [ "wrapper" ] , [] )
[ Para [ Str "hello" ]
, Div
( "missing" , [ "section" ] , [] )
[ Header 1 ( "" , [] , [] ) [ Str "MISSING" ]
, Para [ Str "section" , Space , Str "one" ]
]
]
]
]
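To make the failure mode concrete, here is a toy Python model of the trace above. It is a sketch under stated assumptions, not pandoc's actual code and not the fix that was pushed: the Header, Para and Div classes, the paragraphs_only and render functions, and the recurse_into_plain_divs flag are all hypothetical. With the flag off, a Div that is not a header-first section is flattened to its paragraphs, which loses the nested heading; with the flag on, the renderer recurses into such Divs.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape


# Toy stand-ins for pandoc's Block constructors (hypothetical, for illustration).
@dataclass
class Header:
    level: int
    text: str


@dataclass
class Para:
    text: str


@dataclass
class Div:
    ident: str
    classes: list
    blocks: list = field(default_factory=list)


def paragraphs_only(blocks):
    """Collect <p> elements from blocks; Headers are silently dropped."""
    out = []
    for b in blocks:
        if isinstance(b, Para):
            out.append(f"<p>{escape(b.text)}</p>")
        elif isinstance(b, Div):
            out.extend(paragraphs_only(b.blocks))
    return out


def render(block, recurse_into_plain_divs):
    """Render one block to an FB2-like string."""
    if isinstance(block, Para):
        return f"<p>{escape(block.text)}</p>"
    if isinstance(block, Header):
        return ""  # a stray Header has no direct FB2 equivalent in this toy model
    if isinstance(block, Div):
        is_section = (
            "section" in block.classes
            and len(block.blocks) > 0
            and isinstance(block.blocks[0], Header)
        )
        if is_section:
            head, rest = block.blocks[0], block.blocks[1:]
            ident = f' id="{block.ident}"' if block.ident else ""
            # An empty heading yields a section without a <title>.
            title = f"<title><p>{escape(head.text)}</p></title>" if head.text else ""
            body = "".join(render(b, recurse_into_plain_divs) for b in rest)
            return f"<section{ident}>{title}{body}</section>"
        if recurse_into_plain_divs:
            # Keep walking, so nested section Divs are still rendered as sections.
            return "".join(render(b, recurse_into_plain_divs) for b in block.blocks)
        # Treat the Div as opaque content and keep only its paragraphs.
        return "".join(paragraphs_only(block.blocks))
    raise TypeError(f"unsupported block: {block!r}")


# The block structure from the trace above.
traced = Div("", ["section"], [
    Header(1, ""),
    Div("", ["wrapper"], [
        Para("hello"),
        Div("missing", ["section"], [Header(1, "MISSING"), Para("section one")]),
    ]),
])

flattened = render(traced, recurse_into_plain_divs=False)
nested = render(traced, recurse_into_plain_divs=True)
```

In this model, flattened contains no MISSING title, while nested keeps it inside a nested section with id "missing".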
Pushed a potential fix, but I don't know enough about FB2 to know if this is right. Output for @tarleb's snippet would be
<?xml version="1.0" encoding="UTF-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0" xmlns:l="http://www.w3.org/1999/xlink"><description><title-info><genre>unrecognised</genre></title-info><document-info><program-used>pandoc</program-used></document-info></description><body><title><p /></title><section><p>hello</p><section id="missing"><title><p>MISSING</p></title><p>section one</p></section></section></body></FictionBook>
Is FB2 okay with a section element without a title element?
The FB2 schema that I found says minOccurs="0" for titles in sections, so it seems that title-less sections are ok.
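A quick way to see which sections in a generated file lack a title is to walk the XML. The sketch below uses Python's standard xml.etree.ElementTree on a shortened copy of the output quoted above; the section_titles helper is a hypothetical name, and the check only inspects structure, it does not validate against the FB2 schema.

```python
import xml.etree.ElementTree as ET

FB2_NS = "http://www.gribuser.ru/xml/fictionbook/2.0"

# Shortened copy of the output quoted above (description block omitted).
doc = (
    '<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">'
    "<body><title><p /></title><section><p>hello</p>"
    '<section id="missing"><title><p>MISSING</p></title>'
    "<p>section one</p></section></section></body></FictionBook>"
)


def section_titles(xml_text):
    """Return (id, has_title) for every section element, in document order."""
    ns = {"fb": FB2_NS}
    root = ET.fromstring(xml_text)
    return [
        (section.get("id"), section.find("fb:title", ns) is not None)
        for section in root.iterfind(".//fb:section", ns)
    ]


report = section_titles(doc)
```

Here the outer section has neither an id nor a title, and the nested one has both.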
Closing this, then.
Explain the problem. Try some Wikipedia article:
and view the fb2 file: The subheaders (e.g. "Rock-hewn castles") are missing.
Pandoc version? 2.17.1.1 on Manjaro (Arch) Linux
Side notes:
Another problem: image captions are missing, but since they don't follow semantic standards and are plain <div> text elements, Wikipedia is to blame for this.
Finally, it would be nice if the <img> width and height attributes were respected, or maybe via CSS somehow?
Locally, I solved 1. and 2. via regex hacking and 3. with ImageMagick command-line tools before converting.