Open dreeve4958 opened 2 years ago
Info can contain the following elements: abstract, address, annotation, artpagenums, author, authorgroup, authorinitials, bibliocoverage, biblioid, bibliomisc, bibliomset, bibliorelation, biblioset, bibliosource, collab, confgroup, contractnum, contractsponsor, copyright, cover, date, edition, editor, extendedlink, issuenum, itermset, keywordset, legalnotice, mediaobject, org, orgname, othercredit, pagenums, printhistory, productname, productnumber, pubdate, publisher, publishername, releaseinfo, revhistory, seriesvolnums, subjectset, subtitle, title, titleabbrev, volumenum. We could try to handle each of these (or at least more of them). Or we could try to construct some kind of generic method for handling all the children of info that we don't handle already. I haven't looked at all of these to see how diverse they are.
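To make the "generic method" idea concrete, here is a rough Python sketch (not pandoc's Haskell reader, just an illustration). It assumes an un-namespaced DocBook 4 style document, and the `HANDLED` set and the function name are hypothetical. Every child of info is flattened to its text content and stored under its tag name, which also shows the limits of the approach: structured children such as copyright lose their internal structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the info children that already get dedicated handling.
HANDLED = {"title", "author", "date"}

def collect_info_metadata(xml_text):
    """Return (metadata, unhandled_tags) for the bookinfo/info element."""
    root = ET.fromstring(xml_text)
    info = root.find("bookinfo")
    if info is None:
        info = root.find("info")
    metadata, unhandled = {}, []
    if info is None:
        return metadata, unhandled
    for child in info:
        # Generic fallback: join all text nodes and normalise whitespace.
        text = " ".join(" ".join(child.itertext()).split())
        if child.tag not in HANDLED:
            unhandled.append(child.tag)
        metadata.setdefault(child.tag, []).append(text)
    return metadata, unhandled
```

With this fallback an abstract survives as plain text, but a copyright element collapses to a single string holding both year and holder, so elements like that would still want their own handling.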
Maybe best if you specify the specific elements you find essential for your purposes.
Different publishers adopt different house styles for collating this data across pages in a book's front matter. I wonder if the most general solution would be some way of specifying one or more collections of front-matter elements where each collection gets treated like a new-page delimited formatted group. A command line option could then specify the elements in each collection, something like:
--front-matter=title,authorgroup,pubdate,copyright
Supplying a second --front-matter option defines a second collection, and so on.
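As a sketch of how a repeatable option like this could be parsed (in Python with argparse, purely for illustration; --front-matter is the option proposed above, not an existing pandoc flag, and the function name is made up):

```python
import argparse

def parse_front_matter(argv):
    """Return one list of element names per --front-matter occurrence."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--front-matter",
        action="append",   # each occurrence adds a new collection
        default=[],
        metavar="ELEMENTS",
        help="comma-separated info elements forming one collection",
    )
    args = parser.parse_args(argv)
    return [
        [name.strip() for name in group.split(",") if name.strip()]
        for group in args.front_matter
    ]
```

Passing the option twice yields two collections in the order given, which is all a later stage would need in order to group the elements.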
new-page delimited formatted group
That's presentation, and it would be the concern of a writer, not a reader module. In the parsing phase we're concerned with structure. In this case, the question is what fields should be added to the metadata.
The problem with trying for a general solution is that each of these elements of info has its own distinct structure.
Ah, been hit by the silent dropping of abstract in the Alex docbook conversion:
The abstract seems to be generally ignored, I tried rst, html, latex targets. Here is a MWE:
<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<book id="abstract">
<bookinfo>
<date>2001-4-27</date>
<title>Happy User Guide</title>
<author>
<firstname>Simon</firstname>
<surname>Marlow</surname>
</author>
<address><email>simonmar@microsoft.com</email></address>
<copyright>
<year>1997-2009</year>
<holder>Simon Marlow</holder>
</copyright>
<abstract>
<para>This document describes Happy, the Haskell Parser
Generator, version 1.18.</para>
</abstract>
</bookinfo>
</book>
I think if the DocBook Reader drops stuff it does not understand, it should emit a suitable warning. E.g.:
<file>:<line>:<column>: dropping content of 'copyright' tag
<file>:<line>:<column>: dropping content of 'abstract' tag
Note: the reader does warn about skipped elements, but only if --verbose is enabled.
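For illustration only, warnings in the `<file>:<line>:<column>` shape suggested above could be produced along these lines. This is a Python sketch using the standard expat bindings, not how pandoc's reader works; the function name and the `handled` set are hypothetical, and only direct children of bookinfo/info are checked.

```python
import xml.parsers.expat

def dropped_element_warnings(xml_text, filename, handled):
    """List a warning for each child of bookinfo/info not in `handled`."""
    warnings = []
    stack = []  # names of the currently open elements
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        if stack and stack[-1] in ("bookinfo", "info") and name not in handled:
            line = parser.CurrentLineNumber
            col = parser.CurrentColumnNumber + 1  # expat columns are 0-based
            warnings.append(
                f"{filename}:{line}:{col}: dropping content of '{name}' tag"
            )
        stack.append(name)

    def end(name):
        stack.pop()

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return warnings
```

Whether such messages should appear by default or only under --verbose is a separate policy question; the sketch only shows that the position information is cheap to obtain.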
I tried 326a00ab1a0ff7cd1afff4d6a3417a780d9bdf40 on my MWE but the rst output has not changed (abstract still missing):
$ pandoc -s -f docbook -t rst pandoc-bug-abstract.xml
================
Happy User Guide
================
:Author: Simon Marlow
:Date: 2001-4-27
(But maybe this is WIP and not supposed to work yet.)
That's another issue. The abstract is getting into the pandoc AST, but the RST writer doesn't currently do anything with that. You can probably add it by tweaking the default template.
I'll push a change to the RST template that will add the standard RST bibliographic fields.
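As a rough picture of what such bibliographic fields look like in the output (a Python sketch, not the actual pandoc template; the function name and the default field order are assumptions), metadata can be rendered as an RST field list placed right after the title:

```python
def rst_field_list(metadata, fields=("author", "date", "copyright", "abstract")):
    """Render selected metadata entries as an RST field list."""
    lines = []
    for key in fields:
        value = metadata.get(key)
        if not value:
            continue  # skip fields that are missing or empty
        first, *rest = str(value).splitlines()
        lines.append(f":{key.capitalize()}: {first}")
        # Continuation lines must be indented to stay inside the field body.
        lines.extend("   " + line for line in rest)
    return "\n".join(lines)
```

For the MWE above this would add an `:Abstract:` entry after `:Author:` and `:Date:`, with the second line of the abstract indented under it.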
Issue applies to all pandoc versions across all platforms.
Pandoc currently cherry-picks DocBook (5.0 or 4.5) elements, retrieving just enough data (importantly title and author) to allow it to produce a rough rendering of a book-like document. Given what a good job pandoc does of rendering the body of a DocBook document, it is a shame that the equally important meta material that allows assembly of a book's front matter is neglected.
Processing a larger or complete subset of metadata into coherent front matter would allow the rendering of something approaching a publishable book. In this context the info element supplies data that will normally be used to produce a book's edition notice / colophon / impressum / dedication pages. Different rendering models tackle this task in different ways and there is no reason why pandoc shouldn't adopt its own model and behaviour.
Without these elements, the output of a book in pandoc from a DocBook source is only a poor approximation of the sort of artefact that one is used to seeing as a "real" publication.
DJR