jgm / pandoc

Universal markup converter
https://pandoc.org

DocBook <info> processing is incomplete/inadequate #7747

Open dreeve4958 opened 2 years ago

dreeve4958 commented 2 years ago

Issue applies to all pandoc versions across all platforms.

Pandoc currently cherry-picks DocBook (5.0 or 4.5) elements, retrieving just enough data (importantly title and author) to allow it to produce a rough rendering of a book-like document. Given what a good job pandoc does of rendering the body of a DocBook document, it is a shame that the equally important meta material that allows assembly of a book's front matter is neglected.

Processing a larger or complete subset of metadata into coherent front matter would allow the rendering of something approaching a publishable book. In this context the <info> element supplies data that will normally be used to produce a book's edition notice / colophon / impressum / dedication pages. Different rendering models tackle this task in different ways and there is no reason why pandoc shouldn't adopt its own model and behaviour.

Without these elements, the output of a book in pandoc from a DocBook source is only a poor approximation of the sort of artefact that one is used to seeing as a "real" publication.

DJR

jgm commented 2 years ago

Info can contain the following elements: abstract, address, annotation, artpagenums, author, authorgroup, authorinitials, bibliocoverage, biblioid, bibliomisc, bibliomset, bibliorelation, biblioset, bibliosource, collab, confgroup, contractnum, contractsponsor, copyright, cover, date, edition, editor, extendedlink, issuenum, itermset, keywordset, legalnotice, mediaobject, org, orgname, othercredit, pagenums, printhistory, productname, productnumber, pubdate, publisher, publishername, releaseinfo, revhistory, seriesvolnums, subjectset, subtitle, title, titleabbrev, volumenum. We could try to handle each of these (or anyway more of them). Or we could try to construct some kind of generic method for handling all the children of info that we don't handle already. I haven't looked at all of these to see how diverse they are.
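
As a rough illustration of the "generic method" idea, here is a minimal sketch in Python (not pandoc's Haskell reader, and not taken from pandoc's code). It assumes that title, author and date already have dedicated handling and that every other child of info can be flattened to plain text under its own element name. The names `HANDLED` and `generic_info_metadata` and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical: elements assumed to have dedicated handling already.
HANDLED = {"title", "author", "date"}

def local_name(tag):
    """Drop a namespace prefix such as '{http://docbook.org/ns/docbook}'."""
    return tag.rsplit("}", 1)[-1]

def generic_info_metadata(xml_text):
    """Collect unhandled children of the first <info>/<bookinfo> as plain text."""
    root = ET.fromstring(xml_text)
    meta = {}
    for elem in root.iter():
        if local_name(elem.tag) not in ("info", "bookinfo"):
            continue
        for child in elem:
            name = local_name(child.tag)
            if name in HANDLED:
                continue
            # Flatten nested text and normalise whitespace; structure is lost.
            text = " ".join(" ".join(child.itertext()).split())
            meta.setdefault(name, []).append(text)
        break  # only the first (document-level) info element
    return meta

INFO_SAMPLE = """<book>
  <bookinfo>
    <title>Happy User Guide</title>
    <copyright><year>1997-2009</year><holder>Simon Marlow</holder></copyright>
    <abstract><para>This document describes Happy.</para></abstract>
  </bookinfo>
</book>"""

print(generic_info_metadata(INFO_SAMPLE))
```

Flattening keeps the text but discards the distinction between, say, the year and the holder inside copyright, so a fallback like this would trade structure for coverage.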

jgm commented 2 years ago

Maybe best if you specify the specific elements you find essential for your purposes.

dreeve4958 commented 2 years ago

Different publishers adopt different house styles for collating this data across pages in a book's front matter. I wonder if the most general solution would be some way of specifying one or more collections of front-matter elements where each collection gets treated like a new-page delimited formatted group. A command line option could then specify the elements in each collection, something like:

--front-matter=title,authorgroup,pubdate,copyright

Supplying a second --front-matter option defines a second collection, and so on.
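
To make the proposed semantics concrete, here is a small Python sketch of how a repeatable option of this shape could be parsed. `--front-matter` is only the option proposed in this comment, not an existing pandoc flag, and `parse_front_matter` is a hypothetical helper.

```python
import argparse

def parse_front_matter(argv):
    """Return one list of element names per --front-matter occurrence."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--front-matter",
        action="append",  # each occurrence adds one collection
        default=[],
        type=lambda s: [name.strip() for name in s.split(",")],
    )
    return parser.parse_args(argv).front_matter

collections = parse_front_matter([
    "--front-matter=title,authorgroup,pubdate,copyright",
    "--front-matter=legalnotice,abstract",
])
print(collections)
```

Each inner list would correspond to one group of front-matter elements, in the order the options were given.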

jgm commented 2 years ago

> new-page delimited formatted group

That's presentation, and it would be the concern of a writer, not a reader module. In the parsing phase we're concerned with structure: in this case, with what fields should be added to the metadata.

The problem with trying for a general solution is that each of these elements of info has its own distinct structure.

andreasabel commented 2 years ago

Ah, been hit by the silent dropping of abstract in the Alex docbook conversion:

The abstract seems to be generally ignored, I tried rst, html, latex targets. Here is a MWE:

<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
   "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<book id="abstract">
  <bookinfo>
    <date>2001-4-27</date>
    <title>Happy User Guide</title>
    <author>
      <firstname>Simon</firstname>
      <surname>Marlow</surname>
    </author>
    <address><email>simonmar@microsoft.com</email></address>
    <copyright>
      <year>1997-2009</year>
      <holder>Simon Marlow</holder>
    </copyright>
    <abstract>
      <para>This document describes Happy, the Haskell Parser
    Generator, version 1.18.</para>
    </abstract>
  </bookinfo>
</book>

andreasabel commented 2 years ago

I think if the DocBook Reader drops stuff it does not understand, it should emit a suitable warning. E.g.:

<file>:<line>:<column>: dropping content of 'copyright' tag
<file>:<line>:<column>: dropping content of 'abstract' tag
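
For illustration only, warnings in this shape can be produced with Python's expat bindings, which expose the position of the tag currently being parsed. This is a sketch of the idea, not how pandoc's reader works; the `UNDERSTOOD` set, the function name and the sample document are assumptions.

```python
import xml.parsers.expat

# Hypothetical: element names the imagined reader knows how to convert.
UNDERSTOOD = {"book", "bookinfo", "title", "author", "firstname", "surname", "date"}

def dropped_element_warnings(xml_text, filename):
    """Return one 'file:line:column' warning per element that would be skipped."""
    warnings = []
    skip_depth = 0  # > 0 while inside an element that is already being dropped
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        nonlocal skip_depth
        if skip_depth:
            skip_depth += 1  # nested inside a dropped element: no extra warning
        elif name not in UNDERSTOOD:
            skip_depth = 1
            line = parser.CurrentLineNumber          # 1-based
            column = parser.CurrentColumnNumber + 1  # expat counts columns from 0
            warnings.append(
                f"{filename}:{line}:{column}: dropping content of '{name}' tag"
            )

    def end(name):
        nonlocal skip_depth
        if skip_depth:
            skip_depth -= 1

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return warnings

WARN_SAMPLE = """<book>
  <bookinfo>
    <title>Happy User Guide</title>
    <copyright>
      <year>1997-2009</year>
      <holder>Simon Marlow</holder>
    </copyright>
    <abstract>
      <para>This document describes Happy.</para>
    </abstract>
  </bookinfo>
</book>"""

for message in dropped_element_warnings(WARN_SAMPLE, "demo.xml"):
    print(message)
```

Only the outermost dropped element is reported, so the children of copyright and abstract do not generate warnings of their own.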

jgm commented 2 years ago

Note: the reader does warn about skipped elements, but only if --verbose is enabled.

andreasabel commented 2 years ago

I tried 326a00ab1a0ff7cd1afff4d6a3417a780d9bdf40 on my MWE but the rst output has not changed (abstract still missing):

$ pandoc -s -f docbook -t rst pandoc-bug-abstract.xml
================
Happy User Guide
================

:Author: Simon Marlow
:Date:   2001-4-27

(But maybe this is WIP and not supposed to work yet.)

jgm commented 2 years ago

That's another issue. The abstract is getting into the pandoc AST, but the RST writer doesn't currently do anything with that. You can probably add it by tweaking the default template.

jgm commented 2 years ago

I'll push a change to the RST template that will add the standard RST bibliographic fields.
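
As a sketch of the kind of output such a template change would need to produce, the Python function below renders a metadata mapping as an RST field list. It is not the pandoc template itself. The field labels are standard docutils bibliographic fields; the metadata keys and the helper name `rst_docinfo` are assumptions for this example.

```python
# (metadata key, RST bibliographic field label); the keys are assumptions.
FIELDS = [
    ("author", "Author"),
    ("date", "Date"),
    ("copyright", "Copyright"),
    ("abstract", "Abstract"),
]

def rst_docinfo(meta):
    """Render the metadata values that are present as RST bibliographic fields."""
    lines = []
    for key, label in FIELDS:
        value = meta.get(key)
        if value:  # skip missing or empty fields
            lines.append(f":{label}: {value}")
    return "\n".join(lines)

print(rst_docinfo({
    "author": "Simon Marlow",
    "date": "2001-4-27",
    "abstract": "This document describes Happy.",
}))
```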