jgm / pandoc

Universal markup converter
https://pandoc.org

Conversions from Google Docs (`.odt`) do not preserve title and subtitle metadata, headings converted to `<p>`-tags. #8924

Open StephanMeijer opened 1 year ago

StephanMeijer commented 1 year ago

Pandoc versions

Pandoc 3.1.2 under macOS (M1)

pandoc 3.1.2
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /Users/steve/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

pandoc 3.1.4
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /Users/steve/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

Examples and reproduction

Previously we created documents using Google Docs. We are using some filters and templating which you can see here.

We are running it with:

pandoc -s --quiet --from=docx --to=html \
    --output=test/<test-dir>/none/expected.html --template=src/template.html \
    test/<test-dir>/input.docx

Examples

| Example input | Example output (using filters as defined above) |
| --- | --- |
| google-docs-title-subtitle-headings-image-lists-odt | google-docs-title-subtitle-headings-image-lists-odt/none/expected.html |

Explanation on examples

As you can see, the title and subtitle are not preserved as metadata in the output: they are rendered as paragraphs in the body instead of in the head. Examples of how they should be rendered can be found here (these examples use a `.docx` file as the source, not an `.odt` file):

Issues:

StephanMeijer commented 1 year ago

Also tested on:

pandoc 3.1.4
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /Users/steve/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

jgm commented 1 year ago

When the Title style is applied in Google Docs, the exported ODT has this:

      <text:p text:style-name="P2">
      <text:bookmark text:name="_zcuel4vwl53z" />The title</text:p>

with the style P2 defined as:

    <style:style style:name="P2" style:family="paragraph"
    style:parent-style-name="Title"
    style:master-page-name="Standard">
      <style:paragraph-properties style:page-number="1" />
    </style:style>

So perhaps the reader could determine that it's a title by looking up the style P2 and then reading its style:parent-style-name attribute.
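
To make the lookup concrete, here is a minimal sketch in Python (not the Haskell used by pandoc's ODT reader, and not pandoc code). It assumes the paragraph and the style definition live in the same XML document, wrapped in the standard `office:document-content` structure. The names `parent_styles`, `style_chain`, `find_titles`, and the `SAMPLE` wrapper are hypothetical; only the `P2` style and the title paragraph come from the snippets above.

```python
import xml.etree.ElementTree as ET

# Standard OpenDocument namespace URIs.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Assumed wrapper around the two fragments quoted in this thread.
SAMPLE = f"""<office:document-content xmlns:office="{OFFICE_NS}"
    xmlns:style="{STYLE_NS}" xmlns:text="{TEXT_NS}">
  <office:automatic-styles>
    <style:style style:name="P2" style:family="paragraph"
                 style:parent-style-name="Title"
                 style:master-page-name="Standard">
      <style:paragraph-properties style:page-number="1" />
    </style:style>
  </office:automatic-styles>
  <office:body>
    <office:text>
      <text:p text:style-name="P2"><text:bookmark text:name="_zcuel4vwl53z" />The title</text:p>
      <text:p text:style-name="Standard">Body text</text:p>
    </office:text>
  </office:body>
</office:document-content>"""


def parent_styles(root):
    """Map each style name to its parent style name, where one is declared."""
    parents = {}
    for style in root.iter(f"{{{STYLE_NS}}}style"):
        name = style.get(f"{{{STYLE_NS}}}name")
        parent = style.get(f"{{{STYLE_NS}}}parent-style-name")
        if name and parent:
            parents[name] = parent
    return parents


def style_chain(name, parents):
    """Return the style name followed by its ancestors, guarding against cycles."""
    chain = [name]
    while chain[-1] in parents and parents[chain[-1]] not in chain:
        chain.append(parents[chain[-1]])
    return chain


def find_titles(xml_text):
    """Return the text of every paragraph whose style inherits from 'Title'."""
    root = ET.fromstring(xml_text)
    parents = parent_styles(root)
    titles = []
    for para in root.iter(f"{{{TEXT_NS}}}p"):
        style = para.get(f"{{{TEXT_NS}}}style-name", "")
        if "Title" in style_chain(style, parents):
            titles.append("".join(para.itertext()).strip())
    return titles
```

Checking the whole ancestor chain, rather than only the direct parent, is a design choice in this sketch: it still matches if an exporter inserts an intermediate style between `P2` and `Title`.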

StephanMeijer commented 1 year ago

Is this an easy fix? If so, I might be able to plan with my team to fix this ourselves.

jgm commented 1 year ago

I don't know how easy it would be. I didn't write the ODT reader, and I'm not too familiar with it.

jgm commented 1 year ago

You'd want to look at constructPara in Readers/ODT/ContentReader, I think. And you'd need a function that gets the parent style name, not just the style name as present.

When a title was recognized, you'd need to insert it into metadata instead of the document.
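
As a rough illustration of that last step, the following Python sketch separates recognized title paragraphs from the body. It is not how pandoc's Haskell reader is structured. The function `split_metadata`, the `(style, text)` pair representation, and the assumption that a subtitle resolves to a parent style named `Subtitle` are all hypothetical.

```python
def split_metadata(paragraphs):
    """Split (resolved_style_name, text) pairs into metadata and body.

    The first paragraph resolving to "Title" or "Subtitle" (assumed style
    names) becomes a metadata entry; everything else stays in the body.
    """
    meta_keys = {"Title": "title", "Subtitle": "subtitle"}
    meta, body = {}, []
    for style, text in paragraphs:
        key = meta_keys.get(style)
        if key is not None and key not in meta:
            meta[key] = text  # goes into metadata, not the document
        else:
            body.append((style, text))
    return meta, body


meta, body = split_metadata([
    ("Title", "The title"),
    ("Subtitle", "A subtitle"),
    ("Standard", "Body text"),
])
```

In this sketch only the first title and subtitle are lifted into metadata; any later paragraph with the same style stays in the body, so no text is silently dropped.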

The ODT reader is written in a strange style, with arrows, so it's a bit unusual.

StephanMeijer commented 1 year ago

@jgm I updated the description of this ticket to also include reproduction information.

StephanMeijer commented 1 year ago

@jgm I updated the description with examples that use no filters.