jgm / pandoc

Universal markup converter
https://pandoc.org

Docx reader: metadata recognition is blocked if other elements come before title #8986

Open StephanMeijer opened 1 year ago

StephanMeijer commented 1 year ago

Explain the problem.

How does the .docx reader in Pandoc determine the top style, such as Title, and what implications does this approach have for international documents? Specifically, in Dutch (NL) documents, the top style for Title is often named Title but has a Style ID of Titel (the Dutch translation for Title).

I believe this might result in the Title being converted to a regular paragraph.

Pandoc version?

pandoc 3.1.6
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /Users/steve/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

Example

This example has been anonymized and therefore contains gibberish.

[Image: visual representation of the document in Microsoft Word]

Expected

Expected "Znzxar txfnfdcestx turpfmdrhpff" to be marked as the Title of this document as seen in the screenshot above.

Actual

Text "Znzxar txfnfdcestx turpfmdrhpff" is marked as a regular text paragraph, not as the Title of the document.

Sources:

Input: input.docx

Output: none/expected.html


Our internal ID: NLDOC-837

StephanMeijer commented 1 year ago

I'm currently unable to provide examples, but I think the HeadingInPairs attribute in docProps/app.xml might have something to do with this.

StephanMeijer commented 1 year ago

ChatGPT guesses that this Word file was created with Microsoft Word 2016.

StephanMeijer commented 1 year ago

Example added.

jgm commented 1 year ago

You are right: we just look at the style.

metaStyles :: M.Map ParaStyleName T.Text
metaStyles = M.fromList [ ("Title", "title")
                        , ("Subtitle", "subtitle")
                        , ("Author", "author")
                        , ("Date", "date")
                        , ("Abstract", "abstract")]

Paragraphs with these styles turn into metadata values. I'm not sure how to deal with the full variety of style names in other languages.

jgm commented 1 year ago

I would have assumed that the style ID would stay the same in localizations, while the style name changes, but you are reporting the reverse. It would be good to have more information here from others using localized versions of Word.

EDIT: Also, the style names above are compared against style names, not ids, so it should work if your style name is really "Title".
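
To illustrate the name-versus-ID distinction, here is a minimal sketch of a lookup keyed on the style name. The ParaStyle record, its field names, and metaFieldFor are hypothetical and are not pandoc's actual types; the sketch also uses a plain association list and exact string matching, which is an assumption about how names are compared.

```haskell
-- Hypothetical, simplified model of a Word paragraph style.
-- These names are illustrative and are not pandoc's own types.
data ParaStyle = ParaStyle
  { styleId   :: String  -- the w:styleId attribute, e.g. "Titel" in a Dutch document
  , styleName :: String  -- the w:name value, e.g. "Title"
  } deriving (Eq, Show)

-- Same pairs as the metaStyles map quoted above, as an association list
-- so that the sketch needs nothing beyond the Prelude.
metaStyles :: [(String, String)]
metaStyles =
  [ ("Title", "title")
  , ("Subtitle", "subtitle")
  , ("Author", "author")
  , ("Date", "date")
  , ("Abstract", "abstract")
  ]

-- | Metadata field for a style, decided by the style NAME only.
-- The style ID is ignored (assumption: exact, case-sensitive match).
metaFieldFor :: ParaStyle -> Maybe String
metaFieldFor style = lookup (styleName style) metaStyles

main :: IO ()
main = do
  -- Localized ID, English name: still recognized.
  print (metaFieldFor (ParaStyle "Titel" "Title"))   -- Just "title"
  -- English ID, localized name: not recognized.
  print (metaFieldFor (ParaStyle "Title" "Titel"))   -- Nothing
```

Under this model, a document whose style has the ID Titel but the name Title would still map to the title field.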

jgm commented 1 year ago

OK, this doesn't have anything to do with the style ID or with the language.

Pandoc looks for "metadata" paragraphs only at the beginning of the document. Since your document begins with another element (an image of a cat), the paragraph is not treated as metadata. Removing the cat picture causes the text to be treated as a title.
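
The "only at the beginning" behavior can be modeled roughly as follows. This is a simplified sketch, not pandoc's actual reader code: the Block type, metaKey, and splitLeadingMeta are hypothetical names, and the model assumes that any non-metadata block ends the metadata section.

```haskell
import Data.Maybe (isJust)

-- Hypothetical, simplified document model (not pandoc's AST).
data Block
  = Para String String  -- style name, paragraph text
  | Image String        -- image description
  deriving (Eq, Show)

metaStyles :: [(String, String)]
metaStyles =
  [ ("Title", "title")
  , ("Subtitle", "subtitle")
  , ("Author", "author")
  , ("Date", "date")
  , ("Abstract", "abstract")
  ]

-- | Metadata key for a block, if it is a paragraph with a metadata style.
metaKey :: Block -> Maybe String
metaKey (Para style _) = lookup style metaStyles
metaKey _              = Nothing

-- | Take metadata only from the leading run of metadata paragraphs.
-- The first non-metadata block ends the run; everything from there on is body.
splitLeadingMeta :: [Block] -> ([(String, String)], [Block])
splitLeadingMeta blocks = (meta, body)
  where
    (leading, body) = span (isJust . metaKey) blocks
    meta = [ (key, text) | b@(Para _ text) <- leading, Just key <- [metaKey b] ]

main :: IO ()
main = do
  let title = Para "Title" "Znzxar txfnfdcestx turpfmdrhpff"
  -- Image first: the title paragraph stays in the body.
  print (fst (splitLeadingMeta [Image "cat", title]))  -- []
  -- Image removed: the title is picked up as metadata.
  print (fst (splitLeadingMeta [title]))  -- [("title","Znzxar txfnfdcestx turpfmdrhpff")]
```

In this model the cat picture at position zero makes the leading run empty, which matches the reported result.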

StephanMeijer commented 1 year ago

@jgm Is this an issue that ought to be resolved in Pandoc itself, or is it better to do some preprocessing on our side?

This anonymized example is based on a real-life document we got. So I assume it might be best to fix it in Pandoc? Or is the setup like this invalid per standards?

jgm commented 1 year ago

The pandoc behavior is intentional, but it could be changed. The current approach is conservative: we don't want to pick up a "Date" style that occurs deep in the body of the document as the metadata date... Changing it might produce some unexpected effects.

StephanMeijer commented 1 year ago

Shouldn't specific styles always be considered 'metadata', such as "Title" or "Subtitle"?

jgm commented 1 year ago

Who knows? Word has a Date style. Is it intended for the document date only? Or is it something one might use for other dates in the body of the document? In fact, a user could use it either way. If they did the latter we'd be picking up dates from the body of the document and treating them as metadata.

I'm tempted to change things so that these styles are always considered metadata, even if they don't come at the beginning, but I'm also resisting the temptation, because it might have bad results -- and these would only become evident after the change was made. I think it was probably done this way for a reason.
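
For comparison, the tempting alternative (treating these styles as metadata wherever they occur) might look like the sketch below. It is a hypothetical model, not a proposed patch: Block, classify, and collectAllMeta are illustrative names, and the sample document is invented to show the Date concern.

```haskell
import Data.Either (partitionEithers)

-- Hypothetical, simplified document model (not pandoc's AST).
data Block
  = Para String String  -- style name, paragraph text
  | Image String        -- image description
  deriving (Eq, Show)

metaStyles :: [(String, String)]
metaStyles =
  [ ("Title", "title")
  , ("Subtitle", "subtitle")
  , ("Author", "author")
  , ("Date", "date")
  , ("Abstract", "abstract")
  ]

-- | Left for a metadata paragraph (key, text), Right for a body block.
classify :: Block -> Either (String, String) Block
classify b@(Para style text) =
  maybe (Right b) (\key -> Left (key, text)) (lookup style metaStyles)
classify b = Right b

-- | Pull every metadata-styled paragraph out of the document,
-- regardless of its position.
collectAllMeta :: [Block] -> ([(String, String)], [Block])
collectAllMeta = partitionEithers . map classify

main :: IO ()
main = do
  -- Invented example: a Date style used for a date inside the body text.
  let doc = [ Para "Title" "Report"
            , Para "Normal" "Meeting held on"
            , Para "Date" "3 May 2021"
            ]
  print (fst (collectAllMeta doc))  -- [("title","Report"),("date","3 May 2021")]
```

The body date is lifted out of the text and becomes the document date, which is the kind of unexpected effect described above.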

StephanMeijer commented 1 year ago

Could you give me examples of metadata styles that are not used as metadata in other contexts?

jgm commented 1 year ago

See my previous comment on Date. Do I have a real-world example? No. I try to deal with Word documents as little as possible. But as I said: anyone can apply the Date style anywhere in the document they wish! So, maybe there are lots of documents where the Date style is not used for metadata. That would be my guess, anyway.

StephanMeijer commented 1 year ago

I will try to compose some examples and strategies for extracting certain parts of metadata.