jgm / pandoc

Universal markup converter
https://pandoc.org
Other

Remove TOC within docx before PDF generation #5810

Open kopax opened 4 years ago

kopax commented 4 years ago

I am using pandoc v1.7 and v2.7.3 to convert a .docx document into a .pdf using the --toc option.

The .docx file was created in and exported from Google Docs; you can download it: sample.docx wetransfer link.

image

v1.7 pdf generation

The PDF has one TOC:

v2.7.3 pdf generation

The PDF has two TOCs:

image

Expected

I expect to have the same result as in pandoc v1.7

Possible solution

I am about to achieve this on my side but I'd like to know first if it can be restored somehow.

jgm commented 4 years ago

@jkr, I'd be curious what you think. The toc seems to be in a sdtContent element. http://www.datypic.com/sc/ooxml/e-w_sdtContent-1.html If this is the normal way autogenerated TOCs come, we could strip it out in the docx reader.

kopax commented 4 years ago

@jgm why not just the <w:sdt> that wraps the whole thing?

jgm commented 4 years ago

Actually we probably don't want to remove either sdt or sdtContent indiscriminately. This appears to be used for things like equations, bibliographies, etc. http://www.datypic.com/sc/ooxml/e-w_sdtPr-1.html We should target the toc as specifically as possible.

jkr commented 4 years ago

Sorry to be so behind on this -- I definitely think it makes sense to remove generated TOC data. It shouldn't be particularly hard. It seems, from looking around the spec and a few docs I have, that we should be filtering on the docPartGallery (sdt > sdtPr > docPartObj > docPartGallery). That matches with our output. But it might be worthwhile to collect a few more in-the-wild examples (many of the docs on my computer were produced by pandoc at some point in their history).

Or we could just filter on this one now, and then add other weird undocumented edge-cases as they pop up in practice.
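The docPartGallery check described above can be sketched outside pandoc with Python's standard library. This is only an illustration, not the docx reader's implementation: the function names `is_toc_sdt` and `strip_toc_sdts` are hypothetical, and the gallery value "Table of Contents" is assumed from the samples discussed in this thread.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
SDT_TAG = f"{{{W_NS}}}sdt"
VAL_ATTR = f"{{{W_NS}}}val"

# Keep the conventional "w" prefix when serializing.
ET.register_namespace("w", W_NS)


def is_toc_sdt(sdt):
    """True if sdt > sdtPr > docPartObj > docPartGallery marks a TOC."""
    gallery = sdt.find("w:sdtPr/w:docPartObj/w:docPartGallery", NS)
    # Assumption: generated TOCs carry exactly this gallery value.
    return gallery is not None and gallery.get(VAL_ATTR) == "Table of Contents"


def _strip(parent):
    """Remove TOC sdt children recursively; return how many were removed."""
    removed = 0
    for child in list(parent):
        if child.tag == SDT_TAG and is_toc_sdt(child):
            parent.remove(child)
            removed += 1
        else:
            removed += _strip(child)
    return removed


def strip_toc_sdts(document_xml):
    """Return (new_xml, removed_count) for the content of word/document.xml."""
    root = ET.fromstring(document_xml)
    removed = _strip(root)
    return ET.tostring(root, encoding="unicode"), removed
```

Other sdt elements (bibliographies, equations) are left alone because only the gallery value is tested. One caveat: ElementTree may rename or drop namespace prefixes other than `w` when serializing, so for real documents a prefix-preserving parser is probably the safer choice.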

jgm commented 4 years ago

> Or we could just filter on this one now, and then add other weird undocumented edge-cases as they pop up in practice.

That makes sense. Maybe the poster of #5893 can post another sample to look at.

coryschires commented 4 years ago

Hi y'all. Thanks for looking into this issue!

Following up from #5893, I'm working with the journal editors to confirm they don't mind me providing their DOCX files as examples. Hopefully / probably they won't mind as these files are now published as Open Access articles but we'll see 🤷‍♂

I'll follow up as soon as I hear back – probably a few days.

PS: I'm no DOCXpert but, in my examples, it looks like there may be more than one way to generate TOC – so that's annoying / weird. Hopefully, fixing this issue isn't prohibitively complicated.

coryschires commented 4 years ago

Good news! I got the okay to share these DOCX files.

I have attached two DOCX files where the TOC is set up slightly differently. I'm not sure if these differences are technically significant but figured it would be best to provide both as they may be distinct use cases / problems.

Table of contents – not linked

In this doc, the TOC does not internally link to the headers (i.e. the TOC is not clickable). When I attempt to convert this DOCX to MD, the TOC is present in the MD, alongside the other front matter – even though I am not specifying the --toc flag. For example, my MD looks like:

toc_not_linked

9958_toc_not_linked.docx

Table of contents – linked

In this doc, the TOC does internally link to the headers (i.e. the TOC is clickable). When I attempt to convert this DOCX to MD, I have two problems:

  1. The TOC is present in the MD, alongside the other front matter (same as non-linked example)
  2. The document headers include empty, unwanted span tags

Again, both these problems occur even though I am not specifying the --toc flag.

toc_linked_1

Screen_Shot_2019-11-15_at_12_18_06_PM

10017_toc_linked.docx

Thanks again for looking into this! Hopefully these files help.

coryschires commented 4 years ago

Hi. Friendly ping to see if there's any progress on this issue.

ashsharma7 commented 4 years ago

I have solved this issue by doing this:

  1. Generate the TOC automatically in the docx by using the menu option Insert->Table of Contents.
  2. Use -f docx+styles in the pandoc command.
  3. Use a panflute filter to filter out every div which has TOC in its custom-style attribute.
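A rough sketch of the filtering step, written against pandoc's JSON AST with only the standard library rather than panflute, so it is a swapped-in technique and not the filter described above. The function names are hypothetical, and the case-insensitive "toc" substring match on custom-style is an assumption about how Word's TOC paragraph styles are named; it is deliberately loose and could catch unrelated styles.

```python
import json


def custom_style(div):
    """Return the custom-style attribute of a Div block, or ''."""
    _ident, _classes, keyvals = div["c"][0]
    return dict(keyvals).get("custom-style", "")


def is_toc_div(block):
    """True for a Div whose custom-style mentions 'toc' (assumed naming)."""
    return block["t"] == "Div" and "toc" in custom_style(block).lower()


def drop_toc_divs(blocks):
    """Return blocks without TOC-styled Divs, descending into other Divs."""
    kept = []
    for block in blocks:
        if is_toc_div(block):
            continue
        if block["t"] == "Div":
            # A Div is [attr, [blocks]]; clean its children as well.
            block["c"][1] = drop_toc_divs(block["c"][1])
        kept.append(block)
    return kept


def filter_document(json_text):
    """Take pandoc's JSON document as text and return the filtered JSON text."""
    doc = json.loads(json_text)
    doc["blocks"] = drop_toc_divs(doc["blocks"])
    return json.dumps(doc)
```

To use this as a filter, a small script would pass standard input through `filter_document`, write the result to standard output, and be given to pandoc with `--filter` together with `-f docx+styles`.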

StephanMeijer commented 1 year ago

@ashsharma7 That's an interesting method.

For my project, I am thinking of pre-editing the .xml files in the .docx file (it's an archive) to filter out those styles before conversion.

ashsharma7 commented 1 year ago

@StephanMeijer Filtering xml means using more python libraries. So it's an overhead.

I tried filtering things in xml but word xml is not an easy and consistent format. Too many edge cases there. Keep me posted. 👍

StephanMeijer commented 1 year ago

@ashsharma7 Yeah it's a bit more overhead, but filtering it after conversion means more useless <div>'s.

StephanMeijer commented 1 year ago

@ashsharma7 You would be able to do it on the command line with xmlstarlet after unzipping a DOCX.

Command would look like this:

$ xmlstarlet ed -L -d "//w:sdt[w:sdtPr/w:docPartObj/w:docPartGallery/@w:val='Table of Contents']" document.xml

This seems to work for properly formatted tables of contents for both Microsoft Word and Google Docs.

After that, you can zip it back.

Haven't tested if this corrupts the .docx but I suspect it won't.
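For the unzip, edit, and zip-back round trip, here is a minimal Python sketch using only `zipfile`. The helper name `rewrite_docx_part` and the `transform` callback are hypothetical; the sketch assumes the main document lives at `word/document.xml` and, like the command above, it has not been checked against Word for corruption.

```python
import zipfile


def rewrite_docx_part(src, dst, transform, part_name="word/document.xml"):
    """Copy a .docx archive, passing one part's bytes through transform.

    src and dst may be file paths or binary file-like objects.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == part_name:
                # Only the chosen part is modified; the rest is copied as-is.
                data = transform(data)
            # Reusing the ZipInfo keeps entry order, names, and timestamps.
            zout.writestr(info, data)
```

The `transform` argument receives and returns bytes, so any XML edit that works on the contents of document.xml can be plugged in, for example `rewrite_docx_part("in.docx", "out.docx", my_edit)` with `my_edit` being your own function.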

ashsharma7 commented 1 year ago

@StephanMeijer Should work. My issue is that a lot of times, there is xml inserted in different places across a closing tag. For example some overlapping styles can trip this up (like adding a Bold style on last line of TOC and first line of the next bit of content). And it's easy to get in a docx if you are copy-pasting content from one docx to another.

I try to avoid dealing in xml as much as possible. I did run a few ideas using python-docx library but ultimately abandoned that approach after I saw all that can be in these word xmls. YMMV.

StephanMeijer commented 1 year ago

> For example some overlapping styles can trip this up (like adding a Bold style on last line of TOC and first line of the next bit of content).

@ashsharma7 Would you be so kind as to share some of these documents? It would also be very interesting for my specific use case.

jgm commented 1 year ago

I'd be happy to exclude the generated TOC, but it doesn't look easy to recognize. Several paragraphs at the same level, one for the TOC title, then several more. What is the "signature" of these elements that we can count on across different versions, etc.? Looking for styles of the form TOCxx would work for the linked TOC document you uploaded -- except for the TOC title. But is that always present?
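To see how far the style-based signature goes on a given document, a small stdlib sketch can list the paragraphs whose style id starts with "TOC". The helper name and the `^TOC` pattern are assumptions for illustration only; as noted above, the TOC title paragraph may not carry such a style, so this would miss it.

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
# Assumption: TOC entry paragraphs use style ids such as TOC1, TOC2, ...
TOC_STYLE = re.compile(r"^TOC", re.IGNORECASE)


def toc_styled_paragraphs(document_xml):
    """Return (style_id, text) for each paragraph with a TOC-like style."""
    root = ET.fromstring(document_xml)
    hits = []
    for para in root.iter(f"{{{W_NS}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        if style_id and TOC_STYLE.match(style_id):
            # Concatenate the text runs to show what the entry says.
            text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
            hits.append((style_id, text))
    return hits
```

Running it over the contents of word/document.xml from several sample files would show whether the TOCxx styles are present consistently enough to rely on.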

ashsharma7 commented 1 year ago

```python
from panflute import Span, debug

PASS_THRU_CSS_STYLES = ['tag', 'hyperlink', 'footnote-reference', 'toc-1', 'toc-2',
                        'toc-3', 'toc-4', 'toc-5', 'toc-6', 'toc-7']


def process_custom_style_string(custom_style):
    # Normalize a style name, e.g. "TOC 1" becomes "toc-1".
    return '-'.join(custom_style.strip().lower().split(' '))


def found_in_known_styles(custom_style_css):
    # we need to return True or False depending on whether we want docutils to handle it
    # or custom via our css.
    for item in PASS_THRU_CSS_STYLES:
        # The original tested the literal 'tag' on every iteration;
        # checking the loop variable is presumably what was intended.
        if item in custom_style_css:
            return True
    return False


def handle_span_styles(elem, doc):
    if isinstance(elem, Span) and 'custom-style' in elem.attributes.keys():
        custom_style = elem.attributes['custom-style']
        role_tag = process_custom_style_string(custom_style)
        debug('\n Span role tag: ' + role_tag + '\n')
        if found_in_known_styles(role_tag):
            return elem
```

jgm commented 1 year ago

I don't see the string toc- in the linked TOC example above. So this wouldn't work for that.

ashsharma7 commented 1 year ago

@jgm It will only work if we use -f docx+styles and the TOC was created using MS Word's Insert->Table of Contents

StephanMeijer commented 2 months ago

> I'd be happy to exclude the generated TOC, but it doesn't look easy to recognize. Several paragraphs at the same level, one for the TOC title, then several more

Please note that a table of contents can also contain text paragraphs, such as forewords; I have seen examples of those.

> What is the "signature" of these elements that we can count on across different versions, etc.?

Usually //w:sdt[w:sdtPr/w:docPartObj/w:docPartGallery/@w:val='Table of Contents'], but this is not always the case, and the match is not always limited to elements that belong strictly to the table of contents, since it can also contain text paragraphs.