Open kopax opened 4 years ago
@jkr, I'd be curious what you think.
The TOC seems to be in an `sdtContent` element.
http://www.datypic.com/sc/ooxml/e-w_sdtContent-1.html
If this is the normal way autogenerated TOCs come, we could strip it out in the docx reader.
@jgm why not just strip the `<w:sdt>` that wraps the whole thing?
Actually we probably don't want to remove either sdt or sdtContent indiscriminately. This appears to be used for things like equations, bibliographies, etc. http://www.datypic.com/sc/ooxml/e-w_sdtPr-1.html We should target the TOC as specifically as possible.
Sorry to be so behind on this -- I definitely think it makes sense to remove generated TOC data. It shouldn't be particularly hard. It seems, from looking around the spec and a few docs I have, that we should be filtering on the `docPartGallery` (`sdt > sdtPr > docPartObj > docPartGallery`). That matches our output. But it might be worthwhile to collect a few more in-the-wild examples (many of the docs on my computer were produced by pandoc at some point in their history).
Or we could just filter on this one now, and then add other weird undocumented edge-cases as they pop up in practice.
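For collecting in-the-wild examples, something like the following could tally the `docPartGallery` values found in a DOCX file. This is only a sketch: `gallery_values` is a hypothetical helper name, and it assumes the main document part lives at `word/document.xml`, which is the usual location but is not guaranteed.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace (the "w:" prefix).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def gallery_values(docx_file):
    """Count the w:docPartGallery values found under w:sdt/w:sdtPr/w:docPartObj.

    `docx_file` may be a path or a binary file-like object.
    Assumes the main part is stored as word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    counts = Counter()
    path = ".//w:sdt/w:sdtPr/w:docPartObj/w:docPartGallery"
    for gallery in root.iterfind(path, NS):
        counts[gallery.get(f"{{{W}}}val")] += 1
    return counts


# Example use (the file name is a placeholder):
#   print(dict(gallery_values("sample.docx")))
```

Running this over a handful of documents from Word, Google Docs and pandoc itself would show whether `Table of Contents` is the only value worth filtering on.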
> Or we could just filter on this one now, and then add other weird undocumented edge-cases as they pop up in practice.
That makes sense. Maybe the poster of #5893 can post another sample to look at.
Hi y'all. Thanks for looking into this issue!
Following up from #5893, I'm working with the journal editors to confirm they don't mind me providing their DOCX files as examples. Hopefully / probably they won't mind as these files are now published as Open Access articles but we'll see 🤷
I'll follow up as soon as I hear back – probably a few days.
PS: I'm no DOCXpert but, in my examples, it looks like there may be more than one way to generate a TOC – so that's annoying / weird. Hopefully, fixing this issue isn't prohibitively complicated.
Good news! I got the okay to share these DOCX files.
I have attached two DOCX files where the TOC is set up slightly differently. I'm not sure if these differences are technically significant but figured it would be best to provide both as they may be distinct use cases / problems.
In this doc, the TOC does not internally link to the headers (i.e. the TOC is not clickable). When I attempt to convert this DOCX to MD, the TOC is present in the MD, alongside the other front matter – even though I am not specifying the `--toc` flag. For example, my MD looks like:
In this doc, the TOC does internally link to the headers (i.e. the TOC is clickable). When I attempt to convert this DOCX to MD, I have two problems:
Again, both these problems occur even though I am not specifying the `--toc` flag.
Thanks again for looking into this! Hopefully these files help.
Hi. Friendly ping to see if there's any progress on this issue.
I have solved this issue by doing this:
1) Generate the TOC automatically in the docx by using the menu option Insert->Table of Contents.
2) Use `-f docx+styles` in the pandoc command.
3) Use a panflute filter to filter out every div which has TOC in its custom-style attribute.
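Step 3 could be sketched roughly as below. This is not the poster's actual filter: to stay dependency-free it works on pandoc's JSON AST directly rather than through panflute, the helper names are made up for illustration, and it assumes the TOC paragraphs arrive as `Div` blocks whose `custom-style` value starts with "TOC" in any letter case (for example "TOC Heading" or "toc 1").

```python
import json
import sys


def is_toc_div(block):
    """True for a Div whose custom-style attribute starts with 'toc' (any case)."""
    if block.get("t") != "Div":
        return False
    # A Div is encoded as [[identifier, classes, key/value pairs], contents].
    (_ident, _classes, keyvals), _contents = block["c"]
    return any(
        key == "custom-style" and value.lower().startswith("toc")
        for key, value in keyvals
    )


def strip_toc_divs(blocks):
    """Drop TOC-styled Divs, also looking inside other Divs."""
    kept = []
    for block in blocks:
        if is_toc_div(block):
            continue
        if block.get("t") == "Div":
            block["c"][1] = strip_toc_divs(block["c"][1])
        kept.append(block)
    return kept


def main():
    # pandoc sends the document as JSON on stdin and reads it back from stdout.
    doc = json.load(sys.stdin)
    doc["blocks"] = strip_toc_divs(doc["blocks"])
    json.dump(doc, sys.stdout)

# When saved as a filter script, call main() at the end of the file.
```

With a `main()` call added, this would be run as something like `pandoc -f docx+styles --filter ./strip_toc.py input.docx -t markdown`, where `strip_toc.py` is a placeholder file name. Only top-level and nested Divs are visited; TOC Divs inside lists, tables or block quotes would need a fuller walk.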
@ashsharma7 That's an interesting method.
For my project, I am thinking of pre-editing the `.xml` files in the `.docx` file (it's an archive) to filter out those styles before conversion.
@StephanMeijer Filtering XML means using more Python libraries, so it's an overhead.
I tried filtering things in XML, but Word XML is not an easy or consistent format. Too many edge cases there. Keep me posted. 👍
@ashsharma7 Yeah it's a bit more overhead, but filtering it after conversion means more useless `<div>`s.
@ashsharma7 You would be able to do it on the command line with `xmlstarlet` after unzipping a DOCX.
Command would look like this:
$ xmlstarlet ed -L -d "//w:sdt[w:sdtPr/w:docPartObj/w:docPartGallery/@w:val='Table of Contents']" document.xml
This seems to work for properly formatted tables of contents for both Microsoft Word and Google Docs.
After that, you can zip it back.
Haven't tested if this corrupts the `.docx` but I suspect it won't.
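The same unzip, delete, re-zip round trip can be sketched with only the Python standard library. The function names here are hypothetical, and the result has not been checked against Word: `ElementTree` re-serialises the XML and may drop or rename namespace declarations that Word expects, so treat this as an illustration of the selection logic rather than a safe rewriting tool (a library such as lxml would preserve more of the original file).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace (the "w:" prefix).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def is_toc_sdt(elem):
    """True for a w:sdt whose docPartGallery is 'Table of Contents'."""
    if elem.tag != f"{{{W}}}sdt":
        return False
    gallery = elem.find("w:sdtPr/w:docPartObj/w:docPartGallery", NS)
    return gallery is not None and gallery.get(f"{{{W}}}val") == "Table of Contents"


def remove_toc_sdt(document_xml):
    """Return document.xml as bytes with every TOC w:sdt element removed."""
    ET.register_namespace("w", W)  # keep the familiar "w:" prefix on output
    root = ET.fromstring(document_xml)
    # ElementTree has no parent pointers, so visit each parent and drop
    # matching children from it.
    for parent in list(root.iter()):
        for child in list(parent):
            if is_toc_sdt(child):
                parent.remove(child)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def strip_toc_from_docx(src, dst):
    """Copy the archive src to dst, rewriting only word/document.xml."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                data = remove_toc_sdt(data)
            zout.writestr(item, data)
```

Usage would be along the lines of `strip_toc_from_docx("in.docx", "out.docx")`, with both file names as placeholders.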
@StephanMeijer Should work. My issue is that a lot of times, there is XML inserted in different places across a closing tag. For example some overlapping styles can trip this up (like adding a Bold style on the last line of the TOC and the first line of the next bit of content). And it's easy to get that in a docx if you are copy-pasting content from one docx to another.
I try to avoid dealing with XML as much as possible. I did try a few ideas using the python-docx library but ultimately abandoned that approach after I saw everything that can be in these Word XML files. YMMV.
> For example some overlapping styles can trip this up (like adding a Bold style on last line of TOC and first line of the next bit of content).
@ashsharma7 Would you be so kind as to share some of these documents? It would also be very interesting for my specific use case.
I'd be happy to exclude the generated TOC, but it doesn't look easy to recognize. Several paragraphs at the same level, one for the TOC title, then several more. What is the "signature" of these elements that we can count on across different versions, etc.? Looking for styles of the form TOCxx would work for the linked TOC document you uploaded -- except for the TOC title. But is that always present?
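To make the style-based idea concrete, here is a rough sketch that labels each paragraph in a `document.xml` by whether its paragraph style ID has the form `TOC` followed by digits. The function name is hypothetical, and the assumption that the IDs look like `TOC1`, `TOC2`, ... has not been confirmed across Word versions, localisations or Google Docs exports. A TOC title that carries no such style is not matched, which is exactly the gap described above.

```python
import re
import xml.etree.ElementTree as ET

# WordprocessingML namespace (the "w:" prefix).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Assumed shape of TOC entry style IDs: "TOC1", "TOC2", ...
TOC_STYLE = re.compile(r"^TOC\d+$", re.IGNORECASE)


def classify_paragraphs(document_xml):
    """Return (style_id, text, looks_like_toc_entry) for every w:p element."""
    root = ET.fromstring(document_xml)
    rows = []
    for para in root.iter(f"{{{W}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # Concatenate the text runs of the paragraph.
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        is_toc = bool(style_id and TOC_STYLE.match(style_id))
        rows.append((style_id, text, is_toc))
    return rows
```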
```python
from panflute import Span, debug

PASS_THRU_CSS_STYLES = ['tag', 'hyperlink', 'footnote-reference', 'toc-1', 'toc-2',
                        'toc-3', 'toc-4', 'toc-5', 'toc-6', 'toc-7']


def process_custom_style_string(custom_style):
    # Normalise a style name, e.g. "TOC 1" -> "toc-1".
    return '-'.join(custom_style.strip().lower().split(' '))


def found_in_known_styles(custom_style_css):
    # or custom via our css.
    for item in PASS_THRU_CSS_STYLES:
        # Match against each known style in turn.
        if item in custom_style_css:
            return True
    return False


def handle_span_styles(elem, doc):
    if isinstance(elem, Span) and 'custom-style' in elem.attributes.keys():
        custom_style = elem.attributes['custom-style']
        role_tag = process_custom_style_string(custom_style)
        debug('\n Span role tag: ' + role_tag + '\n')
        if found_in_known_styles(role_tag):
            return elem
```
I don't see the string `toc-` in the linked TOC example above. So this wouldn't work for that.
@jgm It will only work if we use `-f docx+styles` and the TOC was created using MS Word's Insert->Table of Contents.
> I'd be happy to exclude the generated TOC, but it doesn't look easy to recognize. Several paragraphs at the same level, one for the TOC title, then several more
Please note that a Table of Contents block can also contain text paragraphs, such as a foreword; I have seen examples of this.
> What is the "signature" of these elements that we can count on across different versions, etc.?
Usually `//w:sdt[w:sdtPr/w:docPartObj/w:docPartGallery/@w:val='Table of Contents']`, but this is not always the case, nor does it always select only elements that are strictly part of the Table of Contents, as the block could also contain text paragraphs.
I am using pandoc `v1.7` and `v2.7.3` to convert a `.docx` document into `.pdf` using the `--toc` option.

The `.docx` file was created and exported from Google Docs; you can download it: sample.docx wetransfer link.

v1.7 pdf generation

The PDF has one TOC:

- the one from `--toc`.

v2.7.3 pdf generation

The PDF has two TOCs:

- the one from `--toc`.
- the TOC already in the `.docx`, which is not removed (and is poorly formatted).

Expected

I expect to have the same result as in pandoc `v1.7`.

Possible solution

I am about to achieve this on my side but I'd like to know first if it can be restored somehow.