Closed by mdlincoln 5 years ago
I may be able to address this by passing the docx through `markdown_strict` instead of `markdown` during pre-processing... will try this out
https://github.com/ajrbyers/pandoc_plugin/blob/a37bbd0160a2526aaeaa1b2a9a467b3ad2bda43d/views.py#L74
Nope, turns out using `commonmark` will strip out the img width/height attributes, but it also strips out footnote links 😬 I think this is probably best handled during the galley transformation, when you're rewriting the SRCs anyway
Now that we've yanked XML from the picture, one solution could be this:
Capture the output of a single pandoc subprocess as a string of HTML markup, use Python to strip both the troublesome `media/` prefix from each `src` attribute and the width/height attributes, and then write the final HTML to the expected filepath. This has the added bonus of no longer writing an intermediate markdown file, which is what the process currently does.
I can do a PR for this.
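A minimal sketch of that approach, under stated assumptions: the function names `clean_img_tags` and `docx_to_html` are hypothetical, pandoc is assumed to be on the `PATH`, and the regular expressions only target simple `<img>` tags with quoted attribute values. Regex-based HTML editing is fragile, so treat this as an illustration rather than the plugin's actual code.

```python
import re
import subprocess
from pathlib import Path

# Any <img ...> tag (assumes no literal ">" inside attribute values).
_IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# A quoted width="..." or height="..." attribute, with its leading whitespace.
_SIZE_ATTR = re.compile(
    r"""\s+(?:width|height)\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE
)
# The start of a src attribute whose value begins with "media/".
_MEDIA_SRC = re.compile(r"""(\bsrc\s*=\s*["'])media/""", re.IGNORECASE)


def clean_img_tags(html: str) -> str:
    """Drop width/height attributes and the media/ src prefix from <img> tags."""

    def _fix(match: re.Match) -> str:
        tag = _SIZE_ATTR.sub("", match.group(0))
        return _MEDIA_SRC.sub(r"\1", tag)

    return _IMG_TAG.sub(_fix, html)


def docx_to_html(docx_path: Path, html_path: Path) -> None:
    """Convert a docx to HTML in one pandoc call, with no intermediate file."""
    result = subprocess.run(
        ["pandoc", "--from", "docx", "--to", "html", str(docx_path)],
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pandoc fails
    )
    html_path.write_text(clean_img_tags(result.stdout), encoding="utf-8")
```

Only `<img>` tags are touched, so footnote links and other markup in the pandoc output pass through unchanged.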
By default, when converting from docx to HTML, Pandoc looks at the image size and DPI and creates `height` and `width` attributes on the resulting `<img>` tag.
We don't want Word formatting determining the display size of images on Janeway; that should be handled by the theming engine. We either need to:
- use `lxml`'s `Cleaner` class and whitelist only the attributes we want to pass
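To illustrate the attribute-whitelisting idea, here is a rough sketch that uses Python's standard-library `html.parser` as a stand-in rather than `lxml` itself (`lxml`'s `Cleaner` exposes `safe_attrs_only` and `safe_attrs` options for the same purpose). The class name, function name, and the contents of `ALLOWED_ATTRS` are assumptions chosen for the example; the real list would depend on what Janeway's galleys need.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: everything else (including width/height) is dropped.
ALLOWED_ATTRS = frozenset({"src", "alt", "href", "id", "class", "title"})


class AttributeWhitelister(HTMLParser):
    """Re-emit HTML, keeping only whitelisted attributes on every tag."""

    def __init__(self, allowed=ALLOWED_ATTRS):
        super().__init__(convert_charrefs=False)
        self.allowed = allowed
        self.parts = []

    def _render(self, tag, attrs, closing=""):
        kept = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in self.allowed
        )
        return f"<{tag}{kept}{closing}>"

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs, " /"))

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def whitelist_attributes(html, allowed=ALLOWED_ATTRS):
    """Return html with every non-whitelisted attribute removed."""
    parser = AttributeWhitelister(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Because `href`, `id`, and `class` are in the whitelist, footnote anchors keep the attributes they need, while `width` and `height` are dropped from every tag.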