openlibhums / pandoc_plugin

Plugin for janeway for automatic galley generation
GNU Affero General Public License v3.0

Create HTML without figure width/height attributes #9

Closed: mdlincoln closed this issue 5 years ago

mdlincoln commented 5 years ago

By default when converting from docx to HTML, Pandoc looks at the image size and DPI and creates height and width attributes on the resulting <img> tag like so:

<img src="media/image1.jpeg" width="276" height="345" />

We don't want Word formatting determining the display size of images on Janeway; that should be handled by the theming engine. We need to do one of the following:

  1. Find a pandoc setting that will emit HTML without guessing at figure width/height
  2. Use beautifulsoup to strip the attributes from the HTML after it has been converted by Pandoc
  3. Use lxml's Cleaner class and whitelist only the attributes we want to pass
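
As a rough illustration of the post-processing idea in option 2, here is a minimal sketch that removes the width/height attributes after conversion. It deliberately swaps in a standard-library regex for BeautifulSoup so that it runs without extra dependencies; a real HTML parser would be more robust against unusual markup. The names `strip_img_dimensions`, `IMG_TAG` and `SIZE_ATTR` are hypothetical and not part of the plugin.

```python
import re

# Assumption: Pandoc emits well-formed <img ...> tags with quoted attribute values.
# Matches a whole <img ...> tag so that only image attributes are touched.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Matches one width="..." or height="..." attribute plus its leading whitespace.
SIZE_ATTR = re.compile(
    r"""\s+(?:width|height)\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE
)


def strip_img_dimensions(html):
    """Return html with width/height attributes removed from every <img> tag."""
    return IMG_TAG.sub(lambda m: SIZE_ATTR.sub("", m.group(0)), html)


print(strip_img_dimensions('<img src="media/image1.jpeg" width="276" height="345" />'))
# prints: <img src="media/image1.jpeg" />
```

Restricting the substitution to `<img>` tags leaves width/height attributes on other elements (tables, for example) untouched.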

mdlincoln commented 5 years ago

I may be able to address this by passing the docx through markdown_strict instead of markdown during pre-processing. Will try this out.

https://github.com/ajrbyers/pandoc_plugin/blob/a37bbd0160a2526aaeaa1b2a9a467b3ad2bda43d/views.py#L74

mdlincoln commented 5 years ago

Nope. It turns out that using commonmark strips the img width/height attributes, but it also strips the footnote links 😬 I think this is probably best handled during the galley transformation, when you're rewriting the src attributes anyway.

mdlincoln commented 5 years ago

Now that we've yanked XML from the picture, one solution could be this:

Don't write out an intermediate markdown file, which is what the process currently does. Instead, capture the output of a single pandoc subprocess as a string of HTML markup, use Python to both strip the troublesome media/ prefix from each src attribute and remove the width/height attributes, and then write the final HTML to the expected filepath.
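
A minimal sketch of that flow, assuming `pandoc` is on the PATH and emits well-formed `<img>` tags with quoted attributes. The function names (`clean_html`, `docx_to_html`) and the regex-based cleanup are illustrative assumptions, not the plugin's actual code; an HTML parser could replace the regexes.

```python
import re
import subprocess
from pathlib import Path

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
SIZE_ATTR = re.compile(
    r"""\s+(?:width|height)\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE
)
# Captures the src=" opener so only the media/ prefix after it is dropped.
MEDIA_SRC = re.compile(r"""(\bsrc\s*=\s*["'])media/""", re.IGNORECASE)


def _clean_img_tag(match):
    """Remove width/height and the media/ src prefix from one <img> tag."""
    tag = SIZE_ATTR.sub("", match.group(0))
    return MEDIA_SRC.sub(r"\1", tag)


def clean_html(html):
    """Apply the <img> cleanup to every image tag in an HTML string."""
    return IMG_TAG.sub(_clean_img_tag, html)


def docx_to_html(docx_path, html_path):
    """Convert docx to HTML in one pandoc call, with no intermediate file."""
    result = subprocess.run(
        ["pandoc", "--from", "docx", "--to", "html", str(docx_path)],
        capture_output=True,  # keep the HTML in memory instead of on disk
        text=True,
        check=True,  # raise CalledProcessError if pandoc fails
    )
    Path(html_path).write_text(clean_html(result.stdout), encoding="utf-8")
```

Calling `docx_to_html("article.docx", "article.html")` (hypothetical paths) would then write the cleaned HTML directly to its final location.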

I can do a PR for this.