mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Can convert_to_html save images in a separate dir? #10

Closed: matt-erhart closed this issue 8 years ago

matt-erhart commented 8 years ago

The CLI can do this, but I don't see an equivalent option when calling the library. What do you recommend if there isn't one?

mwilliamson commented 8 years ago

Try setting the convert_image argument when calling convert_to_html.

For instance, the CLI passes the following as convert_image:

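import os
import shutil

import mammoth


class ImageWriter(object):
    def __init__(self, output_dir):
        self._output_dir = output_dir
        self._image_number = 1

    def __call__(self, element):
        # Derive the file extension from the MIME type, e.g. "image/png" -> "png".
        extension = element.content_type.partition("/")[2]
        image_filename = "{0}.{1}".format(self._image_number, extension)
        # Copy the image bytes out of the document into the output directory.
        with open(os.path.join(self._output_dir, image_filename), "wb") as image_dest:
            with element.open() as image_source:
                shutil.copyfileobj(image_source, image_dest)

        self._image_number += 1

        # The returned dict becomes the attributes of the generated <img> element.
        return {"src": image_filename}


# In the CLI, args holds the parsed command-line arguments.
convert_image = mammoth.images.inline(ImageWriter(args.output_dir))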
matt-erhart commented 8 years ago

Hmmm, where is ImageWriter and how should I incorporate it? Can I import it first? Is there a little code snippet that would demonstrate how to do this?

import mammoth  # and maybe from mammoth import ...
convert_image = mammoth.images.inline(ImageWriter(outdir))
result = mammoth.convert_to_html(docx_file, convert_image=convert_image)
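
For anyone reading along, here is a minimal sketch of what using this from library code might look like. ImageWriter is not imported from mammoth; the class is simply copied from the comment above into your own code. The convert_docx helper and its parameter names are hypothetical, and the sketch assumes that mammoth.images.inline is available in the installed version and that the result object exposes value and messages.

```python
import os
import shutil


class ImageWriter(object):
    """Copies each image into output_dir and returns the <img> attributes."""

    def __init__(self, output_dir):
        self._output_dir = output_dir
        self._image_number = 1

    def __call__(self, element):
        # Derive the file extension from the MIME type, e.g. "image/png" -> "png".
        extension = element.content_type.partition("/")[2]
        image_filename = "{0}.{1}".format(self._image_number, extension)
        image_path = os.path.join(self._output_dir, image_filename)
        with open(image_path, "wb") as image_dest:
            with element.open() as image_source:
                shutil.copyfileobj(image_source, image_dest)
        self._image_number += 1
        return {"src": image_filename}


def convert_docx(docx_path, output_dir):
    """Hypothetical helper: convert a .docx file, saving images to output_dir."""
    # Third-party dependency, imported here so that ImageWriter itself
    # depends only on the standard library.
    import mammoth

    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    # Assumption: mammoth.images.inline exists in the installed version,
    # as used in the comment above.
    convert_image = mammoth.images.inline(ImageWriter(output_dir))
    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(docx_file, convert_image=convert_image)
    return result.value, result.messages
```

A call might look like html, messages = convert_docx("document.docx", "output"), where both paths are placeholders. Because the src values are bare filenames, the HTML probably needs to be saved in the same directory as the images for the links to resolve.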
matt-erhart commented 8 years ago

I've got it now. I just copied and pasted the class into my code. For anyone else who reads this, I had to do the following to save the HTML without errors:

html2write = u''.join(html).encode('utf-8').strip()
# Open in binary mode, since html2write holds encoded bytes.
with open("output.html", "wb") as text_file:
    text_file.write(html2write)

You will also need to add this to the HTML file:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
mwilliamson commented 8 years ago

Glad you got it working. I think the writing can be simplified by opening the file with an encoding set, so that Unicode strings can be written directly:

import codecs

with codecs.open("output.html", encoding="utf-8", mode="w") as text_file:
    text_file.write(html)
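
The last two comments can be combined: the HTML that convert_to_html produces is a fragment, so the charset declaration has to be wrapped around it before the file is written as UTF-8. Below is a rough sketch of one way to do that. The write_html_page function and PAGE_TEMPLATE are hypothetical names, and the sketch uses io.open in place of codecs.open, which should behave the same way here on both Python 2 and 3.

```python
import io

# Minimal page skeleton carrying the charset declaration from the thread.
PAGE_TEMPLATE = u"""<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
{body}
</body>
</html>
"""


def write_html_page(fragment, path):
    """Hypothetical helper: wrap an HTML fragment in a page and save it as UTF-8."""
    page = PAGE_TEMPLATE.format(body=fragment)
    # Opening with an explicit encoding lets Unicode text be written directly.
    with io.open(path, "w", encoding="utf-8") as html_file:
        html_file.write(page)
```

With the html variable from the snippet above, the call would be write_html_page(html, "output.html").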