mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Question: How to convert embedded x-emf images? #41

Closed bitscompagnie closed 7 years ago

bitscompagnie commented 7 years ago

Hello,

How can we convert embedded .x-emf images to png or jpg? Is there any option/setting to output the embedded images to png or jpg instead of .x-emf?

Currently when I convert docx files, I get some images in the output-dir with .x-emf format and would need to convert them to png or jpg during docx conversion process.

Thanks for your help.

mwilliamson commented 7 years ago

I'm not aware of a good way (in Python) to convert WMF/EMF images, so Mammoth doesn't have an officially supported way. However, there are some undocumented image converters that use LibreOffice and ImageMagick to do so, which can be used like so:

import mammoth

def compose(f, g):
    def composed(*args, **kwargs):
        return f(g(*args, **kwargs))

    return composed

with open("document.docx", "rb") as fileobj:
    result = mammoth.convert_to_html(
        fileobj,
        convert_image=compose(
            mammoth.images.data_uri,
            mammoth.images.libreoffice_wmf_conversion(post_process=mammoth.images.imagemagick_trim),
        ),
    )

If those work for you, I'd suggest copying the source for them, since they're not officially supported and may be changed or removed without warning.
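
On the argument order of compose: the second function runs first, and its result is passed to the first. A minimal sketch with stand-in functions (to_png and to_data_uri are hypothetical placeholders used only to show the order, not Mammoth APIs):

```python
def compose(f, g):
    """Return a function that applies g first, then f to g's result."""
    def composed(*args, **kwargs):
        return f(g(*args, **kwargs))

    return composed


# Hypothetical stand-ins for the two Mammoth converters, used only to
# show the order in which the composed steps run.
def to_png(image):
    return image + " -> png"


def to_data_uri(image):
    return image + " -> data uri"


convert = compose(to_data_uri, to_png)
print(convert("emf"))  # emf -> png -> data uri
```

So in the call above, the WMF/EMF conversion would run first and data_uri would then receive the converted image.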

bitscompagnie commented 7 years ago

Do I have to install particular versions of LibreOffice and ImageMagick to get the above code to work? I can test it on Mac or Windows. Your help/suggestion is really appreciated.

Thanks.

mwilliamson commented 7 years ago

I've no idea, it's just something I cobbled together that worked on my own Linux box.

bitscompagnie commented 7 years ago

I got it working by following:

  1. Installed LibreOffice 4.3.72;

  2. Installed unoconv (universal office conversion utility), which works with the installed LibreOffice <= 4.3. On the Mac I did: brew install unoconv. Then from the terminal I ran: unoconv --listener. Unoconv complained about not finding a suitable LibreOffice installation when I used the most current version of LibreOffice.

  3. Additionally, to make it work with your code, I created a script as outlined here: create a shell script at /usr/local/bin/soffice with the following content:

#!/bin/bash
# Need to do this because symlink won't work
# It complains about some .plist files
/Applications/LibreOffice.app/Contents/MacOS/soffice "$@"

Then make it executable:

sudo chmod +x /usr/local/bin/soffice

I did the above because I noticed that you were running libreoffice in headless mode in the /Library/Python/2.7/site-packages/mammoth/images.py file, under the libreoffice_wmf_conversion definition: ….

output_path = os.path.join(temporary_directory, "image.png")
subprocess.check_call([
    "libreoffice",
    "--headless",
    "--convert-to",
    "png",
    input_path,
    "--outdir",
    temporary_directory,
])

Once I completed the above steps it started working; otherwise it complains about a missing file. I'm posting this for any other user who might have the same question.
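
A quick way to check for the "missing file" case before running the conversion is to confirm that the external programs can be found on the PATH. This is only a sketch: "libreoffice" is the command in the snippet above, "convert" is an assumption about the ImageMagick command that imagemagick_trim runs, and find_missing_tools is a made-up helper name. It relies on shutil.which, so it needs Python 3.3 or later.

```python
import shutil


def find_missing_tools(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# "libreoffice" is the command in the snippet above; "convert" is an
# assumption about the ImageMagick command used for trimming.
missing = find_missing_tools(["libreoffice", "convert"])
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All external tools found")
```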

Thanks again.

bitscompagnie commented 7 years ago

How can I save the converted images to files instead of embedding them in the HTML? I tried to include the --output-dir option, as we do when running mammoth from the command line directly, but it did not work.

mwilliamson commented 7 years ago

You need to define your own image converter that will save images to disk rather than using mammoth.images.data_uri. For instance, you can see how the CLI sets the convert_image argument:

https://github.com/mwilliamson/python-mammoth/blob/500a2aca545c47b9677bd85e55b9b24dc4ec9c7c/mammoth/cli.py#L25
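
As a rough sketch of such a converter (not the CLI's actual code): a callable that copies each image into a directory and returns the src attribute for the img tag. It assumes the image object passed in has a content_type attribute and an open() method; the class name SaveImagesToDir and the numbered file naming are made up for illustration.

```python
import os
import shutil


class SaveImagesToDir(object):
    """Hypothetical image handler: writes each image to output_dir."""

    def __init__(self, output_dir):
        self._output_dir = output_dir
        self._image_number = 1

    def __call__(self, image):
        # "image/png" -> "png"; used as the file extension.
        extension = image.content_type.partition("/")[2]
        filename = "{0}.{1}".format(self._image_number, extension)
        self._image_number += 1

        # Copy the image bytes out of the document and onto disk.
        with image.open() as source:
            with open(os.path.join(self._output_dir, filename), "wb") as dest:
                shutil.copyfileobj(source, dest)

        # Attributes for the generated <img> element.
        return {"src": filename}
```

It would then be passed as convert_image=mammoth.images.img_element(SaveImagesToDir("output")) in place of mammoth.images.data_uri, with the output directory created beforehand.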

bitscompagnie commented 7 years ago

Thanks,

Here is my final working code with inspiration from issue #10:

# compose, ImageWriter, style_map, sourcedir and outdir are defined elsewhere
for file in sourcedir:
    # Filter source documents to exclude temporary Word files
    if file.endswith('.docx') and not file.startswith('~$'):
        # This works fine on Mac OS but not Windows, need to fix it
        with open('sourcedocs/' + file, 'rb') as sourcedocx:
            result = mammoth.convert_to_html(
                sourcedocx,
                style_map=style_map,
                convert_image=compose(
                    # To save images to a directory
                    mammoth.images.inline(ImageWriter(outdir)),
                    # Convert emf/wmf with LibreOffice, Unoconv and ImageMagick
                    mammoth.images.libreoffice_wmf_conversion(post_process=mammoth.images.imagemagick_trim),
                ),
            )
        html = result.value
        # Write the result for each file to a new file in the output directory
        with codecs.open('outpudir/' + file + '.html', 'w', 'utf-8') as f:
            f.write(html)
print('Done writing html files with python-mammoth')
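
The loop above does not show how sourcedir is built. One way it might be done, as a sketch only: list the folder, apply the same filter, and pair each document with an output path. The helper name docx_jobs and the choice to name the output "name.html" rather than "name.docx.html" are assumptions, not part of the code above.

```python
import os


def docx_jobs(source_dir, output_dir):
    """Yield (input_path, output_path) for each real .docx in source_dir."""
    for name in sorted(os.listdir(source_dir)):
        # Skip non-.docx files and Word's temporary "~$" lock files.
        if not name.endswith(".docx") or name.startswith("~$"):
            continue
        stem = os.path.splitext(name)[0]
        yield (
            os.path.join(source_dir, name),
            os.path.join(output_dir, stem + ".html"),
        )
```

Each pair could then drive the loop: open the first path in binary mode for mammoth.convert_to_html and write result.value to the second.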