scanny / python-pptx

Create Open XML PowerPoint documents in Python
MIT License

Microsoft office interprets python-pptx powerpoint files as corrupted #924

Open redsuperbat opened 8 months ago

redsuperbat commented 8 months ago

Creating a brand new presentation like this:

from pptx import Presentation
pp = Presentation()
pp.save("test.pptx")

This creates a test.pptx file which Microsoft Office can't open without repairing it. It seems like python-pptx is doing something to corrupt the PowerPoint file. Has anyone experienced the same problem?

oliverjohns commented 8 months ago

I tried recreating the problem and got the same issue. It says it's corrupted when opening it on Windows.

redsuperbat commented 8 months ago

Here is a diff between a bad PowerPoint file and a good one, where the good one is the file after it has been repaired by Keynote on Mac. The good one can be opened in Microsoft PowerPoint: https://gist.github.com/redsuperbat/8b2b43feb922d4e014abe6ca6501caad

redsuperbat commented 8 months ago

@scanny Do you have any idea what might be wrong? This seems like a high prio issue 😱

scanny commented 8 months ago

Hmm, nothing is jumping out at me from the diff, but there are quite a few things being added that don't seem necessary. I expect that's because you used Keynote.

The default "starting" presentation in python-pptx has zero slides, which Keynote might not like.

I'm not able to reproduce this on my machine (macOS BigSur running Python 3.9.17 and PowerPoint for Mac 2016 v16.16.27).

A couple things to try.

Let me know what you find out, I expect some further experiments like this will narrow things down.

oliverflyttsmart commented 8 months ago

@scanny I can add some more info on this problem from my side as well:

I can open the generated PowerPoint file everywhere except in the desktop PowerPoint 365 app on Windows. I tried two different Windows PCs, same issue. I also tried generating an empty pptx and one with some content in it. Neither works.

The error I get:

PowerPoint can attempt to repair the presentation. If you trust the source of this presentation, click Repair.

Thanks for helping out

redsuperbat commented 8 months ago

Hey @scanny. Thanks for answering.

I have tried both version 0.6.23 and 0.6.21. Both produced the same result. I have also tried using a template. The template produces the same result, i.e. you can't open it in Microsoft Office on Windows. I have also tried adding a slide, which also produces the same result.

Here is a diff of a powerpoint with around 10 slides which used a template using version 0.6.21: https://gist.github.com/redsuperbat/edafeaecd66d469bd20b7c7585046546

The diff compares two files: one which was repaired by Keynote on Mac and can be opened, and one which was produced by python-pptx.

scanny commented 8 months ago

@redsuperbat we need as small a diff as possible to start with because you're going to need to modify the XML with a bisection strategy to identify the particular XML that is a problem. So zero slides is what we're after. I'm not recognizing any smoking gun from the diffs so we're going to need to narrow it down methodically and that goes fastest with the smallest example of "this one works" and "this one doesn't".

Good to know on the versions, that rules out any new work in the last couple years.

It's odd that opening a working file and saving it "breaks" the file, if that's what you've shown. If that's the case, it could be quicker to use the working file as the baseline and the one saved by python-pptx as the delta. Maybe a diff of those would be a good next step, but again, we're looking for the smallest diff possible, so reduce the content to the minimum that still shows the problem.
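One way to produce such a diff, sketched with only the Python standard library. The function name `diff_packages` and the crude tag-per-line splitting are assumptions made for illustration; a dedicated package-diff tool will give cleaner output.

```python
import difflib
import zipfile


def diff_packages(path_a, path_b):
    """Yield diff lines between two .pptx packages (which are zip archives)."""
    with zipfile.ZipFile(path_a) as za, zipfile.ZipFile(path_b) as zb:
        names_a, names_b = set(za.namelist()), set(zb.namelist())
        # Members present in only one of the two packages.
        for name in sorted(names_a - names_b):
            yield f"only in {path_a}: {name}"
        for name in sorted(names_b - names_a):
            yield f"only in {path_b}: {name}"
        # Text diff of XML members present in both packages.
        for name in sorted(names_a & names_b):
            if not name.endswith((".xml", ".rels")):
                continue  # skip binary parts such as images
            # Package XML is often a single long line; put one tag per line
            # so the diff is readable. This is a crude split, not a real
            # pretty-printer.
            a = za.read(name).decode("utf-8", errors="replace")
            b = zb.read(name).decode("utf-8", errors="replace")
            a_lines = a.replace("><", ">\n<").splitlines()
            b_lines = b.replace("><", ">\n<").splitlines()
            yield from difflib.unified_diff(
                a_lines, b_lines, f"a/{name}", f"b/{name}", lineterm=""
            )
```

Usage would be something like `print("\n".join(diff_packages("good.pptx", "bad.pptx")))`, where both file names are placeholders.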

What version of Microsoft Office are you using and on what version of Windows?

redsuperbat commented 8 months ago

@scanny I will get you a diff, in the meantime here is the event emitted by microsoft office 365:

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
  <Provider Name="Microsoft Office 16 Alerts" />
  <EventID Qualifiers="0">300</EventID>
  <Version>0</Version>
  <Level>4</Level>
  <Task>0</Task>
  <Opcode>0</Opcode>
  <Keywords>0x80000000000000</Keywords>
  <TimeCreated SystemTime="2023-11-04T18:41:49.9665651Z" />
  <EventRecordID>11544</EventRecordID>
  <Correlation />
  <Execution ProcessID="0" ThreadID="0" />
  <Channel>OAlerts</Channel>
  <Computer>LAPTOP-P5L015SI</Computer>
  <Security />
</System>
<EventData>
  <Data>Microsoft PowerPoint</Data>
  <Data>PowerPoint found a problem with content in C:\Users\User\AppData\Local\Microsoft\Windows\INetCache\Content.Outlook\OA1X2GFI\bad.pptx. PowerPoint can attempt to repair the presentation. If you trust the source of this presentation, click Repair.</Data>
  <Data>400762</Data>
  <Data>16.0.16924.20124</Data>
  <Data />
  <Data>0x80070570</Data>
</EventData>
</Event>

redsuperbat commented 8 months ago

@scanny

Here is the diff, seems like it has something to do with python-pptx stripping content types.

--- default/[Content_Types].xml
+++ bad-default/[Content_Types].xml
@@ -1,15 +1,7 @@
 <?xml version='1.0' encoding='UTF-8' standalone='yes'?>
 <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
-  <Default Extension="bmp" ContentType="image/bmp"/>
-  <Default Extension="gif" ContentType="image/gif"/>
-  <Default Extension="jpeg" ContentType="image/jpg"/>
-  <Default Extension="mov" ContentType="application/movie"/>
-  <Default Extension="pdf" ContentType="application/pdf"/>
   <Default Extension="png" ContentType="image/png"/>
   <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
-  <Default Extension="tif" ContentType="image/tif"/>
-  <Default Extension="vml" ContentType="application/vnd.openxmlformats-officedocument.vmlDrawing"/>
-  <Default Extension="xlsx" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
   <Default Extension="xml" ContentType="application/xml"/>
   <Override PartName="/docProps/app.xml" ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
   <Override PartName="/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>

scanny commented 8 months ago

Hmm, interesting. That's easy enough to check. Let me have a look ... ah, yes. The default presentation only declares four default extensions:

  <Default Extension="xml" ContentType="application/xml"/>
  <Default Extension="jpeg" ContentType="image/jpeg"/>
  <Default Extension="bin" ContentType="application/vnd.openxmlformats-officedocument.presentationml.printerSettings"/>
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>

But interestingly "bin" does not appear in the diff and "jpeg" gets a slightly different ContentType value ("image/jpg" instead of "image/jpeg").
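A quick way to check which defaults a given file actually declares is to read `[Content_Types].xml` straight out of the zip archive. The sketch below uses only the standard library; `default_content_types` and `compare_defaults` are illustrative names, not part of python-pptx or opc-diag.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the content-types part, as it appears in the diff above.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def default_content_types(pptx_path):
    """Return {extension: content_type} for each Default element."""
    with zipfile.ZipFile(pptx_path) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    return {
        elem.get("Extension"): elem.get("ContentType")
        for elem in root.findall(f"{{{CT_NS}}}Default")
    }


def compare_defaults(path_a, path_b):
    """Return {extension: (type_in_a, type_in_b)} where the two differ.

    A missing declaration is reported as None.
    """
    a = default_content_types(path_a)
    b = default_content_types(path_b)
    return {
        ext: (a.get(ext), b.get(ext))
        for ext in sorted(set(a) | set(b))
        if a.get(ext) != b.get(ext)
    }
```

Running `compare_defaults("default.pptx", "bad-default.pptx")` would list every extension declared in only one file or declared with a different ContentType in each.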

Can you describe the exact experiment you conducted to produce these results? We need to be like scientists here and maintain rigor in our experimental findings if we want to get this right the first time. In particular describe how you produced default and bad-default.

On the event, I don't see anything specific there other than it gives me the slightest whiff that it might be security-related.

Also you're going to need some way to modify the XML directly so you can see what loads without error and what doesn't. Visual Studio Code does that automatically, which is a great option. Neovim does that slightly more faithfully it appears, which is what I use. Also opc-diag is an option but I think you'll need to install that from the develop branch on GitHub here to get it to work with Python 3: https://github.com/python-openxml/opc-diag/commits/develop

redsuperbat commented 8 months ago

Thanks for looking into it @scanny. Here are the things I did to create the good and faulty presentations:

  1. Create a brand new presentation in Keynote, adding default layouts and some fonts. Export the presentation as a PowerPoint, saving it as default.pptx
  2. Run this code against the default.pptx template:

    from pptx import Presentation

    pp = Presentation("./templates/default.pptx")
    pp.save("bad-default.pptx")

  3. Create the diff with `opc-diag` by running `opc diff default.pptx bad-default.pptx > diff.txt`

> Also you're going to need some way to modify the XML directly so you can see what loads without error and what doesn't. Visual Studio Code does that automatically, which is a great option.

I use VS Code, how would I modify the file? VS Code seems to interpret it as a binary file, and using OpenXML Explorer does not let me access the content-types file.

Thanks for your help!

scanny commented 8 months ago

Okay, I have a strong suspicion that the bin ... printerSettings extension and probably the ppt/printerSettings/printerSettings1.bin member in the default .pptx file is a likely cause of this.

Here's the experimental procedure:

  1. Using your XML editing capability, whatever you've landed on, remove just the:

    <Default Extension="bin" ContentType="application/vnd.openxmlformats-officedocument.presentationml.printerSettings"/>

    line from the [Content_Types].xml on bad-default (work on a copy of course).

  2. Remove this line from ppt/_rels/presentation.xml.rels in bad-default:

    <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/printerSettings" Target="printerSettings/printerSettings1.bin"/>

  3. Remove the ppt/printerSettings/printerSettings1.bin member from the bad-default.pptx zip archive.

There is some evidence on Google search that a problematic printerSettings.bin can cause a repair-error on open, I expect because someone found an exploit for that binary file.

scanny commented 8 months ago

Looks like VSCode is not a great option, my mistake. I would go with opc-diag for the time being.

pip install git+https://github.com/python-openxml/opc-diag.git@develop

Docs are here: https://opc-diag.readthedocs.io/en/latest/

First move would be to extract so you can inspect and edit the XML:

$ opc extract bad-default.pptx bad-default

Then just make the edits I mentioned above and repackage with:

$ opc repackage bad-default new-bad-default.pptx
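If opc-diag is not available, roughly the same round trip can be sketched with the standard library, since a .pptx file is a zip archive. `extract_package` and `repackage` are hypothetical helper names, and writing `[Content_Types].xml` as the first member is a precaution assumed here rather than a confirmed requirement. Unlike a dedicated tool, this sketch does not reformat the XML it extracts.

```python
import zipfile
from pathlib import Path


def extract_package(pptx_path, dest_dir):
    """Unzip a .pptx package into dest_dir so its XML parts can be edited."""
    with zipfile.ZipFile(pptx_path) as zf:
        zf.extractall(dest_dir)


def repackage(src_dir, pptx_path):
    """Zip the contents of src_dir back into a .pptx package."""
    src = Path(src_dir)
    members = sorted(p for p in src.rglob("*") if p.is_file())
    # Stable sort that moves [Content_Types].xml to the front.
    members.sort(key=lambda p: p.name != "[Content_Types].xml")
    with zipfile.ZipFile(pptx_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in members:
            # Member names inside the archive use forward slashes.
            zf.write(path, path.relative_to(src).as_posix())
```

For example, `extract_package("bad-default.pptx", "bad-default")`, edit the XML, then `repackage("bad-default", "new-bad-default.pptx")`. Write the new file outside the extracted directory so it does not end up inside itself.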

scanny commented 8 months ago

Hmm, k, just reading your experimental method there, it looks like the printerSettings is not the culprit.

Do you have Windows PowerPoint available or only Keynote?

scanny commented 8 months ago

python-pptx is only going to place content-types that actually occur in the presentation, whereas it looks like Keynote just adds in a large block of standard defaults.

I think it's time for a bisection approach:

  1. Add all these missing Default extension elements into the [Content_Types].xml of bad-default and see if that fixes the problem. That should give us a baseline and confirm that you're on the right track.
  2. If that works, only put in the first four (bisect) and see if that works. If it does only put in the first two. If it doesn't take out those four and put in the last four, etc. until you've narrowed it down to exactly which one, if any, is required. Could be two or more of course, but you'll get there in the quickest way by bisecting (binary search basically).
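The bisection above can be sketched in Python. Both function names are hypothetical, and `opens_cleanly` stands in for the manual step of opening a variant in PowerPoint on Windows and reporting whether it loaded without the repair prompt. The search as written assumes exactly one of the candidates is required; if two or more are needed together it will not find them.

```python
import xml.etree.ElementTree as ET
import zipfile

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
CT_PART = "[Content_Types].xml"


def write_variant(src, dst, defaults):
    """Copy package src to dst, adding a Default element per entry.

    defaults maps extension to content type, e.g. {"gif": "image/gif"}.
    """
    ET.register_namespace("", CT_NS)  # serialize without an ns0: prefix
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == CT_PART:
                root = ET.fromstring(data)
                for i, (ext, ctype) in enumerate(defaults.items()):
                    attrs = {"Extension": ext, "ContentType": ctype}
                    root.insert(i, ET.Element(f"{{{CT_NS}}}Default", attrs))
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            zout.writestr(info.filename, data)


def bisect_required(candidates, opens_cleanly):
    """Binary-search candidates for the single required item.

    opens_cleanly(subset) must return True when a variant containing only
    that subset loads without error.
    """
    remaining = list(candidates)
    while len(remaining) > 1:
        half = remaining[: len(remaining) // 2]
        # If the file opens with this half, the required item is in it;
        # otherwise it must be in the other half.
        remaining = half if opens_cleanly(half) else remaining[len(half):]
    return remaining[0]
```

In practice `opens_cleanly` would call `write_variant` for the given subset and then ask the person running the experiment for a yes or no.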

redsuperbat commented 8 months ago

@scanny I did what you said and was able to open it without repairing it on Windows. We are getting closer, I'm going to bisect and see which default extension is mandatory.

redsuperbat commented 8 months ago

Ok @scanny
I have narrowed it down to this Default:

<Default Extension="jpeg" ContentType="image/jpg"/>

My guess is that Microsoft Office 365 can't handle Extension=jpg and needs Extension=jpeg.

redsuperbat commented 8 months ago

@scanny Scratch that, seems like there is something else which is not working properly here.

scanny commented 8 months ago

Yeah, "image/jpg" does not appear anywhere in the python-pptx code, so that wouldn't be it I don't think. We use image/jpeg for all extensions in ("jpe", "jpeg", "jpg").
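As an illustration of that mapping only (this is not python-pptx's actual code, and the names and the non-JPEG entries are assumptions made for the example):

```python
# All three JPEG spellings share one content type.
JPEG_EXTENSIONS = ("jpe", "jpeg", "jpg")

# A few other common image types, included only to round out the example.
OTHER_IMAGE_TYPES = {
    "bmp": "image/bmp",
    "gif": "image/gif",
    "png": "image/png",
}


def image_content_type(ext):
    """Return the content type for an image file extension."""
    ext = ext.lower().lstrip(".")
    if ext in JPEG_EXTENSIONS:
        return "image/jpeg"
    return OTHER_IMAGE_TYPES[ext]  # raises KeyError for unknown extensions
```

Under a mapping like this, "image/jpg" can never be produced, whichever spelling the file extension uses.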

scanny commented 8 months ago

Your opc-diag error is coming from not installing from the develop branch. Use the pip command I mentioned.