Open redsuperbat opened 8 months ago
I tried recreating the problem and got the same issue. It says it's corrupted when opening it on Windows.
Here is a diff between a good and a bad PowerPoint file, where the good one is the bad one after it has been repaired by Keynote on Mac. The good one can be opened in Microsoft PowerPoint: https://gist.github.com/redsuperbat/8b2b43feb922d4e014abe6ca6501caad
@scanny Do you have any idea what might be wrong? This seems like a high prio issue 😱
Hmm, nothing is jumping out at me from the diff, but there are quite a few things being added that don't seem necessary. I expect that's because you used Keynote.
The default "starting" presentation in python-pptx has zero slides, which Keynote might not like.
I'm not able to reproduce this on my machine (macOS Big Sur running Python 3.9.17 and PowerPoint for Mac 2016 v16.16.27).
A couple of things to try:
Check which python-pptx version you are using. Try with the latest (0.6.23) and also back up a couple of versions, say to 0.6.21, and see if the behavior is the same or seems to have been recently introduced.
Let me know what you find out. I expect some further experiments like this will narrow things down.
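A quick way to confirm which version is actually installed in the environment that generates the file is sketched below. It uses only the standard library; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(dist_name="python-pptx"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# To switch versions between experiments, pin one explicitly, e.g.:
#   pip install python-pptx==0.6.21
print(installed_version())
```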
@scanny I can add some more info on this problem from my side as well:
I can open the generated PowerPoint file everywhere except in the desktop 365 PowerPoint app on Windows. I tried two different Windows PCs, same issue. I also tried generating an empty pptx and one with some content in it. Neither works.
The error I get:
PowerPoint can attempt to repair the presentation. If you trust the source of this presentation, click Repair.
Thanks for helping out
Hey @scanny. Thanks for answering.
I have tried both version 0.6.23 and 0.6.21. Both produced the same result. I have also tried using a template. The template produces the same result, i.e. you can't open it in Microsoft Office on Windows. I have also tried adding a slide, which also produces the same result.
Here is a diff of a PowerPoint file with around 10 slides, built from a template using version 0.6.21: https://gist.github.com/redsuperbat/edafeaecd66d469bd20b7c7585046546
It compares one file that was fixed by Keynote on Mac and can be opened with one that was produced by python-pptx.
@redsuperbat we need as small a diff as possible to start with because you're going to need to modify the XML with a bisection strategy to identify the particular XML that is a problem. So zero slides is what we're after. I'm not recognizing any smoking gun from the diffs so we're going to need to narrow it down methodically and that goes fastest with the smallest example of "this one works" and "this one doesn't".
Good to know on the versions, that rules out any new work in the last couple years.
It's odd that opening a working file and saving it "breaks" the file, if that's what you've shown. If that's the case, it could be quicker to use the working file as the baseline and the one saved by python-pptx as the delta. Maybe a diff of those would be a good next step, but again, we're looking for the smallest diff possible, so reduce the content to the minimum that still shows the problem.
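The bisection strategy mentioned above can be sketched roughly as follows. This is a generic illustration, not part of python-pptx or opc-diag: `bisect_culprit` and the `triggers` callback are hypothetical names, and in practice `triggers` means "apply this subset of differences, repackage, open the file in PowerPoint by hand, and report what happened". It also assumes exactly one difference is responsible.

```python
def bisect_culprit(candidates, triggers):
    """Find the single candidate responsible for a change in behavior.

    `triggers(subset)` must return True when the subset contains the culprit.
    Assumes exactly one culprit exists among `candidates`.
    """
    items = list(candidates)
    if not items or not triggers(items):
        raise ValueError("the full candidate set does not show the behavior")
    while len(items) > 1:
        middle = len(items) // 2
        first_half = items[:middle]
        # Keep whichever half still shows the behavior.
        items = first_half if triggers(first_half) else items[middle:]
    return items[0]


# Demonstration with an arbitrary stand-in for the manual check.
differences = ["bmp", "gif", "jpeg", "mov", "pdf", "tif", "vml", "xlsx"]
print(bisect_culprit(differences, lambda subset: "tif" in subset))
```

Each round halves the number of files that have to be opened by hand, which is why starting from the smallest possible diff matters.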
What version of Microsoft Office are you using and on what version of Windows?
@scanny I will get you a diff. In the meantime, here is the event emitted by Microsoft Office 365:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft Office 16 Alerts" />
<EventID Qualifiers="0">300</EventID>
<Version>0</Version>
<Level>4</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2023-11-04T18:41:49.9665651Z" />
<EventRecordID>11544</EventRecordID>
<Correlation />
<Execution ProcessID="0" ThreadID="0" />
<Channel>OAlerts</Channel>
<Computer>LAPTOP-P5L015SI</Computer>
<Security />
</System>
<EventData>
<Data>Microsoft PowerPoint</Data>
<Data>PowerPoint found a problem with content in C:\Users\User\AppData\Local\Microsoft\Windows\INetCache\Content.Outlook\OA1X2GFI\bad.pptx. PowerPoint can attempt to repair the presentation. If you trust the source of this presentation, click Repair.</Data>
<Data>400762</Data>
<Data>16.0.16924.20124</Data>
<Data />
<Data>0x80070570</Data>
</EventData>
</Event>
@scanny Here is the diff. It seems to have something to do with python-pptx stripping content types.
--- default/[Content_Types].xml
+++ bad-default/[Content_Types].xml
@@ -1,15 +1,7 @@
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
- <Default Extension="bmp" ContentType="image/bmp"/>
- <Default Extension="gif" ContentType="image/gif"/>
- <Default Extension="jpeg" ContentType="image/jpg"/>
- <Default Extension="mov" ContentType="application/movie"/>
- <Default Extension="pdf" ContentType="application/pdf"/>
<Default Extension="png" ContentType="image/png"/>
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
- <Default Extension="tif" ContentType="image/tif"/>
- <Default Extension="vml" ContentType="application/vnd.openxmlformats-officedocument.vmlDrawing"/>
- <Default Extension="xlsx" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/docProps/app.xml" ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
<Override PartName="/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
Hmm, interesting. That's easy enough to check. Let me have a look ... ah, yes. The default presentation only declares four default extensions:
<Default Extension="xml" ContentType="application/xml"/>
<Default Extension="jpeg" ContentType="image/jpeg"/>
<Default Extension="bin" ContentType="application/vnd.openxmlformats-officedocument.presentationml.printerSettings"/>
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
But interestingly, "bin" does not appear in the diff and "jpeg" gets a slightly different ContentType value ("image/jpg" instead of "image/jpeg").
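To check which default extensions a given package declares without unzipping it by hand, something like the following could be used. It is a sketch using only the standard library; the function names `default_content_types` and `diff_defaults` are invented for this example, and the file names in the usage note are the ones used in this thread.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of [Content_Types].xml, as seen in the diff above.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def default_content_types(pptx_path):
    """Return {extension: content_type} for the <Default> elements of a package."""
    with zipfile.ZipFile(pptx_path) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    return {
        el.get("Extension"): el.get("ContentType")
        for el in root.findall(f"{{{CT_NS}}}Default")
    }


def diff_defaults(good_path, bad_path):
    """Return (missing, changed): defaults absent from, or altered in, the bad file."""
    good = default_content_types(good_path)
    bad = default_content_types(bad_path)
    missing = {ext: ct for ext, ct in good.items() if ext not in bad}
    changed = {
        ext: (good[ext], bad[ext])
        for ext in good.keys() & bad.keys()
        if good[ext] != bad[ext]
    }
    return missing, changed
```

Calling `diff_defaults("default.pptx", "bad-default.pptx")` would then list only the `Default` differences, which is a narrower view than the full package diff.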
Can you describe the exact experiment you conducted to produce these results? We need to be like scientists here and maintain rigor in our experimental findings if we want to get this right the first time. In particular describe how you produced default and bad-default.
On the event, I don't see anything specific there other than it gives me the slightest whiff that it might be security-related.
Also you're going to need some way to modify the XML directly so you can see what loads without error and what doesn't. Visual Studio Code does that automatically, which is a great option. Neovim does that slightly more faithfully it appears, which is what I use. Also opc-diag is an option, but I think you'll need to install that from the develop branch on GitHub to get it to work with Python 3: https://github.com/python-openxml/opc-diag/commits/develop
Thanks for looking into it @scanny. Here are the things I did to create the good and faulty presentations:
default.pptx
default.pptx
template:
from pptx import Presentation
pp = Presentation("./templates/default.pptx")
pp.save("bad-default.pptx")
3. Create the diff with `opc-diag` by running `opc diff default.pptx bad-default.pptx>diff.txt`
> Also you're going to need some way to modify the XML directly so you can see what loads without error and what doesn't. Visual Studio Code does that automatically, which is a great option.
I use VS Code. How would I modify the file? VS Code seems to interpret it as a binary file, and OpenXML Explorer does not let me access the content-types file.
Thanks for your help!
Okay, I have a strong suspicion that the bin ... printerSettings default extension, and probably the ppt/printerSettings/printerSettings1.bin member in the default .pptx file, is the cause of this.
Here's the experimental procedure:
1. Using your XML editing capability, whatever you've landed on, remove just this line from the [Content_Types].xml of bad-default (work on a copy, of course):
<Default Extension="bin" ContentType="application/vnd.openxmlformats-officedocument.presentationml.printerSettings"/>
2. Remove this line from ppt/_rels/presentation.xml.rels in bad-default:
<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/printerSettings" Target="printerSettings/printerSettings1.bin"/>
3. Remove the ppt/printerSettings/printerSettings1.bin member from the bad-default.pptx zip archive.
There is some evidence on Google search that a problematic printerSettings.bin can cause a repair error on open, I expect because someone found an exploit for that binary file.
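The three removals above could also be scripted. The following is a sketch under a few assumptions: it uses only the standard library, the function names are invented here, the part names are the ones quoted in this thread, and it assumes no other `.bin` part in the package depends on the `bin` default. Re-serializing with ElementTree also rewrites the XML declaration, so the output is not byte-identical to a hand edit.

```python
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
PRINTER_REL_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/printerSettings"
)
PRINTER_PART = "ppt/printerSettings/printerSettings1.bin"


def _drop_children(xml_bytes, ns, should_drop):
    """Remove the root's direct children for which should_drop(el) is True."""
    ET.register_namespace("", ns)  # serialize without an ns0: prefix
    root = ET.fromstring(xml_bytes)
    for child in list(root):
        if should_drop(child):
            root.remove(child)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


def strip_printer_settings(src_path, dst_path):
    """Write a copy of src_path without the printer-settings part and its references."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(
        dst_path, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for info in src.infolist():
            if info.filename == PRINTER_PART:
                continue  # step 3: drop the binary member itself
            data = src.read(info.filename)
            if info.filename == "[Content_Types].xml":
                # step 1: drop the Default for the "bin" extension
                data = _drop_children(
                    data, CT_NS,
                    lambda el: el.tag == f"{{{CT_NS}}}Default"
                    and el.get("Extension") == "bin",
                )
            elif info.filename == "ppt/_rels/presentation.xml.rels":
                # step 2: drop the printerSettings relationship
                data = _drop_children(
                    data, REL_NS, lambda el: el.get("Type") == PRINTER_REL_TYPE
                )
            dst.writestr(info.filename, data)
```

Writing to a new file keeps the original untouched, in line with the "work on a copy" advice.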
Looks like VS Code is not a great option, my mistake. I would go with opc-diag for the time being.
pip install git+https://github.com/python-openxml/opc-diag.git@develop
Docs are here: https://opc-diag.readthedocs.io/en/latest/
First move would be to extract so you can inspect and edit the XML:
$ opc extract bad-default.pptx bad-default
Then just make the edits I mentioned above and repackage with:
$ opc repackage bad-default new-bad-default.pptx
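If installing opc-diag turns out to be a hurdle, a rough stand-in for the extract and repackage steps using only the Python standard library might look like this. The function names are made up for this sketch, it makes no attempt to reformat the XML for readability, and writing [Content_Types].xml as the first archive member is a precaution assumed here rather than something confirmed in this thread.

```python
import os
import zipfile


def extract_package(pptx_path, out_dir):
    """Unzip a .pptx into out_dir so its XML parts can be edited by hand."""
    with zipfile.ZipFile(pptx_path) as zf:
        zf.extractall(out_dir)


def repackage(src_dir, pptx_path):
    """Zip the contents of src_dir back into a .pptx file."""
    members = []
    for folder, _dirs, files in os.walk(src_dir):
        for name in files:
            full = os.path.join(folder, name)
            # Archive names always use forward slashes, even on Windows.
            arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
            members.append((arcname, full))
    # Put [Content_Types].xml first, then the rest in a stable order.
    members.sort(key=lambda m: (m[0] != "[Content_Types].xml", m[0]))
    with zipfile.ZipFile(pptx_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for arcname, full in members:
            zf.write(full, arcname)
```

Usage would mirror the two commands above: `extract_package("bad-default.pptx", "bad-default")`, edit the files, then `repackage("bad-default", "new-bad-default.pptx")`.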
Hmm, k, just reading your experimental method there, it looks like the printerSettings is not the culprit.
Do you have Windows PowerPoint available or only Keynote?
python-pptx is only going to place content-types that actually occur in the presentation, whereas it looks like Keynote just adds in a large block of standard defaults.
I think it's time for a bisection approach:
Add the removed Default elements back into the [Content_Types].xml of bad-default and see if that fixes the problem. That should give us a baseline and confirm that you're on the right track.
@scanny I did what you said and was able to open it on Windows without repairing it. We are getting closer. I'm going to bisect and see which default extension is mandatory.
Ok @scanny, I have narrowed it down to this Default:
<Default Extension="jpeg" ContentType="image/jpg"/>
My guess is that Microsoft Office 365 can't handle Extension=jpg and needs Extension=jpeg.
@scanny Scratch that, seems like there is something else which is not working properly here.
Yeah, "image/jpg" does not appear anywhere in the python-pptx code, so that wouldn't be it I don't think. We use image/jpeg for all extensions in ("jpe", "jpeg", "jpg").
Your opc-diag error is coming from not installing from the develop branch. Use the pip command I mentioned.
Creating a brand new presentation like this:
Creates a test.pptx file which Microsoft Office can't open and needs to repair. It seems like python-pptx is doing something to corrupt the PowerPoint file. Has anyone experienced the same problem?
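The snippet referred to above did not survive in this copy of the thread. A minimal reproduction along the lines described, assuming nothing beyond the standard `Presentation()` and `save()` calls of python-pptx, would presumably look like this (the wrapper function name is made up for the example):

```python
def make_blank_presentation(path="test.pptx"):
    """Create a presentation from python-pptx's built-in default template and save it."""
    from pptx import Presentation  # third-party: pip install python-pptx

    prs = Presentation()  # no template argument, so the default template is used
    prs.save(path)
    return path
```

Calling `make_blank_presentation()` writes test.pptx to the current directory; that file is the one to try opening in PowerPoint on Windows.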