sergey-tihon / Clippit

Fresh PowerTools for OpenXml
https://sergey-tihon.github.io/Clippit/
MIT License
PresentationBuilder.PublishSlides generates slides with different data #40

Open f1nzer opened 2 years ago

f1nzer commented 2 years ago

I'm using PresentationBuilder.PublishSlides to generate slides from the original pptx file. The problem is that this method returns non-deterministic results from run to run: each slide's DocumentByteArray differs by several bytes between runs.

Is this expected behavior? Thanks.

Simple repro code (.NET SDK 6.0.100, Clippit 1.8.1):

using System.IO;
using System.Linq;
using Clippit.PowerPoint;
using DocumentFormat.OpenXml.Packaging;
using Xunit;

namespace PptxTest;

public class UnitTest1
{
    [Fact]
    public void PublishSlides_Should_GenerateSameDataInTwoRuns()
    {
        const string filePath = @"use any pptx file path";

        var sizesForSlides1 = SplitPptxAndGetByteSizesForSlides(filePath);
        var sizesForSlides2 = SplitPptxAndGetByteSizesForSlides(filePath);

        Assert.Equal(sizesForSlides1, sizesForSlides2);
    }

    private static int[] SplitPptxAndGetByteSizesForSlides(string filePath)
    {
        using var fileContentStream = File.OpenRead(filePath);
        using var document = PresentationDocument.Open(fileContentStream, false);
        var slides = PresentationBuilder.PublishSlides(document, null);

        return slides.Select(slide => slide.DocumentByteArray.Length).ToArray();
    }
}

sergey-tihon commented 2 years ago

I am not quite sure, but I think that ZIP archives (*.pptx, *.docx, *.xlsx) are not deterministic by their nature.

According to Wikipedia http://en.wikipedia.org/wiki/Zip_(file_format), it seems that zip files have headers for "File last modification time" and "File last modification date", so any zip file checked into git will appear to git to have changed if the zip is rebuilt from the same content. And it seems that there is no flag to tell it not to set those headers.

From SO
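To illustrate the timestamp point, here is a minimal sketch that uses System.IO.Compression directly rather than Clippit or the Open XML SDK. It builds the same one-entry archive twice and changes only the entry's last-modification time. The class name ZipTimestampDemo, the method BuildZip, and the entry name slide.xml are made up for this example. It makes no claim about how PublishSlides writes its packages.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of Clippit.
public static class ZipTimestampDemo
{
    // Builds a one-entry ZIP in memory and stamps the entry with the given time.
    public static byte[] BuildZip(string content, DateTimeOffset lastWriteTime)
    {
        using var buffer = new MemoryStream();
        using (var archive = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
        {
            var entry = archive.CreateEntry("slide.xml");
            // In Create mode the timestamp must be set before the entry is written.
            entry.LastWriteTime = lastWriteTime;
            using var writer = new StreamWriter(entry.Open(), new UTF8Encoding(false));
            writer.Write(content);
        }
        // The archive is disposed here, so the central directory has been written.
        return buffer.ToArray();
    }
}
```

Calling BuildZip twice with the same content and the same timestamp should give identical bytes. Calling it with two timestamps more than two seconds apart (ZIP stores time at 2-second resolution) should give arrays of the same length that differ in the timestamp fields. So timestamps alone would change a few bytes, but not the Length values that the repro compares.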

f1nzer commented 2 years ago

That's interesting.

In addition to that, in my case the _rels/.rels file contains several <Relationship ...> tags whose Id values differ from run to run (the same file from another run has a different set of Ids). The same is true of the other .rels files (see the ppt folder).

sergey-tihon commented 2 years ago

Ha! You are right, OpenXmlPowerTools historically uses GUIDs as relationship IDs https://github.com/sergey-tihon/Clippit/blob/e0da582d4f0149788429224f5bffeae4cffe96ff/OpenXmlPowerTools/PowerPoint/PresentationBuilderTools.cs#L533-L534
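As a rough sketch of why GUID-based IDs make the output differ between runs, compare a random ID with a sequential one. Both helpers below are hypothetical: the "R" prefix, the "rId" prefix, and all names are assumptions for illustration, not the code at the link above and not an existing Clippit API.

```csharp
using System;

// Hypothetical: a GUID-based relationship ID, different on every call.
public static class RelationshipIdDemo
{
    public static string NewGuidId() => "R" + Guid.NewGuid().ToString("N");
}

// Hypothetical deterministic alternative: a counter scoped to one package part,
// so the same input always yields the same sequence of IDs.
public sealed class SequentialIdGenerator
{
    private int _next;

    public string NextId() => "rId" + (++_next);
}
```

Two calls to NewGuidId() will practically never return the same value, so every .rels file changes from run to run. The changed text also compresses slightly differently, which would explain the differing byte lengths. A fresh SequentialIdGenerator always starts at rId1, but a real implementation would also have to avoid colliding with IDs already present in the part.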