ststeiger / PdfSharpCore

Port of the PdfSharp library to .NET Core - largely removed GDI+ (only missing GetFontData - which can be replaced with freetype2)

Exception thrown when setting trailer info element to null #432

Open cacowen opened 3 months ago

cacowen commented 3 months ago

When opening specific documents I get an exception: {"Value cannot be null. (Parameter 'value')"}

Unfortunately, these documents come from an external source, and all of them have this issue. As a workaround, I have to open each PDF manually in Acrobat (or a similar tool) and save it again without changing anything. This seems to add something to the "INFO" entry, after which the exception is no longer thrown. I would like to be able to open the original file in code.

Stack trace:

at PdfSharpCore.Pdf.PdfDictionary.DictionaryElements.set_Item(String key, PdfItem value) in PdfSharpCore.Pdf\PdfDictionary.cs:line 49

at PdfSharpCore.Pdf.Advanced.PdfTrailer.Finish() in PdfSharpCore.Pdf.Advanced\PdfTrailer.cs:line 158

at PdfSharpCore.Pdf.IO.PdfReader.Open(Stream stream, String password, PdfDocumentOpenMode openmode, PdfPasswordProvider passwordProvider, PdfReadAccuracy accuracy) in PdfSharpCore.Pdf.IO\PdfReader.cs:line 380

This happens in this code when trying to set the info element to null: https://github.com/ststeiger/PdfSharpCore/blob/cdf089b6c4d6b379aead95f463911dd009ae194e/PdfSharpCore/Pdf.Advanced/PdfTrailer.cs#L191C13-L198C14

iref = _document._trailer.Elements[PdfTrailer.Keys.Info] as PdfReference;
if (iref != null && iref.Value == null)
{
    iref = _document._irefTable[iref.ObjectID]; // <-- this comes back as `null`
    Debug.Assert(iref.Value != null);
    _document._trailer.Elements[Keys.Info] = iref; // <-- this causes the exception
}

Expected behavior:

Do not crash: either allow setting the value to null, or skip setting the element when the value is null.
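
The "skip setting the element if the value is null" option can be illustrated with a small stand-alone model. This is only a sketch: `ElementsSketch`, `TrailerSketch`, and `TrySetInfo` are hypothetical names and not part of PdfSharpCore. The one assumption taken from the report is that the real setter rejects null, which matches the `Value cannot be null. (Parameter 'value')` message. A guard like this only avoids the exception; it does not repair whatever made the lookup return null.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for PdfDictionary.DictionaryElements.
// Assumption: like the setter in the stack trace, it rejects null values.
public sealed class ElementsSketch
{
    private readonly Dictionary<string, object> _items = new Dictionary<string, object>();

    public object this[string key]
    {
        get { return _items.TryGetValue(key, out var value) ? value : null; }
        set
        {
            if (value == null)
                throw new ArgumentNullException(nameof(value));
            _items[key] = value;
        }
    }
}

public static class TrailerSketch
{
    // Assigns the resolved reference only when the lookup produced one.
    // Returns true if the element was replaced, false if it was left untouched.
    public static bool TrySetInfo(ElementsSketch elements, object resolved)
    {
        if (resolved == null)
            return false; // skip instead of passing null to the setter
        elements["/Info"] = resolved;
        return true;
    }

    public static void Main()
    {
        var elements = new ElementsSketch();
        elements["/Info"] = "unresolved reference";

        // Simulates the table lookup coming back as null.
        bool replaced = TrySetInfo(elements, null);
        Console.WriteLine(replaced);          // prints "False"
        Console.WriteLine(elements["/Info"]); // prints "unresolved reference"
    }
}
```

With the guard in place the existing `/Info` entry stays as it was, so later code still has to cope with a reference that could not be resolved.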

GeorgRottensteiner commented 2 months ago

I'm running into the exact same issue. Since my use case is automated extraction, repairing the file manually is not an option.

Edit: Simply skipping the assignment of the reference does not fully solve it for me. The document ends up with zero pages, although it displays fine in any other PDF viewer. Unfortunately, I cannot share the document because it contains private data.

StLange commented 1 month ago

It seems that the PDF is formatted incorrectly and Acrobat can fix it. We would like to fix this, but without the PDF file it is not possible. Please send the file to “issues (at) pdfsharp.net”. We will keep the PDF file confidential and use it only to fix the bug.

cacowen commented 1 month ago

File has been sent. Issue still happens in PdfSharpCore. The file can be read in PdfSharp, PdfPig, and others. Thank you.

StLange commented 1 month ago

I have received the file. The reason for the issue is that the reference to object 312 is mentioned twice in the file. In PDFsharp (not in PdfSharpCore) I fixed this by removing the first entry when an identical second entry occurred. See original source code from PDFsharp 6.1 below.

In PdfSharpCore just call ObjectTable.Remove(iref.ObjectID) if the object is already in the table. This is line 75 in PdfCrossReferenceTable.cs in PdfSharpCore. I did not test it, but I’m pretty sure that it works.

        /// <summary>
        /// Adds a cross-reference entry to the table. Used when parsing the trailer.
        /// </summary>
        public void Add(PdfReference iref)
        {
            if (iref.ObjectID.IsEmpty)
                iref.ObjectID = new(GetNewObjectNumber());

            // ReSharper disable once CanSimplifyDictionaryLookupWithTryAdd because it would not build with .NET Framework.
            if (ObjectTable.ContainsKey(iref.ObjectID))
            {
#if true_
                // Really happens with existing (bad) PDF files.
                // See file 'Detaljer.ARGO.KOD.rev.B.pdf' from https://github.com/ststeiger/PdfSharpCore/issues/362
                throw new InvalidOperationException("Object already in table.");
#else
                // We remove the existing one and use the latter reference.
                // HACK: This is just a quick fix that may not be the best solution in all cases.
                // On GitHub user packdat provides a PR that orders objects. This code is not yet integrated,
                // because releasing 6.1.0 had a higher priority. We will fix this in 6.2.0.
                // However, this quick fix is better than throwing an exception in all cases.
                PdfSharpLogHost.PdfReadingLogger.LogError("Object '{ObjectID}' already exists in xref table. The latter one is used.", iref.ObjectID);
                ObjectTable.Remove(iref.ObjectID);
#endif
            }
            ObjectTable.Add(iref.ObjectID, iref);
        }
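
The "remove the existing entry and use the latter one" behaviour in the snippet above can be tried outside the library with a simplified model. This is a sketch under stated assumptions: `XrefTableSketch` is a hypothetical class that keys entries by a plain `int` object number and stores a `string` instead of a `PdfReference`. It is not the PdfSharpCore or PDFsharp implementation and has not been tested against the reported file.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified model of a cross-reference table keyed by object number.
public sealed class XrefTableSketch
{
    private readonly Dictionary<int, string> _objectTable = new Dictionary<int, string>();

    // Adds an entry. If the object number is already present, the earlier
    // entry is removed first so that the later one is used.
    public void Add(int objectNumber, string entry)
    {
        if (_objectTable.ContainsKey(objectNumber))
            _objectTable.Remove(objectNumber);
        _objectTable.Add(objectNumber, entry);
    }

    // Returns the stored entry, or null if the object number is unknown.
    public string Get(int objectNumber)
    {
        return _objectTable.TryGetValue(objectNumber, out var entry) ? entry : null;
    }

    public int Count { get { return _objectTable.Count; } }
}

public static class XrefDemo
{
    public static void Main()
    {
        var table = new XrefTableSketch();
        table.Add(312, "first entry for object 312");
        table.Add(312, "second entry for object 312"); // duplicate, as in the reported file

        Console.WriteLine(table.Get(312)); // prints "second entry for object 312"
        Console.WriteLine(table.Count);    // prints "1"
    }
}
```

Without the `Remove` call, the second `Add` would throw an `ArgumentException` from `Dictionary.Add`, because the key already exists.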