ststeiger / PdfSharpCore

Port of the PdfSharp library to .NET Core - largely removed GDI+ (only missing GetFontData - which can be replaced with freetype2)

Image flipped vertically #448

Open robertovaldesperez opened 4 months ago

robertovaldesperez commented 4 months ago

Hello @jafin @ststeiger @saikatguha @nils-a @HakanL @Bogdancev, I am extracting the images from a PDF. They look fine inside the PDF, but when I extract them they come back vertically flipped.

I am attaching an example file: 6.335.1 0034220637_tasacion.pdf

Thanks a lot.

robertovaldesperez commented 4 months ago

@Bogdancev can you help me?

HakanL commented 4 months ago

It seems that the attached PDF is 0 bytes. Also please include a small program that demonstrates the issue.

robertovaldesperez commented 4 months ago

Hi @HakanL, here is the example file again: 6.335.1 0034220637_tasacion.pdf

My code:


```csharp
using ImageMagick; // Magick.NET, provides MagickImage
using PdfSharpCore.Pdf;
using PdfSharpCore.Pdf.Advanced;
using PdfSharpCore.Pdf.IO;
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace Guru.Utils.Helper
{
    public static class PdfSharpCoreExtensions
    {
        // Returns the distinct images found in the PDF, re-encoded by MagickImage.
        public static ISet<byte[]> ExtractImages(this byte[] contents)
        {
            using var pdfStream = new MemoryStream(contents);
            try
            {
                var document = PdfReader.Open(pdfStream, PdfDocumentOpenMode.ReadOnly);
                var uniqueImages = new HashSet<byte[]>();
                var images = new HashSet<byte[]>();
                foreach (var page in document.Pages)
                {
                    foreach (var xObject in GetXObjectImages(page))
                    {
                        try
                        {
                            // Raw stream bytes of the image XObject, exactly as stored in the PDF.
                            var value = xObject.Stream.Value;
                            // Skip images whose bytes were already seen (compared by content).
                            if (!uniqueImages.Any(w => StructuralComparisons.StructuralEqualityComparer.Equals(w, value)))
                            {
                                uniqueImages.Add(value);
                                if (xObject.Elements.GetString("/Filter") == "/FlateDecode")
                                {
                                    // TODO
                                }
                                else
                                {
                                    using var image = new MagickImage(value);
                                    images.Add(image.ToByteArray());
                                }
                            }
                        }
                        catch (Exception)
                        {
                            // Do nothing
                        }
                    }
                }
                return images;
            }
            catch (Exception)
            {
                // Do nothing
            }
            return new HashSet<byte[]>();
        }

        // Yields every image XObject under the dictionary's /Resources,
        // recursing into non-image XObjects (for example forms).
        private static IEnumerable<PdfDictionary> GetXObjectImages(PdfDictionary pdfDictionary)
        {
            var resources = pdfDictionary.Elements.GetDictionary("/Resources");
            if (resources != null)
            {
                var xObjects = resources.Elements.GetDictionary("/XObject");
                if (xObjects != null)
                {
                    foreach (var item in xObjects.Elements.Values)
                    {
                        if (item is PdfReference reference)
                        {
                            if (reference.Value is PdfDictionary xObject)
                            {
                                if (xObject.Elements.GetString(PdfImage.Keys.Subtype) == "/Image")
                                {
                                    yield return xObject;
                                }
                                else
                                {
                                    foreach (var xObject1 in GetXObjectImages(xObject))
                                    {
                                        if (xObject1.Elements.GetString(PdfImage.Keys.Subtype) == "/Image")
                                        {
                                            yield return xObject1;
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```

HakanL commented 4 months ago

It looks like you just extract the raw byte array and then pass it to a library called MagickImage for the image processing. My guess is that's where the issue is: it may not correctly handle how images are stored inside a PDF.

robertovaldesperez commented 4 months ago

Hi @HakanL, I also tried another file, UVE 01.pdf, and all of its images are extracted correctly.

I don't know if it's the way the PDF was saved. Can you debug the PDF (6.335.1 0034220637_tasacion.pdf) internally to see if anything indicates that the image is flipped vertically?

Thanks a lot.

HakanL commented 4 months ago

It may be a different format inside the PDF. Unfortunately I don't have a setup to debug this. I'm not a developer on this project, but the source code is available, so perhaps you can try to analyze it. To my recollection, though, this project doesn't analyze, read, or parse the images, so I believe this is not an issue with the PdfSharpCore project.
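
One possible explanation, not confirmed for this file: a PDF page can draw an image through a transformation matrix with a negative vertical scale, so the image looks upright on the page even though the stored pixel rows are upside down. The raw image stream returned by the extraction code does not have that matrix applied. If that turns out to be the cause, a workaround is to flip the pixel rows after extraction. The sketch below assumes already-decoded, uncompressed pixel data with a known row stride; `ScanlineFlip` and `FlipVertically` are hypothetical names, not part of PdfSharpCore or Magick.NET.

```csharp
using System;

public static class ScanlineFlip
{
    // Returns a copy of `pixels` with the row order reversed (a vertical flip).
    // Assumption: rows are stored contiguously, `stride` bytes per row,
    // with no padding beyond the stride.
    public static byte[] FlipVertically(byte[] pixels, int stride, int height)
    {
        if (pixels == null) throw new ArgumentNullException(nameof(pixels));
        if (stride <= 0 || height <= 0 || pixels.Length != (long)stride * height)
            throw new ArgumentException("Buffer size must equal stride * height.");

        var flipped = new byte[pixels.Length];
        for (int row = 0; row < height; row++)
        {
            // Source row `row` lands at the mirrored position `height - 1 - row`.
            Buffer.BlockCopy(pixels, row * stride, flipped, (height - 1 - row) * stride, stride);
        }
        return flipped;
    }
}
```

This only applies to raw pixel buffers. For encoded images such as JPEG, the bytes would first need to be decoded, or the flip done with the image library's own operation (Magick.NET exposes `MagickImage.Flip()` for a vertical flip).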