Open robertovaldesperez opened 4 months ago
@Bogdancev can you help me?
It seems that the attached PDF is 0 bytes. Also please include a small program that demonstrates the issue.
Hi @HakanL, here is an example file: 6.335.1 0034220637_tasacion.pdf
My code:
```csharp
using ImageMagick; // Magick.NET, provides MagickImage
using PdfSharpCore.Pdf;
using PdfSharpCore.Pdf.Advanced;
using PdfSharpCore.Pdf.IO;
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace Guru.Utils.Helper
{
    public static class PdfSharpCoreExtensions
    {
        public static ISet<byte[]> ExtractImages(this byte[] contents)
        {
            using var pdfStream = new MemoryStream(contents);
            try
            {
                var document = PdfReader.Open(pdfStream, PdfDocumentOpenMode.ReadOnly);
                var uniqueImages = new HashSet<byte[]>();
                var images = new HashSet<byte[]>();
                foreach (var page in document.Pages)
                {
                    foreach (var xObject in GetXObjectImages(page))
                    {
                        try
                        {
                            // Raw stream bytes, still encoded with the XObject's /Filter
                            var value = xObject.Stream.Value;
                            // Skip streams whose bytes were already seen
                            if (!uniqueImages.Any(w => StructuralComparisons.StructuralEqualityComparer.Equals(w, value)))
                            {
                                uniqueImages.Add(value);
                                if (xObject.Elements.GetString("/Filter") == "/FlateDecode")
                                {
                                    // TODO
                                }
                                else
                                {
                                    using var image = new MagickImage(value);
                                    images.Add(image.ToByteArray());
                                }
                            }
                        }
                        catch (Exception)
                        {
                            // Do nothing
                        }
                    }
                }
                return images;
            }
            catch (Exception)
            {
                // Do nothing
            }
            return new HashSet<byte[]>();
        }

        // Yields every image XObject under the dictionary's /Resources,
        // recursing into non-image XObjects (e.g. forms).
        private static IEnumerable<PdfDictionary> GetXObjectImages(PdfDictionary pdfDictionary)
        {
            var resources = pdfDictionary.Elements.GetDictionary("/Resources");
            if (resources != null)
            {
                var xObjects = resources.Elements.GetDictionary("/XObject");
                if (xObjects != null)
                {
                    foreach (var item in xObjects.Elements.Values)
                    {
                        if (item is PdfReference reference)
                        {
                            if (reference.Value is PdfDictionary xObject)
                            {
                                if (xObject.Elements.GetString(PdfImage.Keys.Subtype) == "/Image")
                                {
                                    yield return xObject;
                                }
                                else
                                {
                                    foreach (var xObject1 in GetXObjectImages(xObject))
                                    {
                                        if (xObject1.Elements.GetString(PdfImage.Keys.Subtype) == "/Image")
                                        {
                                            yield return xObject1;
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
It looks like you just extract the byte array and then pass it to MagickImage for the image processing. My guess is that's where the issue is: it may not correctly handle the way images are stored inside a PDF.
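One way to narrow this down is to check what the extracted bytes actually are before handing them to an image library. A stream with the `/DCTDecode` filter is typically a complete JPEG file, while other filters yield data that is not a self-contained image file. The sketch below is only an illustration: `ImageStreamSniffer` and `GuessContainer` are hypothetical names, not part of PdfSharpCore or Magick.NET, and it only looks at the leading magic bytes.

```csharp
using System;

// Hypothetical helper, not part of PdfSharpCore or Magick.NET.
public static class ImageStreamSniffer
{
    // Guesses the container format of an extracted stream from its first bytes.
    // Returns "jpeg", "jpeg2000", or "unknown" (for example raw or compressed samples).
    public static string GuessContainer(byte[] data)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));

        // JPEG files start with the SOI marker FF D8 followed by another marker (FF ..).
        if (StartsWith(data, 0xFF, 0xD8, 0xFF)) return "jpeg";

        // JP2 files start with the 12-byte signature box; the first 8 bytes are checked here.
        if (StartsWith(data, 0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20)) return "jpeg2000";

        return "unknown";
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Calling `ImageStreamSniffer.GuessContainer(xObject.Stream.Value)` for each image in the two PDFs would show whether the working file and the problematic file store their images the same way.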
Hi @HakanL I have tried this file UVE 01.pdf as well, and it extracts all the images fine.
I don't know if it's the way the PDF is saved. Can you debug the PDF (6.335.1 0034220637_tasacion.pdf) internally to see if anything indicates that the image is flipped vertically?
Thanks a lot.
It may be a different format inside the PDF. Unfortunately I don't have a setup to debug this, and I'm not a developer on this project, but the source code is available, so perhaps you can try to analyze it. To my recollection this project doesn't analyze, read, or parse the images, so I believe it's not an issue with the PdfSharpCore project.
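One possible explanation, which is an assumption and has not been checked against this file: a PDF page places each image through a transformation matrix in the page content stream, so a producer can store the rows upside down and compensate with a negative vertical scale. The page then renders correctly, but the extracted stream looks flipped. If that turns out to be the case, and the image has been decoded to raw, uncompressed samples, a workaround is to reverse the row order. The sketch below is a minimal illustration; `RawImageRows` and `FlipVertically` are hypothetical names, and it does not apply to still-encoded data such as a JPEG stream.

```csharp
using System;

// Hypothetical helper for raw, uncompressed pixel buffers only.
public static class RawImageRows
{
    // Returns a copy of the buffer with its rows in reverse order.
    // stride is the number of bytes per row, including any padding.
    public static byte[] FlipVertically(byte[] pixels, int height, int stride)
    {
        if (pixels == null) throw new ArgumentNullException(nameof(pixels));
        if (height <= 0) throw new ArgumentOutOfRangeException(nameof(height));
        if (stride <= 0) throw new ArgumentOutOfRangeException(nameof(stride));
        if (pixels.Length != (long)height * stride)
        {
            throw new ArgumentException("Buffer length must equal height * stride.", nameof(pixels));
        }

        var flipped = new byte[pixels.Length];
        for (int row = 0; row < height; row++)
        {
            // Copy source row `row` into the mirrored position from the bottom.
            Buffer.BlockCopy(pixels, row * stride, flipped, (height - 1 - row) * stride, stride);
        }
        return flipped;
    }
}
```

For an 8-bit RGB image the stride would be `width * 3`; the values for a given image have to come from its `/Width`, `/Height`, `/BitsPerComponent`, and color space entries.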
Hello @jafin @ststeiger @saikatguha @nils-a @HakanL @Bogdancev, I am extracting the images from a PDF. They look fine in the PDF, but when I extract them they come out vertically flipped.
Here is an example file: 6.335.1 0034220637_tasacion.pdf
Thanks a lot.