thombrink / itext7.pdfimage

Pdf to image converter based on itext7
MIT License

An item with the same key has already been added. Key: [32768, itext.pdfimage.Models.TextChunk] #10

Open tsiank opened 3 years ago

tsiank commented 3 years ago

I hit an exception at the line `var bitmap1 = page.ConvertPageToBitmap();` when using this pdfimage converter to convert a PDF to JPG. Details are below:

    System.ArgumentException
    HResult=0x80070057
    Message=An item with the same key has already been added. Key: [32768, itext.pdfimage.Models.TextChunk]
    Source=System.Collections
    StackTrace:
       at System.Collections.Generic.TreeSet`1.AddIfNotPresent(T item)
       at System.Collections.Generic.SortedDictionary`2.Add(TKey key, TValue value)
       at iText.Kernel.Pdf.Canvas.Parser.Listener.TextListener.EventOccurred(IEventData data, EventType type)
       at iText.Kernel.Pdf.Canvas.Parser.Listener.FilteredEventListener.EventOccurred(IEventData data, EventType type)
       at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.EventOccurred(IEventData data, EventType type)
       at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.InvokeOperator(PdfLiteral operator, IList`1 operands)
       at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.ProcessContent(Byte[] contentBytes, PdfResources resources)
       at itext.pdfimage.PdfToImageConverter.ConvertToBitmap(PdfPage pdfPage)
       at ATA.DeptOfProject4.EtsTest.PdfOp.ConvertPdfToJpg(String pdffile, String imageSource, Dictionary`2 dict) in D:\pdfopt.cs:line 57
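For context, the exception type and message in this trace are what `SortedDictionary<TKey,TValue>.Add` produces when the key is already present. The sketch below reproduces only that behavior in isolation. It does not use itext7.pdfimage; the `float` key type, the string values, and the names `DuplicateKeyDemo` and `AddTwiceThrows` are assumptions made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateKeyDemo
{
    // Returns true if adding the same key a second time throws ArgumentException.
    public static bool AddTwiceThrows()
    {
        // Hypothetical stand-in for the library's chunk dictionary.
        var chunks = new SortedDictionary<float, string>();
        chunks.Add(32768f, "first chunk");
        try
        {
            chunks.Add(32768f, "second chunk"); // same key again
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(AddTwiceThrows());
    }
}
```

So the failure means two text chunks on the page ended up with the same sort key; the trace alone does not say why.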

thombrink commented 3 years ago

Does this only occur with a certain pdf? If so, would you be so kind to upload it?

tsiank commented 3 years ago

> Does this only occur with a certain pdf? If so, would you be so kind to upload it?

I am converting several PDFs in a "for" loop, and the bug seems to appear once the total number of pages across all PDFs exceeds roughly 300. My code snippet is below:

`// Namespaces needed: System, System.Collections.Generic, System.Drawing.Imaging, System.IO,
// System.Text.RegularExpressions, iText.Kernel.Pdf, iText.Kernel.Pdf.Canvas.Parser,
// iText.Kernel.Pdf.Canvas.Parser.Listener, plus the itext.pdfimage namespace that provides ConvertPageToBitmap().
public static void ConvertPdfToJpg(string pdffile, string imageSource, Dictionary<string,string> dict)
{
    // Output directory: the PDF path with its extension removed.
    string pdffileDir = Path.ChangeExtension(pdffile, null);
    Directory.CreateDirectory(pdffileDir); // no-op if it already exists

    // Capture the 15-17 digit student number that follows the label.
    Regex rgx = new Regex(@"Student Number:.*?([0-9]{15,17})");

    // Dispose the document (and its reader) when done.
    using (PdfDocument pdfDoc = new PdfDocument(new PdfReader(pdffile)))
    {
        int numberOfPages = pdfDoc.GetNumberOfPages();

        for (int i = 1; i <= numberOfPages; i++)
        {
            PdfPage page = pdfDoc.GetPage(i);
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentPageText = PdfTextExtractor.GetTextFromPage(page, strategy);

            Match match = rgx.Match(currentPageText);
            if (!match.Success)
            {
                // The original indexed matchret[0] unconditionally, which throws on such pages.
                throw new InvalidOperationException($"No student number found on page {i} of {pdffile}.");
            }
            string examid = match.Groups[1].Value;

            // The ArgumentException from this issue is thrown here.
            using (var bitmap1 = page.ConvertPageToBitmap())
            {
                // Use the mapped name if known; otherwise flag the file with a "!!!" prefix.
                string filename = dict.TryGetValue(examid, out string mapped) ? mapped : "!!!" + examid;
                bitmap1.Save(Path.Combine(pdffileDir, filename + ".jpg"), ImageFormat.Jpeg);
            }
        }
    }
}`
tsiank commented 3 years ago

The PDF is private. May I have your email address so I can send it to you?

tsiank commented 3 years ago

I have found another bug: some PDF content is missing from the converted JPG.

thombrink commented 3 years ago

> I have found another bug: some PDF content is missing from the converted JPG.

Please create a separate issue and provide an example PDF.

tsiank commented 3 years ago

I downloaded your source code and debugged it. I found that changing the key type of chunkDictionairy to double fixes this bug, but I am not sure whether it still holds up as the number of PDF pages grows.
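One possible reason a wider key type helps, sketched under assumptions that have not been checked against the library source: if the key is a `float` that grows by a small step per chunk, then at 32768 adjacent `float` values are 1/256 apart, so a smaller step is rounded away and the key stops changing. The step of 0.001 and the names `KeyPrecisionDemo`, `FloatStalls`, and `DoubleStalls` are hypothetical.

```csharp
using System;

public static class KeyPrecisionDemo
{
    // True when adding `step` leaves a float key unchanged.
    // The explicit cast forces the sum back to float precision.
    public static bool FloatStalls(float key, float step)
    {
        return (float)(key + step) == key;
    }

    // The same check at double precision.
    public static bool DoubleStalls(double key, double step)
    {
        return key + step == key;
    }

    public static void Main()
    {
        // 0.001 is a made-up step, not a value taken from itext7.pdfimage.
        Console.WriteLine(FloatStalls(100f, 0.001f));   // small key: the step still registers
        Console.WriteLine(FloatStalls(32768f, 0.001f)); // at 32768 the step is rounded away
        Console.WriteLine(DoubleStalls(32768d, 0.001d));
    }
}
```

A `double` has the same limit at much larger magnitudes, so under these assumptions the change would move the collision point very far out instead of removing it, which bears on the question of whether it still works for more pages.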