Could not extract text from PDF

pankajinfo3 commented 6 months ago

System.ArgumentOutOfRangeException: Non-negative number required. (Parameter 'value')
   at System.IO.FileStream.set_Position(Int64 value)
   at UglyToad.PdfPig.Core.StreamInputBytes.Seek(Int64 position)
   at UglyToad.PdfPig.Parser.FileStructure.FileHeaderParser.TryBruteForceVersionLocation(Int64 startPosition, IInputBytes inputBytes, HeaderVersion& headerVersion)
   at UglyToad.PdfPig.Parser.FileStructure.FileHeaderParser.Parse(ISeekableTokenScanner scanner, IInputBytes inputBytes, Boolean isLenientParsing, ILog log)
   at UglyToad.PdfPig.Parser.PdfDocumentFactory.OpenDocument(IInputBytes inputBytes, ISeekableTokenScanner scanner, InternalParsingOptions parsingOptions)
   at UglyToad.PdfPig.Parser.PdfDocumentFactory.Open(Stream stream, ParsingOptions options)
   at UmbracoExamine.PDF.PdfPigTextExtractor.GetTextFromPdf(Stream pdfFileStream)
   at UmbracoExamine.PDF.PdfTextService.ExtractText(String filePath)
   at UmbracoExamine.PDF.PdfIndexValueSetBuilder.ExtractTextFromFile(String filePath)

Shazwazza commented 6 months ago

@pankajinfo3 Please consider being a little more helpful and providing information that would let us assist you, such as:

- Steps to reproduce
- Versions of software used
- The file that caused this issue

pankajinfo3 commented 6 months ago

I get this error when reindexing, and not all files are indexed.

pankajinfo3 commented 6 months ago

@Shazwazza,

here is another error, which I get when running "Rebuild index" on PDFIndex.

Umbraco version: 12.3.6

With this issue, the PDFIndex search does not show any results.

System.Collections.Generic.KeyNotFoundException: No item with key /Im3 in stack.
   at UglyToad.PdfPig.Util.StackDictionary`2.get_Item(K key)
   at UglyToad.PdfPig.Content.ResourceStore.GetXObject(NameToken name)
   at UglyToad.PdfPig.Graphics.ContentStreamProcessor.ApplyXObject(NameToken xObjectName)
   at UglyToad.PdfPig.Graphics.ContentStreamProcessor.ProcessOperations(IReadOnlyList`1 operations)
   at UglyToad.PdfPig.Graphics.ContentStreamProcessor.Process(Int32 pageNumberCurrent, IReadOnlyList`1 operations)
   at UglyToad.PdfPig.Parser.PageFactory.GetContent(Int32 pageNumber, IReadOnlyList`1 contentBytes, CropBox cropBox, UserSpaceUnit userSpaceUnit, PageRotationDegrees rotation, MediaBox mediaBox, InternalParsingOptions parsingOptions)
   at UglyToad.PdfPig.Parser.PageFactory.Create(Int32 number, DictionaryToken dictionary, PageTreeMembers pageTreeMembers, InternalParsingOptions parsingOptions)
   at UglyToad.PdfPig.Content.Pages.GetPage(Int32 pageNumber, InternalParsingOptions parsingOptions)
   at UglyToad.PdfPig.PdfDocument.GetPage(Int32 pageNumber)
   at UglyToad.PdfPig.PdfDocument.GetPages()+MoveNext()
   at UmbracoExamine.PDF.PdfPigTextExtractor.GetTextFromPdf(Stream pdfFileStream)
   at UmbracoExamine.PDF.PdfTextService.ExtractText(String filePath)
   at UmbracoExamine.PDF.PdfIndexValueSetBuilder.ExtractTextFromFile(String filePath)

Shazwazza commented 5 months ago

Hi, you can see in the stack trace that the core issue is coming from the UglyToad library that does the PDF reading. This is the NuGet package dependency that UmbracoExamine.PDF uses to read PDFs: https://github.com/UglyToad/PdfPig. You can file an issue there, or see if just upgrading to the latest supported version of their library fixes the issue. It looks like the latest stable version of that library is 0.1.8: https://www.nuget.org/packages/PdfPig/0.1.8
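If you do file an issue with PdfPig, it usually helps to show that the error happens without Umbraco involved. Something like the following console program might be enough. It is only a rough sketch: the file name `failing-file.pdf` is a placeholder for one of the PDFs that fails during indexing, and it assumes the PdfPig NuGet package is referenced in a project that supports top-level statements. It is not how UmbracoExamine.PDF is implemented, it just opens a stream with PdfPig and reads the text of every page, which is roughly what the stack trace shows happening.

```csharp
using System;
using System.IO;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Placeholder path (an assumption): pass the failing PDF as the first argument,
// or replace the default file name below.
string path = args.Length > 0 ? args[0] : "failing-file.pdf";

try
{
    using FileStream stream = File.OpenRead(path);

    // Open the document from a stream with PdfPig's default parsing options.
    using PdfDocument document = PdfDocument.Open(stream);

    foreach (Page page in document.GetPages())
    {
        // page.Text is the text PdfPig extracted for this page.
        Console.WriteLine($"Page {page.Number}: {page.Text.Length} characters");
    }
}
catch (Exception ex)
{
    // If the same exception is printed here, the problem reproduces
    // without Umbraco and can be reported to PdfPig with the file attached.
    Console.WriteLine(ex);
}
```

Running this once against your current PdfPig version and once after upgrading would also show whether the upgrade fixes it.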

You also haven't mentioned which version of UmbracoExamine.PDF you are using.

@bergmania I think you can close this since this isn't related to Umbraco code.

umbraco / UmbracoExamine.PDF

Could not extract text from PDF #42