unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)
https://unidoc.io

[BUG] extractPageText crashes interface conversion core.PdfObject is *core.PdfObjectInteger, not *core.PdfObjectName #312

Closed Kleissner closed 4 years ago

Kleissner commented 4 years ago

Description

We've experienced the following crash:

panic: interface conversion: core.PdfObject is *core.PdfObjectInteger, not *core.PdfObjectName

github.com/unidoc/unipdf/extractor.(*Extractor).extractPageText.func1(0xc02a5378c0, 0xf34580, 0x18adb50, 0xf34640, 0x18adb50, 0xcf6e60, 0xc025ce21a8, 0xd2c520, 0xc046af26e0, 0x3ff0000000000000, ...)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text.go:297 +0x3c5c
github.com/unidoc/unipdf/contentstream.(*ContentStreamProcessor).Process(0xc02692f770, 0xc029cd43f0, 0x0, 0x0)
    C:/Temp/Go/src/github.com/unidoc/unipdf/contentstream/processor.go:277 +0x4fe
github.com/unidoc/unipdf/extractor.(*Extractor).extractPageText(0xc02a466740, 0xc02a610000, 0xa771, 0xc029cd43f0, 0x0, 0x21307b0, 0x0, 0xc02a610000, 0xc00018b180, 0x21307b0)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text.go:336 +0x60a
github.com/unidoc/unipdf/extractor.(*Extractor).ExtractPageText(0xc02a466740, 0x40e8c4, 0xd84fa0, 0xc02a5363f0, 0xc02692f950, 0xb68e2b)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text.go:45 +0x5c
github.com/unidoc/unipdf/extractor.(*Extractor).ExtractTextWithStats(0xc02a466740, 0xc02a466740, 0x0, 0x0, 0x0, 0xc015f266c0, 0x0)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text.go:36 +0x39
github.com/unidoc/unipdf/extractor.(*Extractor).ExtractText(...)
    C:/Temp/Go/src/github.com/unidoc/unipdf/extractor/text.go:29

This seems to be the affected code:

https://github.com/unidoc/unipdf/blob/971a00f1eda292f1401813cf774ec163a52cbd65/extractor/text.go#L295-L297

Expected Behavior

It shouldn't crash.

If you need the PDF that triggers this crash, let me know, but it would take me a couple of hours, since I'd have to go through the logs manually to find it.

Solution

Use a checked type assertion?

    // The two-value form reports a type mismatch through ok instead of panicking.
    name, ok := op.Params[0].(*core.PdfObjectName)
    if !ok {
        break
    }

Interestingly, all the other cases in the switch statement use the checked type assertion; only this one doesn't.
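To illustrate the difference outside the library, here is a minimal, self-contained sketch. The types `PdfObject`, `PdfObjectName` and `PdfObjectInteger` below are simplified stand-ins that only borrow the names from the panic message, and `firstOperandName` is a hypothetical helper, not part of unipdf. It shows how the two-value assertion turns an unexpected operand type into a skipped operation rather than a crash.

```go
package main

import "fmt"

// Simplified stand-ins for the types named in the panic message.
// These are assumptions for illustration, not the unipdf definitions.
type PdfObject interface{}

type PdfObjectName string
type PdfObjectInteger int64

// firstOperandName is a hypothetical helper. It returns the first operand
// as a name, or false if the operand is missing, nil, or of another type.
func firstOperandName(params []PdfObject) (PdfObjectName, bool) {
	if len(params) == 0 {
		return "", false
	}
	// Checked assertion: ok is false on a type mismatch, so no panic occurs.
	name, ok := params[0].(*PdfObjectName)
	if !ok || name == nil {
		return "", false
	}
	return *name, true
}

func main() {
	good := PdfObjectName("F1")
	bad := PdfObjectInteger(12)

	for _, params := range [][]PdfObject{{&good}, {&bad}, {}} {
		if name, ok := firstOperandName(params); ok {
			fmt.Println("name operand:", name)
		} else {
			fmt.Println("skipping malformed operand")
		}
	}
}
```

With the single-value form, `*params[0].(*PdfObjectName)`, the second input would panic with the same kind of "interface conversion" message as in the trace above.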

gunnsth commented 4 years ago

@Kleissner Thanks for reporting. I think we actually fixed this recently in PR #305. See https://github.com/unidoc/unipdf/pull/305/files#diff-548b62b50679e083838dc4638c51875fR307 for the change. The fix is on the development branch and will be included in the next release, early next week. It would be great if you could test it. Otherwise, it would be great if you could submit a problematic file for checking.

Kleissner commented 4 years ago

Yep, that PR fixes it. Closing this issue.