rsc / pdf

PDF reader
BSD 3-Clause "New" or "Revised" License

panic on some PDFs + suspect memory leak #9

Open mark-summerfield opened 7 years ago

mark-summerfield commented 7 years ago

I have the following Go program that uses this library:

package main

import (
    "fmt"
    "os"
    "strconv"
    "rsc.io/pdf"
)

func main() {
    if len(os.Args) < 2 || os.Args[1] == "-h" || os.Args[1] == "--help" {
        fmt.Println("usage: pdfpage file.pdf [pnum]")
        os.Exit(1)
    }
    reader, err := pdf.Open(os.Args[1])
    if err != nil {
        fmt.Println(err)
        os.Exit(2)
    }
    if len(os.Args) == 3 {
        var pnum int
        var err error
        if pnum, err = strconv.Atoi(os.Args[2]); err != nil {
            pnum = 1
        }
        fmt.Printf("PAGE %d\n", pnum)
        printPage(reader, pnum)
    } else {
        for pnum := 1; pnum <= reader.NumPage(); pnum++ {
            fmt.Printf("PAGE %d\n", pnum)
            printPage(reader, pnum)
            fmt.Println("")
        }
    }
}

func printPage(reader *pdf.Reader, pnum int) {
    page := reader.Page(pnum)
    if page.V.IsNull() {
        fmt.Printf("failed to read page %d\n", pnum)
        os.Exit(3)
    }
    for _, chunk := range page.Content().Text {
        fmt.Printf("x=%06.2f y=%06.2f w=%06.2f %q %s %.1fpt\n",
            chunk.X, chunk.Y, chunk.W, chunk.S, chunk.Font,
            chunk.FontSize)
    }
}

This builds and runs fine, and for many PDFs it gives the expected output (although it is rather slow). However, I have a few PDFs that produce a panic:

PAGE 1
panic: malformed PDF: reading at offset 0: stream not present

goroutine 1 [running]:
rsc.io/pdf.(*buffer).errorf(0xc4200d3948, 0x507f70, 0x27, 0xc4200d36d0, 0x2, 0x2)
    /home/mark/app/go/src/rsc.io/pdf/lex.go:82 +0x74
rsc.io/pdf.(*buffer).reload(0xc4200d3948, 0x8)
    /home/mark/app/go/src/rsc.io/pdf/lex.go:95 +0x193
rsc.io/pdf.(*buffer).readByte(0xc4200d3948, 0x599da0)
    /home/mark/app/go/src/rsc.io/pdf/lex.go:71 +0x69
rsc.io/pdf.(*buffer).readToken(0xc4200d3948, 0xc42000aca0, 0x1000)
    /home/mark/app/go/src/rsc.io/pdf/lex.go:135 +0x4a
rsc.io/pdf.Interpret(0xc42006e060, 0x37, 0x4d78a0, 0xc42000ab60, 0xc4200d3b08)
    /home/mark/app/go/src/rsc.io/pdf/ps.go:64 +0x1c6
rsc.io/pdf.Page.Content(0xc42006e060, 0x37, 0x4db2e0, 0xc420014810, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/mark/app/go/src/rsc.io/pdf/page.go:613 +0x326
main.printPage(0xc42006e060, 0x1)
    /home/mark/app/go/src/pdfpage2/main.go:47 +0xa8
main.main()
    /home/mark/app/go/src/pdfpage2/main.go:35 +0x25d

I also have a 647-page PDF for which the program outputs the first 22 pages, then prints PAGE 23 and just sits there eating memory and using ~25% CPU. That particular page has some Japanese characters, but I don't know whether they are Unicode text or paths.

anacrolix commented 7 years ago

I get the same error on page 1 of a PDF that's been "optimized".

anacrolix commented 7 years ago

This is the file that causes it: Species Present Report_Apr May Jun 2016.pdf

wayi1 commented 6 years ago

I got the same problem.

frontmill commented 6 years ago

I am getting this with every single pdf-file.

asticode commented 6 years ago

Hey guys,

I was having the same problem, but after trying out another library I realized that my PDF file had a protection that prevented this library from extracting data.

Converting the file with PDFCreator removed the protection, and after that I could read the pages.

Hope it helps someone.

Cheers

bigzhu commented 5 years ago

I also have a lot of PDFs that throw the "stream not present" error, but if I open such a PDF in the Mac "Preview" app and choose "Export as PDF...", the newly exported file can be opened and read fine.

Maybe the PDF just needs to be opened and re-saved by some PDF software to fix this error?

Using Automator to batch-convert the PDFs works perfectly.

florin0x01 commented 4 years ago

Is the library still maintained? I've come to the conclusion that it does not support PDF versions greater than 1.2 (so things like LZW compression, linearization and so on). Please reply. If it is not maintained, what is a good alternative?