smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Unable to parse PDF bookmarks (Outlines) #236

Open codingWWW opened 5 years ago

codingWWW commented 5 years ago

I am able to parse the PDF metadata and pages, but I need the PDF bookmarks (outlines) with their page numbers. Can you please help me get them? I searched everywhere and read the docs as well, but did not find anything useful.

rw152 commented 4 years ago

Hey, I was going to submit a PR and go through the hoops regarding testing, but since it doesn't seem like PRs are actively monitored and since I'm not familiar with testing with Atoum, I'm not going to go through the trouble.

Anyways, this little function I wrote seems to do the trick. I needed this functionality for my own testing purposes.

    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($path);
    $bookmarks = $this->extractBookmarks($pdf);

    /**
     * Collect the 'Title' entry of every object that has one.
     *
     * @param $pdf
     * @return array
     */
    private function extractBookmarks($pdf)
    {
        $bookmarks = [];
        foreach ($pdf->getObjects() as $obj) {
            $details = $obj->getHeader()->getDetails();
            if (isset($details['Title'])) {
                $bookmarks[] = $details['Title'];
            }
        }
        return $bookmarks;
    }

This will just return an array of the bookmark names. Hope this helps get you started.

kobs30 commented 4 years ago

Hi, but I don't see the target pages for the bookmarks in the Details array.

isavepak commented 3 years ago

True, how can we get the page numbers of bookmarks and page links? Can someone reply?

isavepak commented 3 years ago

Please help getting the page numbers of bookmarks and page links.

k00ni commented 3 years ago

@rw152: if you are still willing to contribute, we introduced PHPUnit a while ago. Your function seems useful so as one of the maintainers I can assist you here.

@isavepak: regarding "Please help getting the page numbers of bookmarks and page links.": if I remember correctly, it is currently not possible.

erichhaemmerle commented 3 years ago

I too am in need of this, but it looks like the data is just not there. One question, though: I used the code above successfully to grab the bookmark titles, but the returned array is not in the order in which the bookmarks appear in the document. The order seems almost random; it is not even alphabetical. Does anyone know of a way to get the bookmarks in the same order they appear in the document?
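On the ordering question: in the PDF format itself, outline items form a tree linked through /First (first child) and /Next (next sibling) entries, so document order comes from following those links rather than from the order in which objects are listed. The sketch below shows only that traversal, on plain PHP arrays. The `$objects` layout, the ids, and `walkOutline()` are hypothetical stand-ins and are not part of pdfparser's API.

```php
<?php
// Hypothetical stand-in for parsed PDF objects, keyed by object id and
// deliberately listed out of order. 'First' holds the id of the first child,
// 'Next' the id of the next sibling (PDF outline entries /First and /Next).
$objects = [
    '7_0' => ['Title' => 'Chapter 2'],
    '3_0' => ['First' => '5_0'], // outline root, has no Title
    '5_0' => ['Title' => 'Chapter 1', 'First' => '6_0', 'Next' => '7_0'],
    '6_0' => ['Title' => 'Section 1.1'],
];

/**
 * Walk outline items depth-first: each item, then its children, then its
 * next sibling. Returns a flat list of ['label' => ..., 'depth' => ...].
 */
function walkOutline(array $objects, ?string $id, int $depth = 0, array &$seen = []): array
{
    $out = [];
    while ($id !== null && isset($objects[$id]) && !isset($seen[$id])) {
        $seen[$id] = true; // guard against cyclic links in a broken file
        $item = $objects[$id];
        $out[] = ['label' => $item['Title'] ?? '', 'depth' => $depth];
        // Descend into the children before moving on to the next sibling.
        $children = walkOutline($objects, $item['First'] ?? null, $depth + 1, $seen);
        $out = array_merge($out, $children);
        $id = $item['Next'] ?? null;
    }
    return $out;
}

// Start at the first child of the outline root.
$ordered = walkOutline($objects, $objects['3_0']['First']);
```

Whether the same links can be followed on pdfparser objects is not shown here and would need to be checked against the library.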

josh-gaby commented 3 weeks ago

I know this is an old issue, but I've just run into this problem and found that I could get the page number for each bookmark by modifying the earlier function as below:

private function extractBookmarks($pdf){
    $bookmarks = [];
    foreach ($pdf->getObjects() as $obj){
        $header = $obj->getHeader();
        $details = $header->getDetails();
        $elements = $header->getElements();
        // Keep only objects that have both a title and a destination.
        if (isset($details['Title']) && isset($elements['Dest'])) {
            // Assumes the first entry of 'Dest' is the target page object.
            $page_no = $elements['Dest']->getContent()[0]->getPageNumber();
            $bookmarks[] = ['label' => $details['Title'], 'page' => $page_no];
        }
    }
    return $bookmarks;
}

It may be better to change $bookmarks[] = ['label' => $details['Title'], 'page' => $page_no]; to $bookmarks[$details['Title']] = $page_no; so that the output is keyed by the bookmark title and the value is the page number.
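To illustrate the trade-off between the two output shapes, here is a small sketch on hypothetical sample data. The titles and page numbers are made up, and no pdfparser calls are involved. Keying by title is convenient for lookups, but two bookmarks that share a title collapse into a single entry.

```php
<?php
// Hypothetical sample in the list shape: one ['label', 'page'] pair per bookmark.
$bookmarks = [
    ['label' => 'Introduction', 'page' => 1],
    ['label' => 'Summary', 'page' => 4],
    ['label' => 'Summary', 'page' => 9], // same title, different page
];

// Title-keyed shape: array_column() uses 'label' as the key and 'page' as
// the value, so a later duplicate title overwrites the earlier one.
$byTitle = array_column($bookmarks, 'page', 'label');

// The list shape keeps all three bookmarks; the keyed shape keeps only two.
$listCount = count($bookmarks);
$keyedCount = count($byTitle);
```

If duplicate titles are possible in your documents, the list shape is the safer default, and the keyed shape can still be derived from it when needed.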