smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Exception Missing catalog #32

Open geraldmwanyika opened 10 years ago

geraldmwanyika commented 10 years ago

I am getting an exception when trying to parse a PDF:

Exception

Missing catalog.

        return $pages;
    } elseif (isset($this->dictionary['Page'])) {
        // Search for 'page' (unordered pages).
        $pages = $this->getObjectsByType('Page');
        return array_values($pages);
    } else {
        throw new \Exception('Missing catalog.');
    }
}

DanielRuf commented 10 years ago

Can you show the PDF file?

jaberu commented 8 years ago

I have a similar problem.

Here you can find a broken pdf: https://github.com/jaberu/PhpOfficeUtils/blob/master/test/resources/missing-catalog.pdf

SiroDiaz commented 7 years ago

I have the same problem parsing a PDF. The exception is thrown on line 246 of src\Smalot\PdfParser\Document.php with the message "Missing catalog."

AndyABX commented 7 years ago

I have the same trouble when parsing with pdfparser. Fatal error: Uncaught exception 'Exception' with message 'Missing catalog.'

It comes from the last line of getPages() in Document.php: throw new \Exception('Missing catalog.');

So, does anybody know of a workaround?

drexlma commented 7 years ago

Same Problem

renaudham commented 7 years ago

Hi

Me too, but only with one long and heavy PDF. Others, even ones with multiple pages, work fine.

DanielRuf commented 7 years ago

@renaudham it would be great to have a test file that throws this exception.

renaudham commented 7 years ago

Hi

Here is one attached: test.pdf

I have analyzed Document.php, and it seems that the dictionary ends up empty because no Type is returned here:

protected function buildDictionary()
{
    // Build dictionary.
    $this->dictionary = array();

    foreach ($this->objects as $id => $object) {
        $type = $object->getHeader()->get('Type')->getContent();

        // var_dump($type);
        if (!empty($type)) {
            $this->dictionary[$type][$id] = $id;
        }
    }
}

So for my usage I simply modified a few lines:

public function getPages()
{
    ...
    return array(); // added: the throw below is no longer reached
    throw new \Exception('Missing catalog.');
    ...

public function getText(Page $page = null)
{
    ...
    if (count($pages) == 0) {
        return false;
    }

That way, when I call it on a document with this issue (no types), I receive false, which I can use to switch to a different treatment for unparsable, no-content files (as of course I will get zero content from this PDF).

Thanks

renaudham commented 7 years ago

In the loop foreach ($this->objects as $id => $object) { $type = $object->getHeader() ... (I cut the rest, which gets the type and content), getHeader() only returns object elements.

It seems that there is no "Type", but also that zero content is extracted into $this->objects.

tim-peterson commented 6 years ago

I posted this as a question on Stack Overflow to see if we could get some help.

https://stackoverflow.com/questions/48173527/continue-a-script-after-an-exception-is-thrown-php

The answer for me was to make sure my Exception was namespaced in my try/catch block.

So this: catch (\Exception $e), instead of catch (Exception $e).
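
A minimal sketch of that catch pattern, written as a wrapper around any parsing call. The helper name textOrNull is hypothetical and not part of pdfparser. The leading backslash in \Exception always refers to PHP's global Exception class, so the catch also matches when the code sits inside a namespace, where a bare Exception would be resolved relative to that namespace.

```php
<?php
/**
 * Hypothetical helper (not part of pdfparser): runs $parse and returns its
 * result as a string, or null if the call throws.
 */
function textOrNull(callable $parse): ?string
{
    try {
        return (string) $parse();
    } catch (\Exception $e) {
        // \Exception is the global class. Inside "namespace App;" a bare
        // "Exception" would mean App\Exception and would not match here.
        return null;
    }
}

// Possible usage with the library (the path is a placeholder):
// $text = textOrNull(function () {
//     return (new \Smalot\PdfParser\Parser())->parseFile('document.pdf')->getText();
// });
```

This only lets the calling script continue. It does not make the affected PDF parseable, so a null result still needs a fallback path.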

DanielRuf commented 6 years ago

It is already namespaced and not directly related to the pdfparser bug.

aavrug commented 6 years ago

I am facing the same issue when trying to upload multiple PDFs, even though I have specified the namespace properly in my script.

DanielRuf commented 6 years ago

The namespace is not the problem but the parsing of the PDF.

DanielRuf commented 6 years ago

Also please provide the full error/exception + stacktrace that you get.

aavrug commented 6 years ago

Here you can see the stack trace. https://gist.github.com/aavrug/ee26ebc55f618b8bc93823df23470a51

barrychapman commented 6 years ago

I am having this same problem. Is there any chance of a fix?

djlift commented 6 years ago

I have the same issue. I can tell that if I open the file (on a mac) in Preview.app and save the pdf out and try to parse it again, then it works fine.

DanielRuf commented 6 years ago

So it is related to the tool that generates the PDF. I guess this depends on the used PDF version.

Can you check with Adobe Reader or some other tools which PDF version is used in both cases @djlift?

djlift commented 6 years ago

I just confirmed that I had a v1.6 file; after saving it down to v1.3, I no longer experience the issue.

DanielRuf commented 6 years ago

So it is still an issue with the data structure of newer PDF versions.

djlift commented 6 years ago

I would imagine, yes. I don't really know much about the structures or differences of the different versions unfortunately.

qlstorm commented 6 years ago

Same issue. What do we need to fix this?

alejandr0 commented 6 years ago

@smalot Is there any news about a fix for this issue?

Thank you so much!

Pesche007 commented 6 years ago

Same here, many thanks in advance for this great library.

hnk15 commented 5 years ago

Hello Everyone!

I was also facing the same "Missing catalog" fatal error.

I tried wrapping the code in try, throw and catch, and now I no longer get the fatal error. Below is the code where I applied it:

public function getPages()
{
    try {
        if (isset($this->dictionary['Catalog'])) {
            // Search for catalog to list pages.
            $id = reset($this->dictionary['Catalog']);

            /** @var Pages $object */
            $object = $this->objects[$id]->get('Pages');
            if (method_exists($object, 'getPages')) {
                $pages = $object->getPages(true);
                return $pages;
            }
        }

        if (isset($this->dictionary['Pages'])) {
            // Search for pages to list kids.
            $pages = array();

            /** @var Pages[] $objects */
            $objects = $this->getObjectsByType('Pages');
            foreach ($objects as $object) {
                $pages = array_merge($pages, $object->getPages(true));
            }

            return $pages;
        }

        if (isset($this->dictionary['Page'])) {
            // Search for 'page' (unordered pages).
            $pages = $this->getObjectsByType('Page');

            return array_values($pages);
        }

        throw new \Exception('Missing catalog.');
    } catch (\Exception $e) {
        // No pages could be found: return an empty list instead of failing,
        // so callers always receive an array.
        return array();
    }
}

Best of luck!!

DanielRuf commented 5 years ago

@hnk15 can you provide these changes as patch file?

hnk15 commented 5 years ago

@DanielRuf what do you mean by patch file?

DanielRuf commented 5 years ago

A patch file is a file used by patch tools to change files based on a diff.

See https://patch-diff.githubusercontent.com/raw/smalot/pdfparser/pull/224.patch

vnbenny commented 5 years ago

Yeah come on @hnk15 , I also have this bug, more of us could make great use of your patch.

usabilitest commented 5 years ago

I just checked, and the fix suggested by @hnk15 has not been added to the code yet; however, I tested it and it resolved my issue. If you still have the missing catalog fatal error and are not sure what to do, download the file Font.php from your server. If you used Composer to install the package, it will most likely be in your vendor directory, something like this: vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php

Open the file in your editor and replace line 259, which is: $part = pack('H*', $part); with the following line of code: $part = pack('H*', str_replace(' ', '', sprintf('%u', CRC32($part))));

Save the changes and re-upload the file to the same location.

DanielRuf commented 5 years ago

You can use composer-patches for example which makes it easier to apply it =)

ivanwitzke commented 5 years ago

Even with the patch provided, it's not working for me... The file I'm trying to parse is 67 MB.

PHP Fatal error: Uncaught Exception: Missing catalog. in APP_FOLDER/vendor/smalot/pdfparser/src/Smalot/PdfParser/Document.php:248

DanielRuf commented 5 years ago

Is the PDF encrypted and which PDF standard version is used @ivanwitzke?

ivanwitzke commented 5 years ago

The file is not encrypted. How can I see which standard it uses?

DanielRuf commented 5 years ago

You can open it in your PDF reader and display the file metadata which should also show if it is 1.7 or another one (like X/4). See https://github.com/smalot/pdfparser/issues/32#issuecomment-367724565 and https://github.com/smalot/pdfparser/issues/32#issuecomment-368133629
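
As a rough alternative to opening a PDF reader, the version can usually be read from the file header, which begins with a marker such as %PDF-1.7. The sketch below assumes that marker appears within the first 1024 bytes. The function name pdfHeaderVersion is made up for illustration and is not part of pdfparser.

```php
<?php
/**
 * Hypothetical helper: returns the version from a PDF header such as
 * "%PDF-1.7", or null if no header is found in the first 1024 bytes.
 */
function pdfHeaderVersion(string $bytes): ?string
{
    $head = substr($bytes, 0, 1024);
    if (preg_match('/%PDF-(\d+\.\d+)/', $head, $m) === 1) {
        return $m[1];
    }

    return null;
}

// Possible usage with a file on disk (the path is a placeholder):
// $head = (string) file_get_contents('document.pdf', false, null, 0, 1024);
// $version = pdfHeaderVersion($head);
```

Note that a document's catalog may carry its own Version entry that overrides the header, so treat the result as a hint, not a definitive answer.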

ivanwitzke commented 5 years ago

It shows "PDF-1.7"... so the solution for now is to save as a lower version, correct?

DanielRuf commented 5 years ago

Exactly.

adjenks commented 5 years ago

@DanielRuf If you have a fix, can you create a pull request with the edits to save @smalot some work?

DanielRuf commented 5 years ago

Hi @adjenks,

I have no time to create a patch or PR. Some people presented a few solutions here so feel free to pick these solutions, test them and create a PR.

adjenks commented 5 years ago

No problem. Thank you for getting back to me. Perhaps I will make a PR when I find some time and another pdf that throws the same error.

tacituseu commented 4 years ago

Encountered in a "PDF-1.7" file with Producer "Microsoft: Print To PDF". The reason is an off-by-one error in the file's cross-reference table, which prevents it from getting past the sanity check in TCPDF_PARSER::getIndirectObject().

How to verify:

  1. Search for the "xref" line near the end of the file. It should look something like this:
    xref
    0 51
    0000000000 65535 f
    0000106036 00000 n
    ...
    0000106085 00000 n
  2. Take the offset from the first zero-padded column of the second (or any other) entry ending with n and look it up in the file. It should point to the start of a string like "1 0 obj".
  3. Notice that it instead points at the newline character just before it: "\n1 0 obj".

Ref: section 3.4.3 "Cross-Reference Table" of pdf_reference_1-7.pdf
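
The manual check above could be automated roughly as follows. This is only a sketch under simplifying assumptions: it reads the first classic "xref" table it finds and ignores cross-reference streams and incremental updates. The function name findMisalignedXrefEntries is hypothetical and belongs to neither pdfparser nor TCPDF.

```php
<?php
/**
 * Hypothetical checker: returns [object number => offset] for every in-use
 * xref entry whose offset does not point exactly at "<num> <gen> obj".
 */
function findMisalignedXrefEntries(string $pdf): array
{
    $bad = [];

    // Find the "xref" keyword, skipping the "startxref" keyword.
    if (preg_match('/(?<!start)xref\s+/', $pdf, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return $bad;
    }
    $pos = $m[0][1] + strlen($m[0][0]);

    // Each subsection starts with "<first object number> <entry count>".
    while (preg_match('/\G(\d+) (\d+)\s+/', $pdf, $h, 0, $pos) === 1) {
        $pos += strlen($h[0]);
        $first = (int) $h[1];
        $count = (int) $h[2];

        for ($i = 0; $i < $count; $i++) {
            // Entry layout: 10-digit offset, 5-digit generation, n (in use) or f (free).
            if (preg_match('/\G(\d{10}) (\d{5}) ([nf])\s*/', $pdf, $e, 0, $pos) !== 1) {
                return $bad; // malformed table: stop here
            }
            $pos += strlen($e[0]);
            if ($e[3] !== 'n') {
                continue; // free entries have no object to check
            }

            $offset = (int) $e[1];
            $expected = ($first + $i) . ' ' . (int) $e[2] . ' obj';
            if (substr($pdf, $offset, strlen($expected)) !== $expected) {
                $bad[$first + $i] = $offset;
            }
        }
    }

    return $bad;
}

// Possible usage (the path is a placeholder):
// print_r(findMisalignedXrefEntries((string) file_get_contents('document.pdf')));
```

An empty result means every checked offset lands on its object header. A non-empty result lists the objects affected by the kind of misalignment described above.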

Fix/workaround:

--- a/tcpdf/tcpdf_parser.php    2020-02-14 15:20:12 +0100
+++ b/tcpdf/tcpdf_parser.php    2020-04-05 18:55:59 +0200
@@ -682,7 +682,7 @@ class TCPDF_PARSER {
        }
        $objref = $obj[0].' '.$obj[1].' obj';
        // ignore leading zeros
-       $offset += strspn($this->pdfdata, '0', $offset);
+       $offset += strspn($this->pdfdata, "0\n", $offset);
        if (strpos($this->pdfdata, $objref, $offset) != $offset) {
            // an indirect reference to an undefined object shall be considered a reference to the null object
            return array('null', 'null', $offset);

DanielRuf commented 4 years ago

@tacituseu thanks for the analysis and workaround.

It would be great if you could provide a PR.

tacituseu commented 4 years ago

@DanielRuf: It is unlikely to be a generic solution. The issue is not in pdfparser but in the library it depends on, and even then it is not really the library's fault but that of non-compliant generators. There are likely more types of whitespace characters that could appear there, and many other places along the way where they could cause trouble. I checked https://github.com/tecnickcom/TCPDF/issues beforehand and there is no open issue for it there, hence I posted here. I am not willing to commit to working on a generic solution.

DanielRuf commented 4 years ago

That is very unfortunate, because we could patch this using composer-patches.

The issue started with 1.7 files and, as far as I know, is also caused by Adobe PDF creators.

Chandlr commented 11 months ago

"Exception Missing catalog" is still a bug in [v2.8.0-RC2].

Chandlr commented 8 months ago

"Exception Missing catalog" is still a bug in [v2.8.0-RC2] and in v2.9.0. I'm including an 8.85 KB PDF file that triggers the "PDF Problem: Missing catalog." error: mangs.pdf. Hope it will help :)

GreyWyvern commented 8 months ago

"Exception Missing catalog" still in bug in [v2.8.0-RC2] and in v2.9.0 i'll include a 8,85kb pdf file regarding the "PDF Problem: Missing catalog.".

Thanks for this. It looks like this document is password encrypted, so calling setIgnoreEncryption(true) is required before the 'missing catalog' error even appears. So I'm not sure it's precisely the same issue. Maybe it is? Regardless, the output is gobbledegook.

The earlier test document from https://github.com/smalot/pdfparser/issues/32#issuecomment-310410108 now parses correctly. However, the one from https://github.com/smalot/pdfparser/issues/32#issuecomment-236376751 still displays the missing catalogue error. Edit: This second PDF is a scan of a document and so contains no text, however it should at least have a "Page" or "Pages" object which would prevent this error from appearing... Shouldn't it?

I'll dig into this and see what I can find out.

G43beli commented 1 month ago

@GreyWyvern I can confirm that this happens on PDFs which are scans and have no text