Open geraldmwanyika opened 10 years ago
Can you show the PDF file?
I have a similar problem.
Here you can find a broken pdf: https://github.com/jaberu/PhpOfficeUtils/blob/master/test/resources/missing-catalog.pdf
I have the same problem parsing a PDF. The exception is thrown in the line 246 inside the file src\Smalot\PdfParser\Document.php with the message Missing catalog.
I have the same trouble when parsing with pdfparser: Fatal error: Uncaught exception 'Exception' with message 'Missing catalog.'
It comes from the last line of getPages() in Document.php: throw new \Exception('Missing catalog.');
So, does anybody know of a workaround?
Same Problem
Hi
Me too, but only with one long and heavy PDF. Others, even multi-page ones, work fine.
@renaudham it would be great to have a test file which throws this exception.
Hi
here is one attached.
I have analyzed Document.php and it seems that the dictionary comes back empty because no Type is returned here:
    protected function buildDictionary()
    {
        // Build dictionary.
        $this->dictionary = array();

        foreach ($this->objects as $id => $object) {
            $type = $object->getHeader()->get('Type')->getContent();
            // var_dump($type);
            if (!empty($type)) {
                $this->dictionary[$type][$id] = $id;
            }
        }
    }
So for my usage I simply modified a few lines:
    public function getPages() { ... return array(); throw new \Exception('Missing catalog.'); ...
    public function getText(Page $page = null)
    {
    ... if (count($pages) == 0) { return false; }
That way, when I parse a document with this issue (no types), I receive false, which I can use to switch to a separate code path for unparsable files (since I will of course get zero content from such a PDF).
thanks
In the loop foreach ($this->objects as $id => $object) { $type = $object->getHeader() .... (I cut the rest to get only the type and content), getHeader only returns object elements.
It seems that not only is there no "Type", but zero content is extracted into $this->objects as well.
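To make that failure mode concrete, here is a minimal, self-contained sketch of the same idea. It is not the library's code: objects are modelled as plain arrays (an assumption for illustration), and `buildTypeIndex` and `findPageSource` are hypothetical names. It shows why a document whose objects expose no `Type` ends up with an empty dictionary, so none of the Catalog, Pages, or Page lookups in getPages() can succeed.

```php
<?php
// Hypothetical stand-in for Document::buildDictionary(): index object ids by
// their 'Type' header. Objects are plain arrays here, not library classes.
function buildTypeIndex(array $objects): array
{
    $index = [];
    foreach ($objects as $id => $header) {
        $type = $header['Type'] ?? '';
        if ($type !== '') {
            $index[$type][$id] = $id;
        }
    }
    return $index;
}

// Tries Catalog, then Pages, then Page. Returns the first type that is present,
// or null in the case where the library would throw 'Missing catalog.'.
function findPageSource(array $index): ?string
{
    foreach (['Catalog', 'Pages', 'Page'] as $type) {
        if (isset($index[$type])) {
            return $type;
        }
    }
    return null;
}

$healthy = ['1_0' => ['Type' => 'Catalog'], '2_0' => ['Type' => 'Pages']];
$broken  = ['1_0' => [], '2_0' => ['Length' => '42']]; // no object exposes a Type

var_dump(findPageSource(buildTypeIndex($healthy))); // string(7) "Catalog"
var_dump(findPageSource(buildTypeIndex($broken)));  // NULL
```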
I added this as a question on Stackoverflow to see if we could get some help.
https://stackoverflow.com/questions/48173527/continue-a-script-after-an-exception-is-thrown-php
The answer for me was to make sure my Exception was namespaced in my try/catch block.
So this: catch (\Exception $e), instead of catch (Exception $e).
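A small sketch of why the leading backslash matters. The namespace `App\Demo` and the functions `parseOrFail` and `tryParse` are hypothetical, and the throw merely stands in for a failing parser call. Inside a namespaced file, an unqualified `Exception` resolves relative to the current namespace, so a catch block written that way would not match the global exception class.

```php
<?php
namespace App\Demo; // hypothetical namespace, for illustration only

function parseOrFail(): void
{
    // Stand-in for a parser call that fails the way this issue describes.
    throw new \Exception('Missing catalog.');
}

function tryParse(): string
{
    try {
        parseOrFail();
        return 'parsed';
    } catch (\Exception $e) {
        // The leading backslash refers to PHP's global Exception class.
        // A bare `Exception` here would resolve to App\Demo\Exception,
        // which does not exist, so the block would never match.
        return 'caught: ' . $e->getMessage();
    }
}

echo tryParse(), PHP_EOL; // caught: Missing catalog.
```

Adding `use Exception;` at the top of the file would be an equivalent way to make the short name refer to the global class.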
It is already namespaced and not directly related to the pdfparser bug.
I am facing the same issue when trying to upload multiple PDFs, even though my script declares the namespace properly.
The namespace is not the problem but the parsing of the PDF.
Also please provide the full error/exception + stacktrace that you get.
Here you can see the stack trace. https://gist.github.com/aavrug/ee26ebc55f618b8bc93823df23470a51
I am having this same problem. Is there any chance of a fix?
I have the same issue. I can tell that if I open the file (on a mac) in Preview.app and save the pdf out and try to parse it again, then it works fine.
So it is related to the tool that generates the PDF. I guess this depends on the used PDF version.
Can you check with Adobe Reader or some other tools which PDF version is used in both cases @djlift?
I just confirmed that I had a v1.6 file; after saving it down to v1.3, I no longer experience the issue.
So it is still an issue with the data structure of newer PDF versions.
I would imagine, yes. I don't really know much about the structures or differences of the different versions unfortunately.
Same issue. What do we need to fix this?
@smalot Is there any news about a fix for this issue?
Thank you so much!
Same here, many thanks in advance for this great library.
Hello Everyone!
I was also facing the same problem of missing catalog fatal error.
I tried try, throw and catch, and now I no longer get the missing catalog fatal error. Below is the code where I applied try, throw and catch:
`public function getPages()
{
    try {
        if (isset($this->dictionary['Catalog'])) {
            // Search for catalog to list pages.
            $id = reset($this->dictionary['Catalog']);
            /** @var Pages $object */
            $object = $this->objects[$id]->get('Pages');
            if (method_exists($object, 'getPages')) {
                $pages = $object->getPages(true);
                return $pages;
            }
        }
        if (isset($this->dictionary['Pages'])) {
            // Search for pages to list kids.
            $pages = array();
            /** @var Pages[] $objects */
            $objects = $this->getObjectsByType('Pages');
            foreach ($objects as $object) {
                $pages = array_merge($pages, $object->getPages(true));
            }
            return $pages;
        }
        if (isset($this->dictionary['Page'])) {
            // Search for 'page' (unordered pages).
            $pages = $this->getObjectsByType('Page');
            return array_values($pages);
        }
        throw new \Exception('Missing catalog.');
    } catch (\Exception $e) {
        // No catalog, pages, or page objects were found: return an empty list
        // so callers can iterate safely instead of receiving null.
        return array();
    }
}`
Best of luck!!
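An alternative that leaves the vendor code untouched is to catch the exception at the call site. The sketch below is only an illustration: `extractTextOrNull` is a hypothetical helper, and the two closures stand in for real parser calls so the example runs without the library installed.

```php
<?php
// Hypothetical helper: run any text-extraction callable and return null
// instead of letting a parser exception become a fatal error.
function extractTextOrNull(callable $extract): ?string
{
    try {
        return $extract();
    } catch (\Exception $e) {
        // Log and fall back; the caller decides how to treat unparsable files.
        error_log('PDF could not be parsed: ' . $e->getMessage());
        return null;
    }
}

// Stand-ins for real parser calls.
$ok = extractTextOrNull(function (): string {
    return 'Hello';
});
$failed = extractTextOrNull(function (): string {
    throw new \Exception('Missing catalog.');
});

var_dump($ok);     // string(5) "Hello"
var_dump($failed); // NULL
```

With pdfparser installed, the callable would presumably wrap something like `(new \Smalot\PdfParser\Parser())->parseFile($path)->getText()`, where `$path` is a placeholder. A null result then signals that the file needs different handling.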
@hnk15 can you provide these changes as patch file?
@DanielRuf what do you mean by patch file?
A patch file is a file used by patch tools to change files based on a diff.
See https://patch-diff.githubusercontent.com/raw/smalot/pdfparser/pull/224.patch
Yeah come on @hnk15 , I also have this bug, more of us could make great use of your patch.
I just checked, and the fix suggested by @hnk15 has not been added to the code yet; however, I tested it and it resolved my issue. If you're not sure what to do but still have the missing catalog fatal error, simply download the file Font.php from your server. If you used composer to install the package, it will most likely be located in your vendor directory, something like this: vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php
Open the file in your editor and replace the line 259, which is:
$part = pack('H*', $part);
with the following line of code:
$part = pack('H*', str_replace(' ', '', sprintf('%u', CRC32($part))));
Save the changes and re-upload the file to the same location.
You can use composer-patches for example which makes it easier to apply it =)
Even using the patch provided, it's not working for me... The file I'm trying to parse is 67 MB.
PHP Fatal error: Uncaught Exception: Missing catalog. in APP_FOLDER/vendor/smalot/pdfparser/src/Smalot/PdfParser/Document.php:248
Is the PDF encrypted and which PDF standard version is used @ivanwitzke?
The file is not encrypted. How to see what's the standard?
You can open it in your PDF reader and display the file metadata which should also show if it is 1.7 or another one (like X/4). See https://github.com/smalot/pdfparser/issues/32#issuecomment-367724565 and https://github.com/smalot/pdfparser/issues/32#issuecomment-368133629
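If no PDF reader is at hand, the version can also be read from the file's header line, which has the form `%PDF-1.7`. The following is a minimal sketch; `pdfHeaderVersion` is a hypothetical helper and the path in the trailing comment is a placeholder. Note that a document's catalog may override the header version through a `/Version` entry, and a conformance level such as PDF/X-4 is recorded in the metadata rather than in the header, so treat the result as a first hint only.

```php
<?php
// Hypothetical helper: extract the version from a PDF header such as "%PDF-1.7".
// It takes the leading bytes of the file so it can be tried without a real PDF.
function pdfHeaderVersion(string $leadingBytes): ?string
{
    // The header is expected near the start of the file; only look at the first 1024 bytes.
    if (preg_match('/%PDF-(\d+\.\d+)/', substr($leadingBytes, 0, 1024), $m) === 1) {
        return $m[1];
    }
    return null;
}

var_dump(pdfHeaderVersion("%PDF-1.7\n%\xE2\xE3\xCF\xD3\n")); // string(3) "1.7"
var_dump(pdfHeaderVersion("not a pdf"));                      // NULL

// For a file on disk (placeholder path):
// $version = pdfHeaderVersion((string) file_get_contents('/path/to/file.pdf', false, null, 0, 1024));
```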
It shows "PDF-1.7"... so the solution for now is to save as a lower version, correct?
Exactly.
@DanielRuf If you have a fix, can you create a pull request with the edits to save @smalot some work?
Hi @adjenks,
I have no time to create a patch or PR. Some people presented a few solutions here so feel free to pick these solutions, test them and create a PR.
No problem. Thank you for getting back to me. Perhaps I will make a PR when I find some time and another pdf that throws the same error.
Encountered in a "PDF-1.7" file with Producer: "Microsoft: Print To PDF". The reason is an off-by-one error in the file's "Cross-Reference Table", which keeps it from getting past the sanity check in TCPDF_PARSER::getIndirectObject().
How to verify:
xref
0 51
0000000000 65535 f
0000106036 00000 n
...
0000106085 00000 n
Ref: section 3.4.3 "Cross-Reference Table" of pdf_reference_1-7.pdf
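The check can also be scripted. The sketch below is a rough diagnostic, not part of TCPDF or pdfparser: `checkXrefOffsets` is a hypothetical function that reads the first classic xref subsection and reports, per in-use entry, whether `<num> <gen> obj` really starts at the declared offset. The synthetic buffer assumes the bad offset lands on the line break just before the object.

```php
<?php
// Hypothetical diagnostic: for each in-use entry of the first classic xref subsection,
// report whether "<num> <gen> obj" starts exactly at the declared byte offset.
function checkXrefOffsets(string $pdf): array
{
    // Find "xref" (but not "startxref") followed by the "<first> <count>" header.
    if (preg_match('/(?<!start)xref\s+(\d+)\s+(\d+)\s+/', $pdf, $head, PREG_OFFSET_CAPTURE) !== 1) {
        return [];
    }
    $first = (int) $head[1][0];
    $count = (int) $head[2][0];
    $entriesStart = $head[0][1] + strlen($head[0][0]);

    // Each entry looks like "0000000017 00000 n"; take at most $count of them.
    preg_match_all('/(\d{10}) (\d{5}) ([nf])/', substr($pdf, $entriesStart), $entries, PREG_SET_ORDER);

    $report = [];
    foreach (array_slice($entries, 0, $count) as $i => $entry) {
        if ($entry[3] !== 'n') {
            continue; // free entries do not point at an object
        }
        $expected = ($first + $i) . ' ' . (int) $entry[2] . ' obj';
        $report[$first + $i] = substr($pdf, (int) $entry[1], strlen($expected)) === $expected;
    }
    return $report;
}

// Tiny synthetic file: object 1 has a correct offset, object 2 is declared one byte early.
$body = "%PDF-1.7\n";
$off1 = strlen($body);
$body .= "1 0 obj\n<< /Type /Catalog >>\nendobj\n";
$off2 = strlen($body);
$body .= "2 0 obj\n<< /Type /Pages >>\nendobj\n";
$xref = sprintf("xref\n0 3\n%010d 65535 f \n%010d 00000 n \n%010d 00000 n \n", 0, $off1, $off2 - 1);

var_dump(checkXrefOffsets($body . $xref)); // object 1 => true, object 2 => false
```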
Fix/workaround:
--- a/tcpdf/tcpdf_parser.php 2020-02-14 15:20:12 +0100
+++ b/tcpdf/tcpdf_parser.php 2020-04-05 18:55:59 +0200
@@ -682,7 +682,7 @@ class TCPDF_PARSER {
}
$objref = $obj[0].' '.$obj[1].' obj';
// ignore leading zeros
- $offset += strspn($this->pdfdata, '0', $offset);
+ $offset += strspn($this->pdfdata, "0\n", $offset);
if (strpos($this->pdfdata, $objref, $offset) != $offset) {
// an indirect reference to an undefined object shall be considered a reference to the null object
return array('null', 'null', $offset);
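The effect of the changed `strspn` mask can be tried in isolation. This is a sketch under assumptions: `objectStartsAt` is a hypothetical helper that mirrors the two lines around the change, and the data string is a made-up fragment in which the declared offset points at the newline before the object.

```php
<?php
// Hypothetical helper mirroring the check in the diff: skip tolerated bytes at
// $offset, then test whether the object header starts exactly there.
function objectStartsAt(string $data, string $objref, int $offset, string $skip): bool
{
    $offset += strspn($data, $skip, $offset);
    return strpos($data, $objref, $offset) === $offset;
}

$data = "endobj\n12 0 obj\n<< >>\nendobj\n";
$declared = 6; // points at the "\n" before "12 0 obj", one byte early

var_dump(objectStartsAt($data, '12 0 obj', $declared, '0'));   // bool(false)
var_dump(objectStartsAt($data, '12 0 obj', $declared, "0\n")); // bool(true)
```

The sketch compares with `===` so that a `false` from `strpos` is never confused with offset 0. Other stray bytes, such as `"\r"` or a space, would need to be added to the mask in the same way.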
@tacituseu thanks for the analysis and workaround.
It would be great if you could provide a PR.
@DanielRuf: It is unlikely to be a generic solution. The issue is not in pdfparser but in the lib it depends on, and even then it's not really the lib's fault but that of non-compliant generators.
There are likely more types of whitespace characters that could appear there, and also many other places along the way where they could cause trouble.
I checked https://github.com/tecnickcom/TCPDF/issues before and there is no open issue for it there, hence I posted here. I am not willing to commit to working on a generic solution.
That is very unfortunate - because we can patch this using composer-patches.
The issue started with 1.7 files and is, afaik, also caused by Adobe PDF creators.
"Exception Missing catalog" is still a bug in [v2.8.0-RC2].
"Exception Missing catalog" is still a bug in [v2.8.0-RC2] and in v2.9.0. I'll include an 8.85 KB PDF file regarding the "PDF Problem: Missing catalog.": mangs.pdf. Hope it will help :)
Thanks for this. It looks like this document is password encrypted, so setting setIgnoreEncryption(true); is required to display the 'missing catalog' error. So I'm not sure it's precisely the same issue. Maybe it is? Regardless, the output is gobbledegook.
The earlier test document from https://github.com/smalot/pdfparser/issues/32#issuecomment-310410108 now parses correctly. However, the one from https://github.com/smalot/pdfparser/issues/32#issuecomment-236376751 still displays the missing catalogue error. Edit: This second PDF is a scan of a document and so contains no text, however it should at least have a "Page" or "Pages" object which would prevent this error from appearing... Shouldn't it?
I'll dig into this and see what I can find out.
@GreyWyvern I can confirm that this happens on PDFs which are scans and have no text
I am getting an exception when trying to parse PDF
Exception
Missing catalog.