Open SNTRM-G opened 9 years ago
I found out the issue is with the PDF version. PDF 1.7 files won't parse with this version of the library in most cases.
In my case, I converted the file down to version 1.4 and then extracted the text.
See: https://github.com/xthiago/pdf-version-converter
I'd rather not depend on this, though. Please let me know if it's possible to extract text directly from a PDF 1.7 file.
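For anyone who wants to check the declared version before deciding whether to convert, here is a minimal sketch. It only looks at the `%PDF-x.y` header at the start of the file. The function names `readPdfHeaderVersion` and `needsDowngrade` are made up for illustration and are not part of smalot/pdfparser or pdf-version-converter, and the "convert only above 1.4" rule is an assumption based on the workaround described above.

```php
<?php
// Hypothetical helper: returns the version declared in the "%PDF-x.y" header,
// or null when no such header is found.
function readPdfHeaderVersion(string $bytes): ?string
{
    // The header is expected near the start of the file; 1024 bytes is a lenient window.
    if (preg_match('/%PDF-(\d\.\d)/', substr($bytes, 0, 1024), $m) === 1) {
        return $m[1];
    }
    return null;
}

// Hypothetical policy: only files declaring a version above the target get converted.
function needsDowngrade(?string $version, string $target = '1.4'): bool
{
    return $version !== null && version_compare($version, $target, '>');
}
```

To avoid loading a large file, the first bytes can be read with `file_get_contents($path, false, null, 0, 1024)`. Keep in mind that the header is only what the file claims about itself, so it may not be enough on its own to decide.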
Hi ! Just 1 year late :/
I face this same problem but when I check the PDF document version via
use \Xthiago\PDFVersionConverter\Guesser\RegexGuesser;
$guesser = new RegexGuesser();
$version = $guesser->guess($abspath);
I get $version = 1.3, so I first thought it was not the same problem. Anyway, I did the conversion and it actually works! So, simply, thanks alonemayank. Although the code is a bit ugly, at least it is functional:
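// Requires at the top of the file:
//   use Symfony\Component\Filesystem\Filesystem;
//   use Xthiago\PDFVersionConverter\Converter\GhostscriptConverter;
//   use Xthiago\PDFVersionConverter\Converter\GhostscriptConverterCommand;
//   use Xthiago\PDFVersionConverter\Guesser\RegexGuesser;
public static function readPDF($abspath, $nonl = false) {
    $guesser = new RegexGuesser();
    $version = $guesser->guess($abspath);
    Log::debug("Documentor->readPDF about to parse $abspath version $version");
    $parser = new \Smalot\PdfParser\Parser();
    try {
        $pdf = $parser->parseFile($abspath);
    } catch (\Exception $e) {
        // Tricky, but it works. Do not convert other files, or getting the text from them won't work right.
        if (preg_match('/TCPDF_PARSER ERROR: Invalid object reference:/i', $e->getMessage())) {
            $command = new GhostscriptConverterCommand();
            $filesystem = new Filesystem();
            $converter = new GhostscriptConverter($command, $filesystem);
            // Rewrites the file at $abspath as PDF 1.4, then retries the parse.
            $converter->convert($abspath, '1.4');
            $version = $guesser->guess($abspath);
            Log::debug("Documentor->readPDF must CONVERT $abspath TO version $version");
            $pdf = $parser->parseFile($abspath);
        } else {
            throw $e;
        }
    }
    // ... (remainder of the method not shown in the original post)
}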
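The snippet above converts the file at $abspath itself. If you would rather leave the original untouched, one option is to run the fallback on a temporary copy. This is only a sketch: `parseWithFallback` is a hypothetical name, and `$parse` and `$convert` are caller-supplied callables rather than anything either library provides. It also assumes the parser reads the whole file into memory, which the `parseFile` to `parseContent` frames in the stack trace in this thread suggest, so the copy can be deleted once parsing returns.

```php
<?php
/**
 * Hypothetical wrapper: tries $parse($path) first and, only for the
 * "Invalid object reference" failure, retries on a converted temporary copy.
 */
function parseWithFallback(string $path, callable $parse, callable $convert)
{
    try {
        return $parse($path);
    } catch (\Exception $e) {
        // Any other failure is passed on unchanged, as in the snippet above.
        if (stripos($e->getMessage(), 'Invalid object reference') === false) {
            throw $e;
        }
    }

    // Work on a temporary copy so the original file is never modified.
    $copy = tempnam(sys_get_temp_dir(), 'pdf');
    if ($copy === false) {
        throw new \RuntimeException("Could not create a temporary file for $path");
    }
    try {
        if (!copy($path, $copy)) {
            throw new \RuntimeException("Could not copy $path to $copy");
        }
        $convert($copy);      // expected to rewrite $copy in place
        return $parse($copy);
    } finally {
        @unlink($copy);       // clean up whether or not the retry succeeded
    }
}
```

With the libraries from this thread, `$parse` could be a closure around `$parser->parseFile($p)` and `$convert` a closure around `$converter->convert($p, '1.4')`.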
Hello, any news?
I tried with 6 different files, with different PDF versions:
TCPDF_PARSER ERROR: Invalid object reference: Array
Missing catalog.
I believe the files are badly formatted, but the exceptions don't tell me what I need to change.
Update:
Update2:
The file format (header, trailer, objects, xref, streams) is corrupted.
The document doesn't conform to the PDF reference (missing required entries, wrong value types, etc.).
The document does not conform to the PDF 1.3 standard.
Context: smalot/pdfparser v0.9.25, tecnickcom/tcpdf 6.2.12, PHP 5.4.6.0; Windows 10 (x64).
Hi,
We are unable to process any PDF file. We tested with the demo page (http://www.pdfparser.org/demo) and it parses OK. But when we try
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($pdf_file_path);
it always fails with the following error:
PHP Fatal error: Uncaught exception 'Exception' with message 'TCPDF_PARSER ERROR: Invalid object reference: Array' in T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\tecnickcom\tcpdf\tcpdf_parser.php:807
Stack trace:
#0 T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\tecnickcom\tcpdf\tcpdf_parser.php(680): TCPDF_PARSER->Error('Invalid object ...')
#1 T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\tecnickcom\tcpdf\tcpdf_parser.php(286): TCPDF_PARSER->getIndirectObject('', '173', true)
#2 T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\tecnickcom\tcpdf\tcpdf_parser.php(195): TCPDF_PARSER->decodeXrefStream('173', Array)
#3 T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\tecnickcom\tcpdf\tcpdf_parser.php(117): TCPDF_PARSER->getXrefData()
#4 T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\smalot\pdfparser\src\Smalot\PdfParser\Parser.php(88): TCPDF_PARSER->__construct('%PDF-1.3?%?????...')
#5 T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\smalot\pdfparser\src\Smalot\PdfParser\Parser.php(74): Smalot\PdfParser\Parser->parseContent('%PDF-1.3?%?????...')
in T:\Projecto\PHP\PDF_parsing\PDFParser\vendor\tecnickcom\tcpdf\tcpdf_parser.php on line 807
Could you help? Thanks.
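One detail in the trace above may be worth checking: the failure happens inside `decodeXrefStream`, yet the content passed to the parser starts with `%PDF-1.3`. Cross-reference streams were introduced in PDF 1.5, so the header can understate the features a file really uses. A rough way to test a file for this is to scan its raw bytes for the relevant dictionary markers. The sketch below is a heuristic only, and the function names are invented for illustration. They are not part of TCPDF or smalot/pdfparser.

```php
<?php
// Hypothetical heuristic: true when the raw bytes contain a cross-reference
// stream dictionary ("/Type /XRef"), a PDF 1.5+ feature.
function usesXrefStream(string $bytes): bool
{
    return preg_match('~/Type\s*/XRef\b~', $bytes) === 1;
}

// Hypothetical heuristic: true when the raw bytes contain an object stream
// dictionary ("/Type /ObjStm"), also a PDF 1.5+ feature.
function usesObjectStreams(string $bytes): bool
{
    return preg_match('~/Type\s*/ObjStm\b~', $bytes) === 1;
}
```

Calling `usesXrefStream(file_get_contents($pdf_file_path))` on a failing file would show whether it falls into this category. If it does, that would be consistent with the reports earlier in this thread that converting to 1.4 helps even when the header already says 1.3.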