luochen1990 / nodejs-easy-pdf-parser

a lightweight, promise style, functional wrapper of pdf2json, extract text from pdf easily
Apache License 2.0

(node:3972) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'Pages' of undefined #1

Open Masterxilo opened 2 years ago

Masterxilo commented 2 years ago

With the attached PDF:

das 1-mal-1 1x1 des anlegens brochure-1x1-anlegen-de.pdf

sudo npm install -g easy-pdf-parser
pdf2text 'das 1-mal-1 1x1 des anlegens brochure-1x1-anlegen-de.pdf'
(node:299) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'Pages' of undefined
    at extractText (/usr/lib/node_modules/easy-pdf-parser/src/easy-pdf-parser.js:22:38)
    at extractPlainText (/usr/lib/node_modules/easy-pdf-parser/src/easy-pdf-parser.js:43:20)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
(Use `node --trace-warnings ...` to show where the warning was created)
(node:299) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:299) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Another program can read the same file just fine, e.g.

pdftotext 'das 1-mal-1 1x1 des anlegens brochure-1x1-anlegen-de.pdf' -

Masterxilo commented 2 years ago

see https://stackoverflow.com/questions/7719499/print-contents-of-a-pdf-to-the-command-line/51337079?noredirect=1#comment124056295_51337079

luochen1990 commented 2 years ago

I can run this successfully.

$ node --version  
v10.19.0

The tail of the text extracted from this PDF document is:

----- page 31 -----

22803_Broschuere_1x1Anleger_d_2015_08_19.indd   2 21.12.15   16:57 22803_Broschuere_1x1Anleger_d_2015_08_19.indd   31 21.12.15   16:57

----- page 32 -----

Diese Publikation dient nur zur Information. Sie ist weder als 
Empfehlung, Offerte oder Aufforderung zur Offertstellung 
noch als Rechts- oder Steuerberatung zu verstehen. Sie sollten 
sich professionell beraten lassen, bevor Sie eine Entscheidung 
treffen. UBS behält sich das Recht vor, Dienstleistungen, 
Produkte und Preise jederzeit ohne Vorankündigung zu ändern. 
Einzelne Dienstleistungen und Produkte unterliegen rechtli-
chen Restriktionen. Sie können deshalb nicht uneingeschränkt 
weltweit angeboten werden. Die vollständige oder teilweise 
Reproduktion ohne ausdrückliche Erlaubnis von UBS ist untersagt. 
UBS Switzerland AG
Postfach
8098 Zürich
© UBS 2020. Das Schlüsselsymbol und UBS gehören zu den geschützten Marken von UBS. Alle Rechte vorbehalten. August 2020. 84219D

I don't know yet why it doesn't work for you...

Masterxilo commented 2 years ago

I see, @luochen1990. I have a different Node.js version:

$ node --version
v14.16.0
$ npm --version
6.14.11

luochen1990 commented 2 years ago

That is weird. From your stack trace, it seems that the object being read has no Pages attribute at that point, and it seems unlikely that this is a Node.js version compatibility issue... hmmm
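
The UnhandledPromiseRejectionWarning in the report appears because the TypeError is thrown inside a promise chain that has no rejection handler. As a minimal sketch of how a caller might surface such a failure as an ordinary error result, the snippet below uses `extractTextStandIn` and `runSafely`, both hypothetical names invented for illustration (they are not part of easy-pdf-parser or pdf2json), and assumes the failure comes from reading `formImage.Pages` on data that has no `formImage` key.

```javascript
// Hypothetical stand-in that fails the same way as the reported trace:
// it reads Pages from a formImage property that may be missing.
async function extractTextStandIn(pdfData) {
  return pdfData.formImage.Pages.length; // TypeError when formImage is undefined
}

// Hypothetical helper: awaits a promise-returning task and converts a
// rejection into a plain result object instead of leaving it unhandled.
async function runSafely(task) {
  try {
    return { ok: true, value: await task() };
  } catch (err) {
    return { ok: false, error: err };
  }
}

// Data without a formImage key triggers the failure path.
runSafely(() => extractTextStandIn({ Pages: [] })).then((result) => {
  if (!result.ok) {
    console.log('PDF text extraction failed:', result.error.message);
  }
});
```

A command-line tool could additionally set `process.exitCode = 1` in the failure branch so that scripts calling it can detect the error.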

Robak08 commented 2 years ago

@luochen1990 I'm running into a similar problem. It seems to have started after the pdf2json dependency was updated to a minor version above 1.2.3 (1.3.1 at the moment).

I believe it's connected with https://github.com/modesty/pdf2json/issues/249#issuecomment-953029255

The data structure changed:

//before
pdfData.formImage.Pages
//after
pdfData.Pages
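
A minimal sketch of how calling code might tolerate both layouts is shown here. `getPages` is a hypothetical helper invented for illustration, not a function of pdf2json or easy-pdf-parser, and the sample objects are stand-ins that only mimic the two shapes described above.

```javascript
// Hypothetical helper: returns the Pages array from either layout.
function getPages(pdfData) {
  if (pdfData && Array.isArray(pdfData.Pages)) {
    return pdfData.Pages; // "after" layout: Pages at the top level
  }
  if (pdfData && pdfData.formImage && Array.isArray(pdfData.formImage.Pages)) {
    return pdfData.formImage.Pages; // "before" layout: Pages under formImage
  }
  // Fail with a descriptive message instead of "Cannot read property 'Pages' of undefined".
  throw new TypeError('No Pages array found in parsed PDF data');
}

// Stand-in objects mimicking the two shapes (not real parser output).
const oldShape = { formImage: { Pages: [{ Texts: [] }] } };
const newShape = { Pages: [{ Texts: [] }, { Texts: [] }] };

console.log(getPages(oldShape).length);
console.log(getPages(newShape).length);
```

Checking the top-level `Pages` first is an arbitrary choice in this sketch; either order works as long as both locations are probed before giving up.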

The biggest issue is that certain small SVG elements are no longer visible in the Fills array of the Pages object. Those were present with version 0.0.4.

//edit

I've also been testing different pdf2json versions with Node v14.16.0 and v16.14.2. A few observations:

pdf2json v1.2.1 - v1.2.4:

pdf2json v1.3.1 and v2.0.1: