modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

pdf2json Performance over large PDF #70

Open barneydunning opened 8 years ago

barneydunning commented 8 years ago

Hi All,

I have a PDF file that contains about 500 pages (3.6 MB) - I can't post it because it contains sensitive data. When I load it up through pdf2json, it takes about 10 minutes to fire the dataReady callback... is this expected?

I am running the node application on a MacBook Pro, i7, 16 GB... and seriously expected it to be faster.

The PDF contents are of a timetable nature... and all I want to extract are the text strings and their x/y locations, grouped by page.
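
For reference, here's roughly how I'm loading the file and pulling out the coordinates (a minimal sketch; the path is a placeholder and the exact JSON shape seems to vary a little between pdf2json versions):

```js
const PDFParser = require("pdf2json");

const pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));
pdfParser.on("pdfParser_dataReady", pdfData => {
  // Older releases nest pages under pdfData.formImage.Pages; newer ones expose pdfData.Pages directly.
  const pages = pdfData.formImage ? pdfData.formImage.Pages : pdfData.Pages;
  const byPage = pages.map((page, i) => ({
    page: i + 1,
    texts: page.Texts.map(t => ({
      x: t.x,
      y: t.y,
      // each text item holds one or more URI-encoded runs in R[].T
      text: decodeURIComponent(t.R.map(r => r.T).join(""))
    }))
  }));
  console.log(JSON.stringify(byPage, null, 2));
});

pdfParser.loadPDF("./timetable.pdf"); // placeholder path
```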

Does anyone else have performance issues with pdf2json... or does anyone else have any suggestions as to other node modules to use for this purpose?

Looking forward to some help... and happy to answer any questions.

Ta.

modesty commented 8 years ago

The biggest PDF files in the unit tests are under 8 pages; I've never tested it with a 'large' file. If performance is an issue, I'd recommend splitting it into smaller ones before parsing, since smaller PDFs are well tested and perform well.
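
Something along these lines, as a rough sketch of split-then-parse (the splitting itself is outside pdf2json; here I'm assuming an external tool like qpdf is installed, and its exact invocation may vary):

```js
// Sketch: split the big file with an external tool (qpdf here, assumed to be
// installed separately), then run pdf2json over each small per-page file.
const { execSync } = require("child_process");
const fs = require("fs");
const PDFParser = require("pdf2json");

execSync("qpdf --split-pages large.pdf page-%d.pdf"); // one output PDF per page

const pageFiles = fs.readdirSync(".").filter(f => /^page-\d+\.pdf$/.test(f));

pageFiles.forEach(file => {
  const parser = new PDFParser();
  parser.on("pdfParser_dataError", err => console.error(file, err.parserError));
  parser.on("pdfParser_dataReady", () => console.log(file, "parsed OK"));
  parser.loadPDF(file);
});
```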

barneydunning commented 8 years ago

Hi there... thanks for the reply.

With so many downloads I am surprised no one else has hit this issue. The PDF files that we need to import are outwith our control, so we cannot lessen their size. They can be anything from one page to 1500 pages.

Are there any input options that cut down the amount of work this plugin does when preparing the data? The only information we require is the textual data along with its x and y coordinates.

Looking forward to your response.

Many thanks, Barney

modesty commented 8 years ago

One option is to update the stream implementation from file-level to page-level, so processing starts to flow as soon as a single page's data is ready. That would improve responsiveness, but it won't reduce the total processing time for large PDFs.
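
Purely for illustration, consuming per-page output might look something like the sketch below; the event names are assumptions about a possible page-level stream API, not something the current release provides:

```js
// Hypothetical page-level streaming: events would fire as each page is parsed,
// rather than once for the whole file. The current release only fires
// pdfParser_dataReady once all pages are done.
const PDFParser = require("pdf2json");
const pdfParser = new PDFParser();

pdfParser.on("readable", meta => console.log("metadata ready", meta));
pdfParser.on("data", page => {
  if (page) {
    console.log("page ready with", page.Texts.length, "text items");
  } else {
    console.log("all pages parsed");
  }
});
pdfParser.on("error", err => console.error(err));

pdfParser.loadPDF("./large.pdf"); // placeholder path
```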

barneydunning commented 8 years ago

Yep that's a shame. I take it there is no way of speeding up the process by limiting what it ends up outputting? So for example, asking it to only do specific types of work when loading the PDF document.

What would be the cause of the slowness... is it string manipulation or something similar in the inner workings of the module?

kishorsharma commented 8 years ago

We could use child processes to process pages in parallel. That would not only improve responsiveness but also reduce the total time for such large files. I would love to contribute and create a PR for it if you think the same.
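
Roughly what I have in mind (a sketch using Node's child_process.fork; it assumes the big PDF has already been split into per-page files such as page-1.pdf, page-2.pdf, ...):

```js
// parent.js -- fan per-page PDF files out to a pool of worker processes.
const { fork } = require("child_process");
const os = require("os");

const pageFiles = process.argv.slice(2); // e.g. node parent.js page-*.pdf
const poolSize = Math.min(os.cpus().length, pageFiles.length);
let next = 0;

for (let i = 0; i < poolSize; i++) {
  const child = fork(__dirname + "/worker.js");
  const sendNext = () => {
    if (next < pageFiles.length) child.send(pageFiles[next++]);
    else child.disconnect(); // no more work: close the IPC channel so the child can exit
  };
  child.on("message", result => {
    console.log("done:", result.file);
    sendNext();
  });
  sendNext();
}
```

```js
// worker.js -- parse one page file at a time with pdf2json and report back.
const PDFParser = require("pdf2json");

process.on("message", file => {
  const parser = new PDFParser();
  parser.on("pdfParser_dataError", err => process.send({ file, error: err.parserError }));
  parser.on("pdfParser_dataReady", () => process.send({ file }));
  parser.loadPDF(file);
});
```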

AshishGogna commented 8 years ago

I don't seem to have this issue. I have tried parsing an 11 MB PDF, and the dataReady callback fires in under a minute.

I am running the node application on my MacBook Pro, i5, 8 GB.

Here's the PDF that i tested - https://drive.google.com/file/d/0BzR-ZOIycHumX3hsbTVWbFMyQlU/view?usp=sharing

barneydunning commented 8 years ago

Sorry for the delay... damn holidays huh?! Well I am back now, so here goes...

Although the PDFs I am using are only ~4 MB, each page (of ~1,300 pages) has a grid of tabular data (about 8x8)... and some "cells" can have up to six text items in them, placed vertically. So it might not be about the size of the PDF, but rather the contents and their structure.
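
For what it's worth, this is roughly how I'd group a page's text items back into rows afterwards (just a sketch; the tolerance is a guess for this layout, and the decoding assumes pdf2json's usual URI-encoded R[].T runs):

```js
// Sketch: group one page's Texts into grid rows by y coordinate, then sort
// each row left to right by x. The rowTolerance value would need tuning.
function groupIntoRows(texts, rowTolerance = 0.5) {
  const items = texts
    .map(t => ({
      x: t.x,
      y: t.y,
      text: decodeURIComponent(t.R.map(r => r.T).join("")) // runs are URI-encoded
    }))
    .sort((a, b) => a.y - b.y || a.x - b.x);

  const rows = [];
  for (const item of items) {
    const row = rows.find(r => Math.abs(r.y - item.y) <= rowTolerance);
    if (row) row.items.push(item);
    else rows.push({ y: item.y, items: [item] });
  }
  return rows.map(r => r.items.sort((a, b) => a.x - b.x));
}
```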

kishorsharma - if you could look into speeding this up using child processes, then I would be happy to test your code. Any advance on 10 minutes would be a big bonus!

Please let me know your thoughts.

wanghaisheng commented 7 years ago

Any updates?