Feature/fork request: server-side PDF parsing, exposing text with properties

I know there's been a lot of discussion about porting pdf.js to node.js or some other server-side platform so that PDF rendering can be done on the server; what I'd like to propose is a port or fork or whatever of the portion of pdf.js that parses the PDF content, separate from the portion that then goes on to render it.

Let me give an example of a real-world use case. Imagine you build a site and you want users to upload a PDF file, which then needs to get processed server-side and important data pulled out. For example say your site asks users to upload invoices, such as this example: http://cloud.github.com/downloads/commondream/payday/example.pdf. Most invoices follow a general common format, though rarely are they identical. So say I allow a user to upload that PDF invoice, and I want to write some server-side code that finds and extracts the seller address and the buyer address. Simply pulling out all the text into one long string isn't good enough; my code also needs the X-Y coordinates of each block or line of text, so that I can see which address is close to the 'Bill To' text and which address is close to the 'Ship To' text, to properly identify each address.

So what I'm asking for is a node.js or other server-side module that takes a PDF file as input and returns an object or set of objects representing all the blocks of text on each page of the PDF, including each block's properties such as text content, X-Y coordinates, font size/color and so on. I assume the code to do this is already within pdf.js somewhere? Can it be separated from the rendering engine? How?
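To make the request concrete, here is a minimal sketch of the kind of object I have in mind and how calling code might use it. The shape (`blocks`, `text`, `x`, `y`, `fontSize`) and the helper `nearestToLabel` are hypothetical names for illustration only, not anything pdf.js exposes today, and the sample data is made up.

```javascript
// Hypothetical output for one page: one entry per text block, with
// coordinates in page units and the origin at the top-left corner.
const page = {
  number: 1,
  blocks: [
    { text: "Bill To", x: 40, y: 200, fontSize: 12 },
    { text: "Ship To", x: 320, y: 200, fontSize: 12 },
    { text: "123 Main St, Springfield", x: 40, y: 218, fontSize: 10 },
    { text: "456 Oak Ave, Shelbyville", x: 320, y: 218, fontSize: 10 },
  ],
};

// Return the block closest (by Euclidean distance) to the block whose
// text equals `label`, ignoring the label block itself.
// Returns null when the label is not found on the page.
function nearestToLabel(blocks, label) {
  const anchor = blocks.find((b) => b.text === label);
  if (!anchor) return null;
  let best = null;
  let bestDist = Infinity;
  for (const b of blocks) {
    if (b === anchor) continue;
    const d = Math.hypot(b.x - anchor.x, b.y - anchor.y);
    if (d < bestDist) {
      best = b;
      bestDist = d;
    }
  }
  return best;
}

const billTo = nearestToLabel(page.blocks, "Bill To");
const shipTo = nearestToLabel(page.blocks, "Ship To");
```

A real invoice would need something smarter than plain distance (for example, preferring blocks below the label), but even this much is impossible with a single flat string of text.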

Assuming you don't care about images and fonts, you could take the contents of src/ and strip out all the references to the different font and image parsing functions. As far as I know, a PDF file can be read without ever actually interpreting the contents of its fonts and images, so it should be relatively simple to remove all the calls to canvas.js, fonts.js, etc.

In fact, you probably only want the code related to the generation of <div>s for the text layer. But be warned: the spacing of the text does not always match the font's spacing, so the text layer is often shorter or longer than the graphical representation. (The XY coords of the top-left corner should still work, I think.)

I would try to help you with this but I don't really know how the parsing stuff works (I'm just a contributor who likes to help around sometimes).

Hi, thanks for your comment. I don't really understand the code well enough to do this myself either :) I was hoping maybe one of the devs could point me in the right direction, or if enough people consider this functionality to be potentially useful maybe it could be developed as a fork or a new feature of pdf.js. For example I'm not sure how the workflow goes right now, but I can imagine a new workflow that goes something like this:

PDF input
parsing
rendering to screen OR returning parsed content as an object

To me the code reads as a bit of a jumble, with the rendering interspersed within the parsing from what I can tell, but I'm nowhere near the same league as the team building this so I hope I'm wrong. My point though is that I wonder if enough people might find it useful to have that last option—the parsed PDF content returned to the calling script as a well-organized, useful object that can be easily analyzed—then perhaps the devs might consider structuring pdf.js in such a way that that new output mode could be added. Would anyone else find this worthwhile? Does anyone who really understands pdf.js mind pointing us in the right direction for how to begin achieving it?

Hi. The server-side "rendering" (actually a JS program generation) would also be great for a commenting system. Say you have some PDF and want users to add their comments linked to some {page,X,Y} point. Server-side code could analyze these comments (stored in a DB) and place additional

s with comment text or link. Probably only text layer would be enough - not entire canvas stuff.

Sorry, it should be "...place additional divs with..." - forgot about tags parsing

You can edit your comments. The button appears when you hover the mouse over your post, at the top-right corner. Use &lt; and &gt; instead of < and > and you'll get what you wanted it to look like.

EDIT: Also, if you do, it's nice to leave some comment about the edits. I like to add [EDIT: message] at the end if the edit isn't adding more content, and the way I did it here if it does. Also fixed some punctuation and clarified my meaning.

Hi,

I also see the value of parsing PDFs on the server side, and I am about to start porting pdf.js to node.js. I just want to share some technical notes here.

My use case is to create a JSON representation of a PDF form on the server, so that client apps (mobile, tablet, web, Windows and Mac) can render interactive PDF forms from the same concise, text-based JSON data source. Sample JSON data could look something like this:

{
  pdfform: {
    metadata: {title: "", author: "", createDate: "", modDate: "" ...},
    id: {name: "", copy: "" ...},
    width: 1234,
    pages: [
      {
        height: 5678,
        fills: [{x: 0, y: 0, w: 10, h: 10, clr: 1}, ...],
        lines: [{...},],
        texts: [{...}],
        fields: [{....}]
      },
      {....}
    ]
  }
}

I don't plan to support PDF images in the JSON, so I will leave them out for now. In addition to handling pdf.js' global objects (like PDFJS and globalScope) in a node module's scope, I also have to deal with some pdf.js dependencies that are only available in browsers, including:

XHR Level 2: I don't need XMLHttpRequest to load the PDF asynchronously in node.js, so I replaced it with node's fs (File System) module, which loads the PDF file based on request parameters;
DOMParser: pdf.js instantiates DOMParser to parse XML-based PDF metadata; I used the xmldom node module to make it work. xmldom can be found at https://github.com/jindw/xmldom;
Web Worker: pdf.js has "fake worker" code built in, so not much work is needed; just be aware that parsing occurs in the same thread, not in a background worker thread;
Canvas: to keep the pdf.js code intact as much as possible, I decided to create an HTML5 Canvas API implementation as a node module, named PDFCanvas. It has the same API, so no change is needed in pdf.js' canvas.js; when the 2D context API is invoked, PDFCanvas just writes the call to a JS object based on the JSON format above;
Fonts: there is no need to call ensureFonts to make sure fonts are downloaded; I only need to parse out the font info in CSS font format to be used in the JSON's texts array;
DOM: all DOM manipulation code in pdf.js is commented out, including the creation of canvas and div elements for screen rendering and font downloading.
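To illustrate the Canvas item, here is a minimal sketch of the "record instead of draw" idea. It is not the actual PDFCanvas code: the class name `RecordingContext` and its internals are hypothetical, it covers only a handful of real 2D context methods (`fillRect`, `fillText`, `beginPath`, `moveTo`, `lineTo`, `stroke`), and it stores `fillStyle` as-is in `clr`, whereas the sample JSON above uses a number there.

```javascript
// Hypothetical stand-in for a 2D canvas context: instead of drawing,
// each call appends a plain object to the current page record.
class RecordingContext {
  constructor() {
    this.fillStyle = "#000000";
    this.font = "10px sans-serif";
    this.page = { fills: [], lines: [], texts: [] };
    this._subpaths = []; // each subpath is an array of {x, y} points
  }

  fillRect(x, y, w, h) {
    this.page.fills.push({ x, y, w, h, clr: this.fillStyle });
  }

  fillText(text, x, y) {
    this.page.texts.push({ x, y, text, font: this.font });
  }

  beginPath() {
    this._subpaths = [];
  }

  moveTo(x, y) {
    // moveTo starts a new subpath, so no segment is recorded to this point.
    this._subpaths.push([{ x, y }]);
  }

  lineTo(x, y) {
    if (this._subpaths.length === 0) this._subpaths.push([]);
    this._subpaths[this._subpaths.length - 1].push({ x, y });
  }

  stroke() {
    // Record one line entry per consecutive pair of points in each subpath.
    for (const points of this._subpaths) {
      for (let i = 1; i < points.length; i++) {
        const a = points[i - 1];
        const b = points[i];
        this.page.lines.push({ x1: a.x, y1: a.y, x2: b.x, y2: b.y });
      }
    }
  }
}

// Usage: the caller issues ordinary canvas calls and reads back JSON.
const ctx = new RecordingContext();
ctx.fillStyle = "#cccccc";
ctx.fillRect(0, 0, 10, 10);
ctx.font = "bold 12px Helvetica";
ctx.fillText("Invoice", 20, 30);
ctx.beginPath();
ctx.moveTo(0, 50);
ctx.lineTo(100, 50);
ctx.stroke();
const json = JSON.stringify(ctx.page);
```

The point of the design is that the calling code does not need to know it is not talking to a real canvas, which is why canvas.js can stay unchanged.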

The above is my initial "porting plan". The effort is still ongoing: both the fs and xmldom parts are working with a RESTful service built with node 0.8.11 and restify 1.4.4, and the Fonts and Canvas parts are currently in development. If you have any thoughts, advice or feedback to share, I'd love to hear it.
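For the fs part, the replacement for the XHR loader can be very small. The following is only a sketch under my own assumptions: `loadPdfBytes` is a hypothetical helper name, and the strict check for the `%PDF-` marker at offset 0 is a simplification that a more tolerant reader might relax.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Read a PDF from disk and return its bytes as a Uint8Array,
// the kind of typed array a parser can consume directly.
function loadPdfBytes(filePath) {
  const buffer = fs.readFileSync(filePath);
  const bytes = new Uint8Array(buffer); // copies the Buffer contents
  // PDF files begin with a "%PDF-" header followed by the version.
  const header = String.fromCharCode(...bytes.subarray(0, 5));
  if (header !== "%PDF-") {
    throw new Error("Not a PDF file: " + filePath);
  }
  return bytes;
}

// Usage with a tiny stand-in file (header and trailer marker only).
const samplePath = path.join(os.tmpdir(), "sample-" + process.pid + ".pdf");
fs.writeFileSync(samplePath, "%PDF-1.4\n%%EOF\n");
const sampleBytes = loadPdfBytes(samplePath);
fs.unlinkSync(samplePath);
```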

I know there's been a lot of discussion about porting pdf.js to node.js or some other server-side platform so that PDF rendering can be done on the server; what I'd like to propose is a port or fork or whatever of the portion of pdf.js that parses the PDF content, separate from the portion that then goes on to render it.

You are welcome to open a port or fork. Sounds like a great idea.

Not sure why it's a pdf.js issue. Closing it?

You are right, it's not a pdf.js "issue", just an idea to extend the use case scope of pdf.js. Please close it. BTW, based on the tech notes in my last post, parsing of read-only PDF content via node.js is very close to working; once it passes my tests, I'll share it by opening a port or fork.

Hi, yes, it was a feature request, not an "issue." Modesty, I would very much love to take a look at your code whenever you're comfortable sharing it :)

If I could make one comment, I see in your plan above that you intend to leave out all DOM stuff; I would strongly encourage you not to discard the basic attributes of each text box, like x-y coordinates and font size/weight/style etc. Those properties (especially x-y) are vital for parsing a document with an expected, standardized layout, for example an invoice PDF output by a program like QuickBooks. I could write a function to parse an input where I know the payee is always the text box around 20px left and 300px down, and so on. That's very valuable data that would come in handy in most parsing cases, I would think.
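As a rough illustration of what I mean, a lookup by expected position could be as simple as the sketch below. The block shape and the helper name `blockNear` are hypothetical, the sample data is invented, and the coordinates assume a top-left origin.

```javascript
// Return the block nearest to the expected position, or null if no
// block lies within `tolerance` units of it.
function blockNear(blocks, expected, tolerance) {
  const candidates = blocks
    .map((b) => ({
      block: b,
      dist: Math.hypot(b.x - expected.x, b.y - expected.y),
    }))
    .filter((c) => c.dist <= tolerance)
    .sort((a, b) => a.dist - b.dist);
  return candidates.length > 0 ? candidates[0].block : null;
}

// Made-up sample data standing in for one parsed invoice page.
const blocks = [
  { text: "INVOICE", x: 20, y: 40, fontSize: 18 },
  { text: "Acme Supplies Ltd.", x: 22, y: 297, fontSize: 11 },
  { text: "Total: 120.00", x: 400, y: 700, fontSize: 11 },
];

// The payee is expected around 20 from the left and 300 down.
const payee = blockNear(blocks, { x: 20, y: 300 }, 10);
```

The tolerance matters because real documents drift by a few units between exports, so an exact coordinate match would rarely hit.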

hi,

I've just completed the "modulization" work for pdf2json, which enables PDF parsing on the server with node.js 0.8.14. It's registered with NPM:

https://npmjs.org/package/pdf2json

And also on GitHub:

https://github.com/modesty/pdf2json

This initial commit focuses on "read-only" content in the PDF, including lines, fills, colors, font styles and texts. Interactive form element parsing will be a future effort.

The reason for leaving out all DOM manipulation is simply that there is no DOM on the server: we're parsing and rendering the PDF in "memory" (a JS object) rather than in the browser's DOM. When I start to parse interactive forms, including text boxes, radio buttons, dropdowns, push buttons, check boxes, etc., all the basic attributes (position, size, data, etc.) will certainly be kept.

Awesome. There is also a solution provided in #1664, https://github.com/jviereck/node-pdfreader . Closing as won't fix for pdf.js.

Let me give an example of a real-world use case. Imagine you build a site and you want users to upload a PDF file, which then needs to get processed server-side and important data pulled out.

This task can be accomplished without uploading the PDF to the server: as in the main viewer demo, read and parse the selected PDF data on the client and send only the required data to the server.

I created a similar server side PDF reader for Meteor: https://github.com/peerlibrary/meteor-pdf.js

mozilla / pdf.js

Feature/fork request: server-side PDF parsing, exposing text with properties #1815