modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json
Other
1.99k stars 377 forks source link

Pass in options that tell pdf2json what to output #21

Open GeoffreyBooth opened 10 years ago

GeoffreyBooth commented 10 years ago

Building off of the 'add PostScript coordinates' idea in #12 , perhaps we should support for an options object passed as a second parameter to loadPDF? And this options object could include key-value pairs like coordinates: 'PostScript' or useDictionary: false or excludeTextsProperties: ['clr', 'oc', 'A', 'R.S'] etc. This could also be an easy way to implement #20 .

I understand @modesty 's comment in closing #20 that an output format other than what he has produced is not what he had in mind, but surely we can make pdf2json more flexible to support more varied projects. I think it can produce the current output as well as others as different projects may desire. I'm happy to try to tackle this if others think it's a good idea, and I welcome ideas on the best implementation.

RST-J commented 10 years ago

I like the idea. However I have two things I consider important: First, there need to need to be reasonable defaults which allow you to use pdf2json as it was meant to be without getting a headache about whether and if so which options you have to pass to achieve this.. Second, options - where suitable - should not be exclusive. What if I was interested in getting out both, pixel and PostScript coordinates?

GeoffreyBooth commented 10 years ago

Agree on both counts. This is why I wanted to get feedback before building anything :)

Regarding the 'reasonable defaults'; perhaps if the options object is missing or equal to {legacy: true} then pdf2json returns output as it does now? That way we don't break any projects that rely on the current implementation.

And as for inclusive/exclusive, I agree, we should be able to get all possible properties returned. So instead of properties x and y and either they're one mode or another mode, we could keep x and y as they are and simply add new properties xPostScript and yPostScript. And then there would be some way to choose which properties to return, if you didn't want these new additions (or if you only want these new additions, and not the old x and y).

Speaking of which, I've been thinking about the JSON object that gets returned as output. Properties with names like A and R and clr not only aren't idiomatic JavaScript nor consistent, they're just not very descriptive; this makes it hard for me to remember what they refer to, without having to refer back to the reference. What if instead of Pages[0].Texts[0].R[0].T we had something like pages[0].texts[0].content? And of course, to maintain backward compatibility pdf2json would return the old format when the options object was missing or set to {legacy: true}. And we could borrow a page from MongoDB's fields query parameter, where options could contain something like { outputObject: { 'pages.texts': true } to return only the texts array within the pages array and nothing else; or { outputObject: { 'pages.horizontalLines': false, 'pages.verticalLines: false } } to return everything except the horizontal and vertical lines properties.

I'm thinking of going even one step further: within this outputObject parameter, let the user specify the names of the properties in the returned JSON output. For example say my long descriptive camelcased names are the default non-legacy property names. Then the options object could have something like this:

{ outputObject:
  { 'pages.horizontalLines': true,
    'pages.verticalLines': true,
    'pages.fills': 'pages.blocks', // renames 'fills' as 'blocks' within the 'pages' array
// . . . etc.

Which return only the horizontal lines, vertical lines, and fills (but in an array named blocks).

Why do something so elaborate? Two reasons: so that the output is ready to be processed and inserted into a user's database, which is easier when the property names already match the database's field names; or so that the output is ready to be processed and turned into HTML to be output on a page. I understand the desire to have short property names to reduce the size of the payload, and using this implementation that can still be achieved (even better than the current output, in fact). Other than database field names, perhaps A could become text-align or textAlign, the CSS property name, for easy insertion into a constructed HTML style attribute.

Which raises yet another point: perhaps all the attribute names that have equivalent CSS properties should have their default names as such? textAlign, color, fontWeight, fontStyle etc.? And offered as separate properties, in addition to the merged 'style index', for easier conversion into HTML (or at least, easier comprehension for anyone with an HTML/CSS background).

So a few questions:

  1. Is there any reason to keep the current super-short property names as the non-legacy defaults? Can I rename them all to be camelcased and mimicking CSS names when applicable?
  2. Do we want outputObject to behave like the MongoDB fields parameter, for quick including/excluding properties, or do want this additional renaming-properties functionality?
  3. Why is the text run an array? What's currently Pages[0].Texts[0].R—why is R (text run) an array of text runs? In my tests R is always a one-element array. Couldn't we bring up R.T and R.S one level up, at the same level as A and clr? When would R have more than one element?
  4. Is there any reason why every text run needs to have every property, even if the property is empty or an obvious default? For example maybe we could add a defaultValues object within the output, that specifies default property values like text alignment and color; so for most English-language documents those would be left and #000, and then the A/textAlign and clr/color properties would simply be missing from any text runs that were left-aligned and black. Similar again to how CSS defines properties only that differ from the defaults or the inherited styles.
  5. What are the current X and Y coordinates, if they're not PostScript coordinates? Just pixel coordinates? 'PDF Unit' or 'page unit' coordinates? Something else?
  6. What other options would you want to be able to set in an options object? Currently I've been thinking of implementing the above outputObject, as well as useDictionary (true/false). Anything else?
modesty commented 10 years ago

For output format: currently, pdf2json is deployed in production that converts ~250 PDFs daily and the resulting JSON is feeding a HTML5 renderer without big issues. I understand your intention on idiomatic JavaScript, but it's nice to have, not essential to the parser. In case you do need to have a different or improved output format, I'd suggest to create a new node module that runs on top of pdf2json, then you can get exactly what you want, like pdf2csv: https://www.npmjs.org/package/pdf2csv.

I've been trying to avoid using "option" object to drive different outputs, mainly because this "derived configuration" way really creates mess for maintainability in the long run, especially when options object grows bigger over time and having multiple valid and invalid combinations. I'm in the camp of Ruby's "convention over configuration", not a fan of Java EE style configurations and options.

What's essential to pdf2json is parsing and packaging PDF in a clean and efficient way. #18 is the biggest issue so far, it makes the output JSON bigger and causes headaches to client renderer, not clean and definitely not efficient either. If you have some bandwidth to make pdf2json better, I'd recommend to work on issues like this.

As for the code structure, you are right, "base" is where the fork of pdf.js is, with documented modifications, since I'm running it outside of browser. Please checkout the file base/display/canvas.js, methods like showText and showSpacedText, I've attempted to fix #18 by auto-merging text blocks, but didn't work well in different test cases.

What's above is just some thoughts, if you think differently, let's talk more.

GeoffreyBooth commented 10 years ago

I understand the concern about an options object adding complexity to the module. Fair point.

Here's the thing though: if there's no way to tell pdf2json what to output, then it will always need to output everything, and it will be up to a wrapper module (like pdf2csv) or the parent app itself to filter out the unneeded items. For example @modesty needs the pixel coordinates and @RST-J needs PostScript coordinates; well, if there's no way to tell pdf2json which to output, it will need to output both sets of coordinates so that both @modesty and @RST-J get what they need and then it's up to each user's app to filter out the properties it doesn't need. Likewise the dictionary would have to be scrapped entirely; many apps would need the original font face etc. data that the dictionary would be eliding, and if there's no way to tell pdf2json whether or not to use the dictionary, it would have to never use the dictionary and always output the verbose version of all that data (or even worse, output all that detail and the dictionary short version).

Personally I think it makes more sense to put this "reformat the output" code as one standalone function within pdf2json, as a final step before returning the output object to the caller. You can still run all your tests as is, because if no options object was passed into pdf2json it would return its default output format. That's still "convention over configuration"—pdf2json has a default, and without modification it sticks to that default. Only when a configuration is specified does it deviate from the convention.

If you agree with me that we should add this functionality, then we should go on to specify its design; what options should we allow, etc.

As a separate question, what do you guys think about refactoring the output to be in idiomatic JavaScript? Things like horizontalLines instead of HLines, and so on. If we scrap the renaming idea, this would become a breaking change in the API (which probably isn't so horrible, as it's not like pdf2json has a huge userbase and current apps could always continue to just keep using the current version). Or if we add an options object, there could be an option to keep the legacy output property names, even if we don't allow arbitrary renaming.

And @modesty I understand your feelings about #18, I'd love to get that fixed too :) I have to deal with this for my app, though, while I don't necessarily need to fix #18 for my app to work. I'll definitely devote my attention to it after I get this output format stuff sorted.