modesty / pdf2json

converts binary PDF to JSON and text, for server-side PDF processing and command-line use.
https://github.com/modesty/pdf2json

Expose pdf.js getTextContent method for a pdf page #19

Closed: yveszoundi closed this issue 10 years ago

yveszoundi commented 10 years ago

Would it be possible to expose the getTextContent method, say via a Content property, so that a page's raw text is easy to get?

Use Case

Proposed Implementation

Add a Content property in pdf.js.

var page = {
    Height: pageParser.height,
    HLines: pageParser.HLines,
    VLines: pageParser.VLines,
    Fills: pageParser.Fills,
    Content: pdfPage.getTextContent(),
    Texts: pageParser.Texts,
    Fields: pageParser.Fields,
    Boxsets: pageParser.Boxsets
};

If there's another approach that deals with funky characters easily without introducing an API add-on, I'd be glad to hear about it.
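
For illustration, here is a minimal sketch of how a caller might turn the result of getTextContent into a page's raw text. It is not code from pdf2json or from the pull request. It assumes that getTextContent returns a promise, as it does in upstream pdf.js, and that the resolved value is either an object with an items array or a bare array whose entries carry a str field; the copy of pdf.js bundled with pdf2json may differ. The names textFromContent, pageRawText, and fakePage are hypothetical.

```javascript
// Hypothetical helper: joins the `str` field of each text item into one string.
// Accepts either { items: [...] } or a bare array (both shapes are assumptions).
function textFromContent(content) {
  const items = Array.isArray(content) ? content : (content && content.items) || [];
  return items
    .map(function (item) { return item.str; })
    .filter(function (str) { return typeof str === 'string'; })
    .join(' ');
}

// Hypothetical wrapper: waits for the page's text content, then flattens it.
// `pdfPage` is assumed to expose getTextContent() returning a promise.
function pageRawText(pdfPage) {
  return Promise.resolve(pdfPage.getTextContent()).then(textFromContent);
}

// Stand-in page object so the sketch runs without a real PDF.
const fakePage = {
  getTextContent: function () {
    return Promise.resolve({ items: [{ str: 'Hello' }, { str: 'world' }] });
  }
};

pageRawText(fakePage).then(function (text) {
  console.log(text);
});
```

Under these assumptions, a Content property assigned directly from getTextContent() would hold a promise rather than the text itself, so a consumer would need to wait for it to resolve before reading the page text.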

modesty commented 10 years ago

Please refer to: Pull Request #20: Expose PDF.js getTextContent method via a Content property.