sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

Request for examples in the documentation #38

Open johannspies opened 5 years ago

johannspies commented 5 years ago

I am struggling to figure out how to use this library to read a PDF as text for natural language processing, as an alternative to

using Taro
Taro.init()
meta, txtdata = Taro.extract(files[1]);

as shown in https://github.com/aviks/nlp-workshop/blob/master/NLP-in-julia.ipynb

Or can I not use this library instead of Taro (which I cannot compile on Julia 1.0.2)?

sambitdash commented 5 years ago

PDFIO is a somewhat lower-level API than Taro in this respect. It handles each page of a PDF separately, so you may need a few extra lines of code. The piece of code you are looking for is the following:

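using PDFIO

# Write the text of every page of the PDF at `src` to the text file `out`.
# Returns the document metadata reported by pdDocGetInfo.
function getPDFText(src, out)
    doc = pdDocOpen(src)                    # handle for all later operations
    try
        docinfo = pdDocGetInfo(doc)         # metadata, returned to the caller
        open(out, "w") do io
            for i in 1:pdDocGetPageCount(doc)
                page = pdDocGetPage(doc, i) # handle to page number i
                pdPageExtractText(io, page) # append this page's text to `out`
            end
        end
        return docinfo
    finally
        pdDocClose(doc)                     # do not use `doc` after this call
    end
end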

If you still run into any issues with the code, please let us know so that we can try to address them.

The library is deliberately kept flexible so that PDF objects can be queried in detail. A summary-level API or some samples would definitely help people get quick tasks done as well. We will keep that in mind and add a few examples to the documentation.
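
For the NLP use case in the question, where the text is wanted in a variable rather than in a file, the same calls can be pointed at an in-memory buffer. The following is only a minimal sketch: `pdf_page_texts` and `pdf_text` are hypothetical helper names, not part of PDFIO, and the sketch assumes that `pdPageExtractText` accepts any `IO` (it is called with a file stream in `getPDFText` above), so an `IOBuffer` should work as well.

```julia
using PDFIO

# Hypothetical helper (not part of PDFIO): return the text of each page
# of the PDF at `src` as a separate String.
function pdf_page_texts(src::AbstractString)
    doc = pdDocOpen(src)
    try
        texts = String[]
        for i in 1:pdDocGetPageCount(doc)
            buf = IOBuffer()                 # in-memory sink instead of a file
            pdPageExtractText(buf, pdDocGetPage(doc, i))
            push!(texts, String(take!(buf))) # drain the buffer into a String
        end
        return texts
    finally
        pdDocClose(doc)                      # always release the document handle
    end
end

# Hypothetical helper: the whole document as one String, pages joined by newlines.
pdf_text(src::AbstractString) = join(pdf_page_texts(src), "\n")
```

With these helpers, `txtdata = pdf_text(files[1])` would play roughly the role of the text returned by `Taro.extract` in the question, although whitespace and layout of the extracted text may differ. Keeping the pages separate is optional, but it can be convenient when a later tokenization step needs page boundaries.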

johannspies commented 5 years ago

Thanks! That helps.