Open johannspies opened 5 years ago
PDFIO is a somewhat lower-level API than Taro in this respect. It handles each page of a PDF separately, so you may need a few extra lines of code. The piece of code you are looking for is the following:
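using PDFIO

function getPDFText(src, out)
    # Open the document; doc is the handle used for all later calls.
    doc = pdDocOpen(src)
    # Document metadata, returned to the caller at the end.
    docinfo = pdDocGetInfo(doc)
    open(out, "w") do io
        npage = pdDocGetPageCount(doc)
        for i = 1:npage
            page = pdDocGetPage(doc, i)
            # Write the text of this page to the output file.
            pdPageExtractText(io, page)
        end
    end
    # Close the handle; doc must not be used after this call.
    pdDocClose(doc)
    return docinfo
end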
If you still face any issues or challenges with the code, please let us know so that we can try to address them.
The library is kept very flexible so that PDF objects can be queried in detail. A summary-level API or some samples would definitely help people get quick tasks done as well. We will keep that in mind and add a few examples and samples to the documentation.
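For NLP work it may be more convenient to get the text as a string in memory rather than in a file. The following is only a sketch along the lines of the function above: getPDFString is a hypothetical helper name, not part of PDFIO, and it assumes pdPageExtractText accepts any IO, so an IOBuffer can stand in for the file. The newline written between pages is a choice made here, not something the library requires.

```julia
using PDFIO

# Hypothetical helper (not part of PDFIO): collect the text of all pages
# into a single String instead of writing it to a file.
function getPDFString(src)
    doc = pdDocOpen(src)
    buf = IOBuffer()
    try
        for i in 1:pdDocGetPageCount(doc)
            page = pdDocGetPage(doc, i)
            # Assumption: pdPageExtractText writes to any IO, here a buffer.
            pdPageExtractText(buf, page)
            println(buf)  # separate pages with a newline (our own choice)
        end
    finally
        # Close the document handle even if extraction fails.
        pdDocClose(doc)
    end
    return String(take!(buf))
end
```

It could then be used as text = getPDFString("paper.pdf"), where the file name is a placeholder, and the resulting string passed on to whatever tokenizer or text-analysis package you use.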
Thanks! That helps.
I am struggling to figure out how to use this library to read a PDF as text for the purpose of Natural Language Processing, as an alternative to the approach shown in https://github.com/aviks/nlp-workshop/blob/master/NLP-in-julia.ipynb
Or can this library not be used instead of Taro (which I cannot compile on Julia 1.0.2)?