vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
MIT License
114 stars 22 forks

parse pdf per page #18

Closed harjedmor closed 5 years ago

harjedmor commented 5 years ago

Is it possible to extract text from only some pages of a PDF file?

vaites commented 5 years ago

@harjedmor this is not possible with Apache Tika directly, because it is not its goal. You need to split the PDF file first, and then use this library to extract the text from the resulting file. To do the split, there are a few libraries on Composer, or tools like pdftoolbox or pdftk.
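A minimal sketch of the split-then-extract approach, assuming the `pdftk` binary is on the PATH and the library is installed through Composer. The helper names `buildPdftkCommand` and `extractPagesText` are hypothetical and not part of this library; only `Client::make()` and `getText()` come from it.

```php
<?php
// Assumption: Composer autoloading is available in your project.
// require 'vendor/autoload.php';

use Vaites\ApacheTika\Client;

/**
 * Hypothetical helper: build the pdftk command that copies
 * pages $first..$last of $input into a new file $output.
 */
function buildPdftkCommand(string $input, string $output, int $first, int $last): string
{
    if ($first < 1 || $last < $first) {
        throw new InvalidArgumentException('Invalid page range');
    }

    return sprintf(
        'pdftk %s cat %d-%d output %s',
        escapeshellarg($input),
        $first,
        $last,
        escapeshellarg($output)
    );
}

/**
 * Hypothetical helper: split the PDF with pdftk, then let Tika
 * extract the text of the temporary file holding only those pages.
 */
function extractPagesText(Client $client, string $pdf, int $first, int $last): string
{
    // tempnam() creates a placeholder file; pdftk writes to a sibling
    // path that does not exist yet, so it never asks to overwrite.
    $base = tempnam(sys_get_temp_dir(), 'pages_');
    $tmp  = $base . '.pdf';

    try {
        exec(buildPdftkCommand($pdf, $tmp, $first, $last), $output, $status);
        if ($status !== 0) {
            throw new RuntimeException('pdftk failed with exit code ' . $status);
        }

        return $client->getText($tmp);
    } finally {
        // Clean up both temporary files whether or not extraction succeeded.
        @unlink($tmp);
        @unlink($base);
    }
}
```

With a client created as `Client::make('/path/to/tika-app.jar')` (the path is a placeholder), calling `extractPagesText($client, 'document.pdf', 2, 4)` should return the text of pages 2 to 4 only. Building the command in a separate function keeps the shell escaping in one place and makes the page-range validation easy to check without running pdftk.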

There's a trick, but I think it is not accurate and it may change in the future without warning.