vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
MIT License
114 stars 22 forks

Implement /rmeta/text #16

svaningelgem closed 5 years ago

svaningelgem commented 5 years ago

It would be nice to have access to the /rmeta/text call. This retrieves the metadata as well as the text content in a single request.

This matters when Tesseract is installed, because Tika will then use it for OCR. Retrieving the metadata of one file I have here (an image) takes 76s. Retrieving the text ALSO takes 76 seconds, so fetching both separately costs about 152s. [languages = nld+eng+fra+deu+chi_sim+chi_tra]

A curl test confirms this.

However, if I use /rmeta/text, I get all the same information from one call, within the same 76s.

So it'd be very nice to have some way to get rmeta.

Tests:

time curl -T /root/gmail_backup/mails/938/368/133/922/520/1598520922133368938.03 http://localhost:9998/tika --header "X-Tika-OCRLanguage: nld+eng+fra+deu+chi_sim+chi_tra"

real    1m16.375s
user    0m0.008s
sys     0m0.008s

time curl -T /root/gmail_backup/mails/938/368/133/922/520/1598520922133368938.03 http://localhost:9998/meta --header "X-Tika-OCRLanguage: nld+eng+fra+deu+chi_sim+chi_tra"

real    1m16.356s
user    0m0.008s
sys     0m0.004s

time curl -T /root/gmail_backup/mails/938/368/133/922/520/1598520922133368938.03 http://localhost:9998/rmeta/text --header "X-Tika-OCRLanguage: nld+eng+fra+deu+chi_sim+chi_tra"

real    1m16.247s
user    0m0.012s
sys     0m0.004s
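
For anyone who wants to try the single-call approach outside the library, here is a minimal Python sketch of it. It assumes the server from the tests above at http://localhost:9998, and it relies on /rmeta/text answering with a JSON array of metadata objects (the container document first, then any embedded documents) where the extracted text sits under the "X-TIKA:content" key. The function names `fetch_rmeta_text` and `split_content_and_metadata` are made up for illustration and are not part of php-apache-tika.

```python
import json
import urllib.request

# Key under which Tika's /rmeta endpoints place the extracted text.
TIKA_CONTENT_KEY = "X-TIKA:content"


def fetch_rmeta_text(path, base_url="http://localhost:9998", ocr_languages=None):
    """PUT a file to /rmeta/text and return the decoded JSON array.

    Hypothetical helper: base_url and ocr_languages mirror the curl tests
    above and are assumptions about the local setup.
    """
    headers = {"Accept": "application/json"}
    if ocr_languages:
        headers["X-Tika-OCRLanguage"] = ocr_languages
    with open(path, "rb") as fh:
        request = urllib.request.Request(
            base_url + "/rmeta/text",
            data=fh.read(),
            headers=headers,
            method="PUT",  # same verb as `curl -T`
        )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def split_content_and_metadata(entries):
    """Split the first (container) entry into (text, metadata).

    The input list is left unmodified; embedded documents (entries[1:])
    are ignored in this sketch.
    """
    if not entries:
        return "", {}
    first = dict(entries[0])  # shallow copy so the caller's data is untouched
    content = first.pop(TIKA_CONTENT_KEY, "") or ""
    return content.strip(), first
```

With this, one request such as `split_content_and_metadata(fetch_rmeta_text(path, ocr_languages="nld+eng"))` would yield both the text and the metadata, so the OCR cost is paid only once.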

Related to this I'll open an improvement request ;-)

svaningelgem commented 5 years ago

patch.txt

vaites commented 5 years ago

I added a second parameter to the getMetadata method that fills the content property on the returned Metadata instance. Hope it solves your problem...