vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
MIT License

Provide encoding check for a document #15

Closed svaningelgem closed 5 years ago

svaningelgem commented 5 years ago

For now, I can see that my encoding is available under getMetaData['meta']['Content-Encoding'].

But I saw that Tika has something called EncodingDetector.

Maybe we could use it to check the encoding of a given data set?

vaites commented 5 years ago

I don't understand your question. The Apache Tika server has no API call that returns only the encoding, so the only way to detect it is through the getMetaData() method...

You can use it with any data. If the data is in a variable, just write it to a temporary file and pass its path to the Apache Tika client...
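A minimal sketch of that temporary-file approach, not the library's own implementation. The helper names `withTempFile`, `extractEncoding` and `detectEncoding` are hypothetical, and the sketch assumes that the object returned by `getMetadata()` exposes the raw Tika fields under a `meta` property, as the path quoted earlier in this thread suggests:

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: write in-memory data to a temporary file, pass the
 * file path to $callback, and always delete the file afterwards.
 */
function withTempFile(string $data, callable $callback)
{
    $path = tempnam(sys_get_temp_dir(), 'tika');
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file');
    }
    try {
        if (file_put_contents($path, $data) === false) {
            throw new RuntimeException('Could not write to ' . $path);
        }
        return $callback($path);
    } finally {
        // Runs on success and on exceptions alike.
        unlink($path);
    }
}

/**
 * Hypothetical helper: read "Content-Encoding" from raw metadata fields.
 * Accepts an array or an object; returns null if the field is missing
 * or is not a plain string.
 */
function extractEncoding($meta): ?string
{
    $fields = (array) $meta;
    $value = $fields['Content-Encoding'] ?? null;
    return is_string($value) ? $value : null;
}

/**
 * Hypothetical helper: $client is expected to be a \Vaites\ApacheTika\Client.
 * Assumes the returned metadata object has a `meta` property.
 */
function detectEncoding($client, string $data): ?string
{
    return withTempFile($data, function (string $path) use ($client) {
        return extractEncoding($client->getMetadata($path)->meta);
    });
}
```

With the library installed and a Tika server listening on the default port, a call could look like `detectEncoding(\Vaites\ApacheTika\Client::make('localhost', 9998), $data)`. The result is `null` whenever Tika does not report a "Content-Encoding" field for the document.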

svaningelgem commented 5 years ago

I checked the URL you gave, and it seems the encoding is not explicitly exposed (but it is available through /meta).

So my question evolves :-) Is it possible to add an additional $encoding property to the \Vaites\ApacheTika\Metadata class, filled in from the "Content-Encoding" field?

Thanks!

svaningelgem commented 5 years ago

patch.txt

vaites commented 5 years ago

Thanks @svaningelgem, I'll take a look at all three issues ;)

vaites commented 5 years ago

Fixed in 4b4be8c4cae30c71f7dc9e9bc61c2693e0f9651d