vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
MIT License

TXT file (text/plain): Unsupported media type? #11

Closed tanshiqi closed 5 years ago

vaites commented 5 years ago

Sorry for the delay. I can reproduce this problem and am working on it. I will release a new version with a bugfix ASAP.

vaites commented 5 years ago

I tried on Windows and it fails, but it works well on Linux. Can you tell me your operating system and version?

Anyway, I assume you are using the Tika server, not the command line JAR, right?

tanshiqi commented 5 years ago

I use the Tika server from a Docker container: LogicalSpark/docker-tikaserver, Version 1.2. It's running on an Ubuntu server.

vaites commented 5 years ago

Thanks, will try with Docker

vaites commented 5 years ago

I installed the Docker container on Ubuntu Server 18.04 and everything works well. I also sent a request using cURL from the command line, as specified here:

curl -X PUT --data-binary @foo.txt http://localhost:9998/language/stream

Can you try this command to rule out a problem unrelated to this library?

Anyway, the problem is reproducible on Windows, and I'm working on it...

vaites commented 5 years ago

I'm sorry, but I can't reproduce the problem anymore. I tried on multiple devices, using Windows, Linux and Docker as servers, and none of them gives me this error.

Can you give me more information, please? What are you trying to do with the txt file? What are its size and encoding? Are you trying to use a remote document or a local one?

tanshiqi commented 5 years ago

Thanks so much. I can reproduce the problem when the file encoding is GB2312, while UTF-8 works fine.
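
To compare the two encodings side by side, a minimal sketch like the following builds the same text in UTF-8 and in GB2312. It assumes `iconv` is installed, and the file names `sample-utf8.txt` and `sample-gb2312.txt` are hypothetical, not the file from this report:

```shell
# Write two Chinese characters plus a newline as raw UTF-8 bytes.
# Octal escapes keep the script independent of the terminal's encoding.
printf '\344\275\240\345\245\275\n' > sample-utf8.txt

# Re-encode the same text as GB2312 (file names here are hypothetical).
iconv -f UTF-8 -t GB2312 sample-utf8.txt > sample-gb2312.txt

# Byte counts differ: 3 bytes per character in UTF-8, 2 in GB2312.
utf8_bytes=$(wc -c < sample-utf8.txt | tr -d ' ')
gb_bytes=$(wc -c < sample-gb2312.txt | tr -d ' ')
echo "UTF-8: ${utf8_bytes} bytes, GB2312: ${gb_bytes} bytes"
```

Sending each of the two files to the Tika server with cURL should show whether only the GB2312 copy triggers the error.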

tanshiqi commented 5 years ago

Example file: readme.txt

vaites commented 5 years ago

I ran some tests and I think this is a bug in Apache Tika itself. Using the file you uploaded, I always get the same error: javax.ws.rs.WebApplicationException: HTTP 415 Unsupported Media Type. I tried with the server on Windows, on Linux, and in the Docker container you're using. You can try it yourself with this command:

curl -T readme.txt http://localhost:9998/meta

But if I use other endpoints (like language) the library (and the server) returns the detected language.
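
To compare endpoints by HTTP status code alone, something like the following sketch could help. It assumes a Tika server listening at http://localhost:9998, as in the commands in this thread; the helper name `tika_status` is hypothetical and is not part of this library or of Tika:

```shell
# Hypothetical helper: PUT a file to a Tika server endpoint and print
# only the HTTP status code of the response (for example 200 or 415).
# Usage: tika_status <file> <endpoint> [base_url]
tika_status() {
    file=$1
    endpoint=$2
    base_url=${3:-http://localhost:9998}   # assumed default server address
    # -s: silent, -o /dev/null: discard the body, -w: print the status code,
    # -T: upload the file (cURL uses PUT for HTTP uploads)
    curl -s -o /dev/null -w '%{http_code}\n' -T "$file" "${base_url}/${endpoint}"
}
```

Running `tika_status readme.txt meta` and then `tika_status readme.txt language/stream` would show whether only the first request is rejected.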

The library only returns the error thrown by Apache Tika, so I don't think I can do anything more than recommend that you open a bug report with the Apache Tika project. If you think I'm wrong, I'm open to hearing other ideas...