vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
MIT License

Error with file name containing foreign characters #28

Closed: gregoriopellegrino closed this issue 3 years ago

gregoriopellegrino commented 3 years ago

I noticed that the command exits with error code 1 when I try to extract the text of a file called 07.-아름다운GIRL-Who-are-You-Official-Lyrics.pdf

I did some debugging, and the problem seems to be that no language (LANG) is set in the environment the command runs in.

I modified src/CLIClient.php, from line 316 onward, with this code:

$env = array(
     'LANG' => 'en_US.UTF-8'
);
// proc_open()'s fourth argument is the working directory; the environment is the fifth
$process = proc_open($command, $descriptors, $pipes, null, $env);

Now it works properly.
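For anyone trying the same workaround outside the library, here is a minimal standalone sketch of the idea. It is not the library's code: `runWithLocale` is a hypothetical helper, and the child command only echoes back the `LANG` value it receives. It assumes PHP 7.4 or later (array commands in `proc_open`) and merges the override into the inherited environment, since passing only `LANG` would drop variables such as `PATH`.

```php
<?php
// Hypothetical helper for illustration; not part of php-apache-tika.
function runWithLocale(array $command, string $lang = 'en_US.UTF-8'): array
{
    // Keep the parent's variables (PATH, HOME, ...) and override only LANG.
    // With the + operator, keys from the left-hand array take precedence.
    $env = ['LANG' => $lang] + getenv();

    $descriptors = [
        1 => ['pipe', 'w'], // child's stdout
        2 => ['pipe', 'w'], // child's stderr
    ];

    // Fourth argument: working directory (null keeps the current one).
    // Fifth argument: environment for the child process.
    $process = proc_open($command, $descriptors, $pipes, null, $env);
    if (!is_resource($process)) {
        throw new RuntimeException('Unable to start the process');
    }

    // Fine for small outputs; large stderr output would need interleaved reads.
    $stdout = stream_get_contents($pipes[1]);
    $stderr = stream_get_contents($pipes[2]);
    fclose($pipes[1]);
    fclose($pipes[2]);

    return [
        'exit'   => proc_close($process),
        'stdout' => $stdout,
        'stderr' => $stderr,
    ];
}

// The child PHP process prints the LANG value it was started with.
$result = runWithLocale([PHP_BINARY, '-r', 'echo getenv("LANG");']);
echo $result['stdout'], PHP_EOL;
```

Replacing the child command with the actual `java -jar tika-app.jar ...` invocation should behave the same way, but that part is untested here.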

vaites commented 3 years ago

Thanks @gregoriopellegrino, I can simply add a new parameter to set this value. I'll run some tests too...

Anyway, can you please tell me which operating system and version you're using?

gregoriopellegrino commented 3 years ago

CentOS 7

vaites commented 3 years ago

OK, that's enough for my tests; I'll keep you informed. One last question: what's your actual $LANG value? In theory, the library should inherit all the environment variables of the user that runs the command. How are you running it?

gregoriopellegrino commented 3 years ago

In the CLI I have en_US.UTF-8, so I set the same value in the script, since the command worked with no errors when run from the CLI.
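A quick way to compare what PHP sees in each context (shell versus however the script is actually launched) is to dump the relevant values from both. This is only a diagnostic sketch; `localeReport` is a hypothetical name, not something provided by the library.

```php
<?php
// Hypothetical diagnostic helper; run it from the CLI and from the real entry point.
function localeReport(): array
{
    return [
        'sapi'     => PHP_SAPI,               // e.g. "cli", "fpm-fcgi", "apache2handler"
        'LANG'     => getenv('LANG'),         // false when the variable is not set
        'LC_CTYPE' => setlocale(LC_CTYPE, '0'), // "0" only queries the current locale
    ];
}

var_dump(localeReport());
```

If `LANG` comes back as `false` in one context and `en_US.UTF-8` in the other, the two processes are not running with the same environment.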

vaites commented 3 years ago

My assumption seems to be wrong.

vaites commented 3 years ago

I've just pushed changes to the 1.x branch that allow users to set their own environment variables:

$client = Client::make('/path/to/tika-app.jar');
$client->setEnvVars(['VARIABLE' => 'value']);

Can you please check whether it works for you (set dev-1.x as the version in your composer.json)? It still needs some testing, but I think it should work without issues.

vaites commented 3 years ago

I'll release version 1.1 with this feature soon...

vaites commented 3 years ago

Fixed in the v1.0.2 release.