vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
MIT License

Fix extract text from url with DownloadRemote #36

Closed denistorresan closed 3 months ago

denistorresan commented 4 months ago

Hi, I found a potential bug when trying to use Tika to parse URLs. My code is the following, using Tika v2.9.x via Docker:

<?php

require 'vendor/autoload.php';

$client = \Vaites\ApacheTika\Client::make('localhost', 9998);
$client->setDownloadRemote(true);

// works (local file)
$document = $client->getText('./LIB0194021_A4-LHOV.pdf');

var_dump( $document );

// doesn't work (remote PDF)
$document = $client->getText('https://arxiv.org/pdf/1910.13461.pdf');

var_dump( $document );

// doesn't work (remote HTML page)
$document = $client->getMainText('https://arxiv.org/archive/astro-ph');

var_dump( $document );

The error is the following:


PHP Fatal error:  Uncaught Exception: Unprocessable document in C:\WORKAREA\Projects\cloudconversa\research\tika\vendor\vaites\php-apache-tika\src\Clients\WebClient.php:642
Stack trace:
#0 C:\WORKAREA\Projects\cloudconversa\research\tika\vendor\vaites\php-apache-tika\src\Clients\WebClient.php(556): Vaites\ApacheTika\Clients\WebClient->error()
#1 C:\WORKAREA\Projects\cloudconversa\research\tika\vendor\vaites\php-apache-tika\src\Client.php(389): Vaites\ApacheTika\Clients\WebClient->request()
#2 C:\WORKAREA\Projects\cloudconversa\research\tika\tika.php(12): Vaites\ApacheTika\Client->getMainText()
#3 {main}
  thrown in C:\WORKAREA\Projects\cloudconversa\research\tika\vendor\vaites\php-apache-tika\src\Clients\WebClient.php on line 642

This happens because Client.php (line 540) checks for "invalid remote file" before it reaches the "download remote file if required only for integrated downloader" step. I swapped these two blocks. I also added the CURLOPT_FOLLOWLOCATION option on line 637 so that redirects are followed and a download URL that returns a 301 no longer causes an error.
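
For readers who want to see the idea without opening the PR, here is a minimal sketch of the two changes described above. It is not the library's actual code: `resolveFile()` and `fetchWithCurl()` are hypothetical names made up for illustration, and the exact checks in Client.php may differ. It only shows the ordering (download first, validate afterwards) and a cURL download that follows redirects.

```php
<?php

/**
 * Hypothetical helper (not part of php-apache-tika): turn a file argument
 * into something the client can send, downloading remote files first.
 *
 * $fetch is an assumed callable that takes a URL and returns a local path.
 */
function resolveFile(string $file, bool $downloadRemote, callable $fetch): string
{
    $isRemote = (bool) preg_match('/^https?:\/\//i', $file);

    // 1. Download first, so the later check sees a local path.
    if ($isRemote && $downloadRemote) {
        $file = $fetch($file);
        $isRemote = false;
    }

    // 2. Only then validate local paths.
    if (!$isRemote && !is_readable($file)) {
        throw new RuntimeException("File $file can't be opened");
    }

    return $file;
}

/**
 * Hypothetical downloader: saves a URL to a temp file with cURL,
 * following 301/302 redirects.
 */
function fetchWithCurl(string $url): string
{
    $target = tempnam(sys_get_temp_dir(), 'tika');
    $handle = fopen($target, 'wb');

    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_FILE           => $handle, // write the body to the temp file
        CURLOPT_FOLLOWLOCATION => true,    // follow redirects
        CURLOPT_MAXREDIRS      => 5,       // arbitrary cap, chosen for the sketch
        CURLOPT_FAILONERROR    => true,    // treat HTTP >= 400 as a failure
    ]);

    $ok    = curl_exec($curl);
    $error = curl_error($curl);
    unset($curl);
    fclose($handle);

    if ($ok === false) {
        unlink($target);
        throw new RuntimeException("Download of $url failed: $error");
    }

    return $target;
}
```

With these helpers, `resolveFile('https://arxiv.org/pdf/1910.13461.pdf', true, 'fetchWithCurl')` would return the path of a downloaded temp file, while the same call with `false` as the second argument returns the URL unchanged so the server side can fetch it.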

Hope this is useful. Thank you!

vaites commented 4 months ago

Thanks @denistorresan, will take a look and try to merge the PR ASAP.

vaites commented 3 months ago

Merged and published, thanks @denistorresan!