vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
MIT License

Getting response text in chunks? #6

Closed NinoSkopac closed 7 years ago

NinoSkopac commented 7 years ago

Hey, is it possible to get chunks of text as soon as they're processed, as opposed to getting one large piece of text after processing is done?

vaites commented 7 years ago

A second parameter for getText() and other methods could be a callable, and each chunk could be passed to it. Let me run some tests...

NinoSkopac commented 7 years ago

Sounds like a good idea.

NinoSkopac commented 7 years ago

You could implement it like this

$options[CURLOPT_WRITEFUNCTION] = function($curl, $data) {
    print $data;

    // cURL aborts the transfer unless the callback returns the number of bytes handled
    return strlen($data);
};

vaites commented 7 years ago

I've been working on this feature and almost have it. I found only one small problem: when using Tika as a server and cURL to get the response, we can't define the chunk size, and the callback is called roughly every 8k (approximately, not always). This is not a huge limitation, but I want to offer this feature.
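If a fixed chunk size matters, one possible workaround is to re-buffer whatever cURL delivers and hand it on in equal-sized pieces. The following is only a minimal sketch: `makeFixedSizeWriter` and its parameters are hypothetical names, not part of php-apache-tika, and the transfer is simulated so the snippet runs without a Tika server.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns [$write, $flush].
 * $write has the (handle, data) signature expected by CURLOPT_WRITEFUNCTION
 * and forwards the incoming data to $onChunk in pieces of exactly $chunkSize bytes.
 * $flush emits whatever is left in the buffer once the transfer has ended.
 */
function makeFixedSizeWriter(int $chunkSize, callable $onChunk): array
{
    if ($chunkSize < 1) {
        throw new \InvalidArgumentException('Chunk size must be at least 1 byte');
    }

    $buffer = '';

    $write = function ($handle, string $data) use (&$buffer, $chunkSize, $onChunk): int {
        $buffer .= $data;

        // Emit as many full-sized chunks as the buffer currently holds.
        while (strlen($buffer) >= $chunkSize) {
            $onChunk(substr($buffer, 0, $chunkSize));
            $buffer = (string) substr($buffer, $chunkSize);
        }

        // Tell cURL that every received byte was handled.
        return strlen($data);
    };

    $flush = function () use (&$buffer, $onChunk): void {
        if ($buffer !== '') {
            $onChunk($buffer);
            $buffer = '';
        }
    };

    return [$write, $flush];
}

// Simulated transfer: cURL would call $write with pieces of unpredictable size.
$received = [];
[$write, $flush] = makeFixedSizeWriter(4, function (string $chunk) use (&$received): void {
    $received[] = $chunk;
});

foreach (['abcdefg', 'hi', 'jklmnopqr'] as $data) {
    $write(null, $data);
}
$flush();
```

With a real request, `$write` would be set as the `CURLOPT_WRITEFUNCTION` option and `$flush()` called after `curl_exec()` returns. Note that the chunk size is counted in bytes, so a multibyte character may be split across two chunks.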

Another option is to offer a second class that uses sockets instead of cURL, so the package would have no dependencies at all. This class would be used only if cURL is not present or a specific chunk size is needed.

Any suggestions?

vaites commented 7 years ago

The feature is implemented in both the command line and web clients, but without chunk size support in the web client. I will add more tests and some new features before releasing it as stable (the next version will be 0.4.0).

You can test it if you want by requiring dev-master in your composer.json.

NinoSkopac commented 7 years ago

Hey, thanks for this! The 8k limitation is totally fine IMO.

vaites commented 7 years ago

Version 0.4.0 has just been released with support for callbacks in the text extraction methods. I hope it works as you need.

NinoSkopac commented 7 years ago

Great, this is a great feature, thank you sir.

This is how I implemented it months ago (because I needed to keep a progress bar updated based on how much text has been extracted thus far):

<?php
/**
 * Created by PhpStorm.
 * User: ninoskopac
 * Date: 23/04/2017
 * Time: 04:11
 */
// @TODO add strict
namespace Read2Me\Parsers;
use Read2Me\IO;
use Symfony\Component\Process\Process;
use Vaites\ApacheTika\Client as TikaClient;

/**
 * Class Document
 * @package Read2Me\Parsers
 *
 */
class Document
{
    public $options = [];

    protected $hostname;
    protected $port;

    private $binary = __DIR__ . '/../../../bin/tika-server-1.14.jar';
    private $client;

    public function __construct(string $hostname = '127.0.0.1', int $port = 9998)
    {
        $this->hostname = $hostname;
        $this->port = $port;

        if (!$this->isTikaServerRunning())
            $this->startTikeServer();

        $this->options[CURLOPT_TIMEOUT] = 100;
        $this->client = TikaClient::make($this->hostname, $this->port, $this->options);
    }

    public function getText(string $file, ?\Closure $writeFunction = null) : string {
        if ($writeFunction === null)
            return $this->client->getText($file);

        $chunks = '';
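        $this->options[CURLOPT_WRITEFUNCTION] = function($curl, $data) use(&$chunks, $writeFunction) {
            $chunks .= $data;
            $chunkLen = strlen($data);

            // report the new chunk, its length and the total length extracted so far
            $writeFunction($data, $chunkLen, strlen($chunks));

            // cURL expects the number of bytes handled
            return $chunkLen;
        };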
        $this->client->setOptions($this->options);
        $this->client->getText($file);

        return $chunks;
    }

    public function __call(string $name, array $arguments)
    {
        if (!method_exists($this->client, $name))
            throw new \RuntimeException(sprintf('%s method does not exist in %s', $name, get_class($this->client)));

        return $this->client->$name(...$arguments);
    }

    private function isTikaServerRunning() : bool {
        $sock = @fsockopen($this->hostname, $this->port, $errno, $errstr, 5);

        if (!$sock)
            return false;

        fclose($sock);

        return true;
    }

    private function startTikeServer() : void {
        if (!is_file($this->binary))
            throw new \RuntimeException(sprintf('%s file does not exist', $this->binary));

        $tempFile = IO::getTemporaryFile();
        $tempFileUri = $tempFile->getUri();

        // start Tika-Server
        $commandPattern = 'nohup java -jar %s --host=%s --port=%d >%s 2>&1 &';
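        // escape the arguments so paths containing spaces do not break the shell command
        $command = sprintf($commandPattern, escapeshellarg($this->binary), escapeshellarg($this->hostname), $this->port, escapeshellarg($tempFileUri));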
        $process = new Process($command);
        $process->run();

        while (true) {
            if ($this->isTikaServerRunning() || strpos($tempFile->getContents(), 'INFO: Started') !== false)
                break;

            sleep(1);
        }
    }
}
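
For reference, a progress callback for the class above could look roughly like this. It is only a sketch: the callback name, the sample chunks and the commented-out file path are made up for illustration, and the chunks are simulated so the snippet runs without a Tika server. The three arguments mirror what `getText()` passes to `$writeFunction`: the chunk, its length, and the total length extracted so far.

```php
<?php
declare(strict_types=1);

// Hypothetical progress callback: records and prints the running total.
$progress = [];
$onChunk = function (string $chunk, int $chunkLen, int $totalLen) use (&$progress): void {
    $progress[] = $totalLen;
    printf("Extracted %d bytes so far\n", $totalLen);
};

// With a running Tika server the callback would be passed like this (assumed usage):
//   $text = (new \Read2Me\Parsers\Document())->getText('/path/to/file.pdf', $onChunk);

// Simulated chunks standing in for what cURL would deliver.
$total = 0;
foreach (['Hello, ', 'chunked ', 'world'] as $chunk) {
    $total += strlen($chunk);
    $onChunk($chunk, strlen($chunk), $total);
}
```

Because the final text length is not known in advance, a callback like this can only report how much has been extracted so far, not a percentage.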