Closed NinoSkopac closed 7 years ago
A second parameter for getText() and other methods could be a callable, and each chunk could be passed to it. Let me run some tests...
Sounds like a good idea.
You could implement it like this
$options[CURLOPT_WRITEFUNCTION] = function($curl, $data) {
    print $data;
    // cURL expects the number of bytes handled; any other value aborts the transfer
    return strlen($data);
};
I've been working on this feature and almost have it. I found only one small problem: using Tika as a server and cURL to get the response, we can't define the chunk size, and the callback is called roughly every 8 KB (approximately, not always). This is not a huge limitation, but I still want to offer this feature.
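One possible workaround for the missing chunk-size option, as a rough sketch only: re-buffer whatever cURL delivers and hand fixed-size pieces to the consumer. `fixedSizeChunker` is a hypothetical helper, not something the package provides, and the loop at the bottom only simulates cURL calling the write callback with uneven pieces.

```php
<?php
// Hypothetical helper (not part of the package): wraps a consumer so it gets
// pieces of exactly $size bytes, whatever sizes cURL happens to deliver.
// Returns [callback usable as CURLOPT_WRITEFUNCTION, flush callback].
function fixedSizeChunker(int $size, callable $onChunk): array
{
    if ($size < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1 byte');
    }

    $buffer = '';

    $write = function ($curl, string $data) use (&$buffer, $size, $onChunk): int {
        $buffer .= $data;
        // Emit as many full-size pieces as the buffer currently holds.
        while (strlen($buffer) >= $size) {
            $onChunk(substr($buffer, 0, $size));
            $buffer = (string) substr($buffer, $size);
        }
        // Report every received byte as handled, otherwise cURL aborts.
        return strlen($data);
    };

    // Hands over whatever is left once the transfer has finished.
    $flush = function () use (&$buffer, $onChunk): void {
        if ($buffer !== '') {
            $onChunk($buffer);
            $buffer = '';
        }
    };

    return [$write, $flush];
}

// Simulation: three unevenly sized pieces, as cURL might deliver them.
$received = [];
[$write, $flush] = fixedSizeChunker(4, function (string $chunk) use (&$received): void {
    $received[] = $chunk;
});

foreach (['abcdef', 'g', 'hijklm'] as $piece) {
    $write(null, $piece);
}
$flush();

print_r($received);
```

In real use the `$write` closure would be assigned to `$options[CURLOPT_WRITEFUNCTION]`, and `$flush` called after the request returns. The trade-off is that the consumer sees text slightly later than cURL receives it.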
Another option is to offer another class that uses sockets instead of cURL, so the package would have no dependencies at all. This class would be used only if cURL is not present or a specific chunk size is needed.
Any suggestions?
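For the socket idea, here is a rough sketch of the reading side only. `readInChunks` is a hypothetical name, the HTTP request and response-header handling that a real socket client needs are left out, and an in-memory stream stands in for the connection so the snippet runs without a Tika server.

```php
<?php
// Hypothetical helper: reads $stream to the end and hands the consumer
// pieces of at most $chunkSize bytes (only the last piece may be shorter).
function readInChunks($stream, int $chunkSize, callable $onChunk): int
{
    $total = 0;
    while (!feof($stream)) {
        // Reads up to $chunkSize bytes, stopping early only at EOF or on a timeout.
        $chunk = stream_get_contents($stream, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $total += strlen($chunk);
        $onChunk($chunk);
    }
    return $total;
}

// An in-memory stream stands in for a socket opened with fsockopen().
$stream = fopen('php://memory', 'w+');
fwrite($stream, 'abcdefghij');
rewind($stream);

$pieces = [];
$bytes = readInChunks($stream, 4, function (string $chunk) use (&$pieces): void {
    $pieces[] = $chunk;
});
fclose($stream);

print_r($pieces);
```

With a stream like this the chunk size is whatever the caller asks for, which is the part cURL does not let us control.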
The feature is implemented in both the command line and web clients, but without chunk size support in the web client. I will add more tests and some new features before releasing it as stable (the next version will be 0.4.0).
You can test it if you want using dev-master in your composer.json.
Hey, thanks for this! The 8k limitation is totally fine IMO.
Version 0.4.0 has just been released with support for callbacks in the text extraction methods. Hope it works as you need.
Great, this is a great feature, thank you sir.
This is how I implemented it months ago (because I needed to keep a progress bar updated based on how much text has been extracted thus far):
<?php
/**
 * Created by PhpStorm.
 * User: ninoskopac
 * Date: 23/04/2017
 * Time: 04:11
 */

// @TODO add strict

namespace Read2Me\Parsers;

use Read2Me\IO;
use Symfony\Component\Process\Process;
use Vaites\ApacheTika\Client as TikaClient;

/**
 * Class Document
 * @package Read2Me\Parsers
 */
class Document
{
    public $options = [];
    protected $hostname;
    protected $port;
    private $binary = __DIR__ . '/../../../bin/tika-server-1.14.jar';
    private $client;

    public function __construct(string $hostname = '127.0.0.1', int $port = 9998)
    {
        $this->hostname = $hostname;
        $this->port = $port;

        if (!$this->isTikaServerRunning())
            $this->startTikaServer();

        $this->options[CURLOPT_TIMEOUT] = 100;
        $this->client = TikaClient::make($this->hostname, $this->port, $this->options);
    }

    public function getText(string $file, ?\Closure $writeFunction = null) : string {
        if ($writeFunction === null)
            return $this->client->getText($file);

        $chunks = '';

        // cURL calls this once per received chunk and expects the number of bytes handled
        $this->options[CURLOPT_WRITEFUNCTION] = function($curl, $data) use(&$chunks, $writeFunction) {
            $chunks .= $data;
            $chunkLen = strlen($data);

            // pass the chunk, its length, and the total length extracted so far
            $writeFunction($data, $chunkLen, strlen($chunks));

            return $chunkLen;
        };

        $this->client->setOptions($this->options);
        $this->client->getText($file);

        return $chunks;
    }

    // forward every other call to the wrapped Tika client
    public function __call(string $name, array $arguments)
    {
        if (!method_exists($this->client, $name))
            throw new \RuntimeException(sprintf('%s method does not exist in %s', $name, get_class($this->client)));

        return $this->client->$name(...$arguments);
    }

    private function isTikaServerRunning() : bool {
        $sock = @fsockopen($this->hostname, $this->port, $errno, $errstr, 5);

        if (!$sock)
            return false;

        fclose($sock);
        return true;
    }

    private function startTikaServer() : void {
        if (!is_file($this->binary))
            throw new \RuntimeException(sprintf('%s file does not exist', $this->binary));

        $tempFile = IO::getTemporaryFile();
        $tempFileUri = $tempFile->getUri();

        // start Tika-Server in the background, logging to the temporary file
        $commandPattern = 'nohup java -jar %s --host=%s --port=%d >%s 2>&1 &';
        $command = sprintf(
            $commandPattern,
            escapeshellarg($this->binary),
            escapeshellarg($this->hostname),
            $this->port,
            escapeshellarg($tempFileUri)
        );
        $process = new Process($command);
        $process->run();

        // wait until the server accepts connections or logs that it has started
        while (true) {
            if ($this->isTikaServerRunning() || strpos($tempFile->getContents(), 'INFO: Started') !== false)
                break;

            sleep(1);
        }
    }
}
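For the progress-bar use case, the callback side might look roughly like this. `makeProgressReporter` is a hypothetical name; it only assumes the three arguments (`$data`, `$chunkLen`, total length so far) that `Document::getText()` above passes to its `$writeFunction`. The loop simulates two chunks arriving, since a real call needs a running Tika server.

```php
<?php
// Hypothetical consumer matching the ($data, $chunkLen, $totalLen) arguments
// that Document::getText() passes to its $writeFunction.
function makeProgressReporter(array &$log): \Closure
{
    return function (string $data, int $chunkLen, int $totalLen) use (&$log): void {
        // A real progress bar would be redrawn here; this sketch just records a line.
        $log[] = sprintf('+%d bytes, %d extracted so far', $chunkLen, $totalLen);
    };
}

$log = [];
$report = makeProgressReporter($log);

// Simulation of two chunks arriving from the server.
$total = 0;
foreach (['Hello ', 'world'] as $chunk) {
    $total += strlen($chunk);
    $report($chunk, strlen($chunk), $total);
}

print_r($log);
```

With the class above, the same closure would be passed as the second argument to `getText()`. A percentage is not available this way, because the total length of the extracted text is unknown until extraction finishes.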
Hey, is it possible to get chunks of text as soon as they're processed, as opposed to getting one large piece of text after processing is done?