vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
MIT License
114 stars 22 forks source link

Improve the extensibility of the WebClient #17

Closed svaningelgem closed 5 years ago

svaningelgem commented 5 years ago

Right now there is 'huge' switch statement around line 500. The best would be to split this off and call this. Much like this:


    protected function getResourcesForType($type, $file = null) {
      $resource = '';
      $headers  = [];

        switch($type)
        {
            case 'html':
                $resource = 'tika';
                $headers[] = 'Accept: text/html';
                break;

            case 'lang':
                $resource = 'language/stream';
                break;

            case 'mime':
                $name = basename($file);
                $resource = 'detect/stream';
                $headers[] = "Content-Disposition: attachment, filename=$name";
                break;

            case 'meta':
                $resource = 'meta';
                $headers[] = 'Accept: application/json';
                break;

            case 'text':
                $resource = 'tika';
                $headers[] = 'Accept: text/plain';
                break;

            case 'text-main':
                $resource = 'tika/main';
                $headers[] = 'Accept: text/plain';
                break;

            case 'detectors':
            case 'parsers':
            case 'mime-types':
            case 'version':
                $resource = $type;
                break;

            default:
                throw new Exception("Unknown type $type");
        }

        return [$resource, $headers];
    }

    /**
     * Get the parameters to make the request
     *
     * @link    https://wiki.apache.org/tika/TikaJAXRS#Specifying_a_URL_Instead_of_Putting_Bytes
     * @param   string  $type
     * @param   string  $file
     * @return  array
     * @throws  \Exception
     */
    protected function getParameters($type, $file = null)
    {
        $headers = [];

        if(!empty($file) && preg_match('/^http/', $file))
        {
            $headers[] = "fileUrl:$file";
        }

        return $this->getResourcesForType($type, $file);
    }

The advantage of this is that you could override the getResourcesForType() in your own class much like this:

    protected function getResourcesForType($type, $file = null) {
      if ( $type == 'rmeta/text' ) {
        return [$type, ['Accept: application/json']];
      }
      else {
        return parent::getResourcesForType($type, $file);
      }
    }

As such you could override/implement your own behaviours.

Then additionally we also need to override when you get a 200 back:

        if($status == 200)
        {
            if($type == 'meta')
            {
                $response = Metadata::make($response, $file);
            }

            // cache certain responses
            if(in_array($type, ['lang', 'meta']))
            {
                $this->cache[sha1($file)][$type] = $response;
            }
        }
        if($status == 200)
        {
            $response = $this->interpretSuccesfulResponse($type, $response, $file);
            $this->cacheResponse($type, $file, $response);
        }

both could henceforth be able to overridden in sub-classes.

svaningelgem commented 5 years ago

patch.txt

vaites commented 5 years ago

Hi @svaningelgem, I took a look to this:

  1. I think split getParameters method makes no sense, because exactly the same override can be done without doing this
  2. A valid response is only a 200 status code, any other code is an error or an unexpected response
  3. Cached response can differ between users, so I added a method to check if a type is cacheable that can be overriden

Thanks for your contribution

vaites commented 5 years ago

I also added a few methods to allow users to override the cache layer (see daf5ec471871ba845781792ae97020f36e041c2f)