vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
MIT License

getRecursiveMetadata discards data #23

Closed adjenks closed 4 years ago

adjenks commented 5 years ago

Let me start by saying great library. Well written.

The issue is that Client->getRecursiveMetadata() discards all but the first document.

getRecursiveMetadata() calls Metadata::make($response, $file); and make() is defined like so:

    /**
     * Return an instance of Metadata based on content type
     *
     * @param   string  $response
     * @param   string  $file
     * @return  \Vaites\ApacheTika\Metadata\Metadata
     * @throws  \Exception
     */
    public static function make($response, $file)
    {
        // an empty response throws an error
        if(empty($response) || trim($response) == '')
        {
            throw new Exception('Empty response');
        }

        // decode the JSON response
        $json = json_decode($response);

        // get the meta info
        $meta = is_array($json) ? current($json) : $json; //------ Only keeps first element if it's an array.

        // exceptions if metadata is not valid
        if(json_last_error())
        {
            $message = function_exists('json_last_error_msg') ? json_last_error_msg() : 'Error parsing JSON response';

            throw new Exception($message, json_last_error());
        }

        // get content type
        $mime = is_array($meta->{'Content-Type'}) ? current($meta->{'Content-Type'}) : $meta->{'Content-Type'};

        // instance based on content type
        switch(current(explode('/', $mime)))
        {
            case 'image':
                $instance = new ImageMetadata($meta, $file); //------ Only uses first element if array
                break;

            default:
                $instance = new DocumentMetadata($meta, $file); //------ Only uses first element if array
        }

        return $instance; //------ Instance built and returned using only first element if array
    }

(Lines of interest highlighted with //------)

I would suggest refactoring getRecursiveMetadata() to return an array, or a new class such as IterableMetadata that implements PHP's Iterator or IteratorAggregate interface.
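A minimal sketch of the array option, to show the idea rather than a finished patch: decode the response once, check for JSON errors before using the result, and keep every element instead of calling current(). The function name decodeAllMetadata() and the sample response are made up for illustration and are not part of the library.

```php
<?php
// Hypothetical helper (not part of php-apache-tika): keeps every element of a
// recursive metadata response instead of only the first one.
function decodeAllMetadata(string $response): array
{
    // an empty response throws an error, as in make()
    if (trim($response) === '') {
        throw new Exception('Empty response');
    }

    // decode the JSON response
    $json = json_decode($response);

    // check for decoding errors before touching the decoded value
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new Exception(json_last_error_msg(), json_last_error());
    }

    // a single object becomes a one-element list; an array is kept whole
    return is_array($json) ? $json : [$json];
}

// made-up sample response: a container followed by one embedded document
$response = '[{"Content-Type":"message/rfc822"},{"Content-Type":"application/msword"}]';
$documents = decodeAllMetadata($response);
```

Each element of $documents could then be passed to the existing image/document switch on its own, so the caller gets one metadata instance per embedded file.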

adjenks commented 5 years ago

To create an example file, just attach a Word document to an email and parse the email.

vaites commented 5 years ago

Thanks @adjenks, I will take a look. I think you're right... This method was implemented to save calls, without multiple files in mind. I need to think about how to implement it...

adjenks commented 5 years ago

@vaites Cool, thank you for making the library.

vaites commented 5 years ago

Can I ask which version of PHP you are using?

adjenks commented 5 years ago

@vaites PHP 7.1.33

vaites commented 4 years ago

Thanks @adjenks, I'm working on the 1.0.0 release, which will drop support for PHP 5. The minimum version will be 7.1, and this fix must go into a new major release because it is a breaking change. Do you agree?

adjenks commented 4 years ago

Yes, I think this feature is required and it is a breaking change, so the change should be made and the major version bumped. All sounds good to me.

vaites commented 4 years ago

OK thanks. Will try to release the new version this month.

adjenks commented 4 years ago

Have you made any progress?

vaites commented 4 years ago

Hi @adjenks, this is a breaking change to the 0.x branch, so I plan to add it to the 1.x branch, which will be compatible with PHP 7 and include more breaking changes. Development is going slower than I expected, sorry about that.

vaites commented 4 years ago

I've just uploaded an initial version of this feature. The 1.x branch is almost ready (I only need to test it more) and you can try it by requiring dev-master in your composer.json. If you can take a look and tell me whether this is the behaviour you expect, it will help me a lot.

vaites commented 4 years ago

I apologize for the delay in resolving this issue. Version 1.0 has been published, and Client->getRecursiveMetadata() now returns an array with the metadata of each file.

The returned array looks like this for a file called sample.zip:

    [
        'sample.zip' => new Metadata(),
        'sample.zip/file.docx' => new DocumentMetadata()
    ]
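
As a rough illustration of how a caller might walk that structure, the sketch below uses a stand-in array rather than a real call to the library. The stdClass objects and their mime property are placeholders assumed for this example; only the key layout is taken from the array shown above.

```php
<?php
// Stand-in for the value returned by getRecursiveMetadata() for sample.zip.
// The stdClass objects and the "mime" property are placeholders for this
// sketch; the real entries are Metadata instances.
$metadata = [
    'sample.zip'           => (object) ['mime' => 'application/zip'],
    'sample.zip/file.docx' => (object) ['mime' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'],
];

// Keys are paths: the container itself, then each embedded file below it.
// Collect only the embedded files by skipping the container's own key.
$embedded = [];
foreach ($metadata as $path => $meta) {
    if ($path !== 'sample.zip') {
        $embedded[$path] = $meta;
    }
}
```

Because every entry is keyed by its path, code that previously received a single object needs to loop over the result or pick an entry by key.
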

adjenks commented 4 years ago

Awesome. Sorry I couldn't test it, I haven't been working with Tika for a while. Probably later. Good work.

vaites commented 4 years ago

Don't worry, I added some tests to the PHPUnit suite. I hope it solves your issue.