vaites / php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
MIT License

Add support for XML output #27

Closed: Req closed this issue 3 years ago

Req commented 3 years ago

Pretty self-explanatory, please add support for XML output :)

Tika already supports this, and the output is basically HTML, but XML and HTML differ enough that I can't extract HTML from a document and expect it to be well-formed XML.

vaites commented 3 years ago

Thanks for your suggestion @Req, but I don't think I understand what you are trying to obtain. Can you explain it with an example? If you extract the formatted content of a document, the result will be HTML and will not be XML-compliant, even if the response is wrapped inside an XML document.

I think only the rmeta command offers something like what you want: https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-MetadataResource (XMP format)

Req commented 3 years ago

> I can't understand what you are trying to obtain. Can you explain it with an example?

Sorry for being unclear @vaites. I haven't used the server version, but the regular Tika jar can output text, HTML or XML, depending on which of these content output options you specify:

        -x  or --xml           Output XHTML content (default)
        -h  or --html          Output HTML content
        -t  or --text          Output plain text content
        -T  or --text-main     Output plain text content (main content only)

While HTML is not XML, XHTML indeed is XML and can be parsed as such.

Source: https://tika.apache.org/1.4/gettingstarted.html
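To illustrate the difference, here is a minimal sketch in Python (chosen only for brevity; any XML parser behaves similarly). The two fragments are hypothetical and are not real Tika output. The sketch assumes the HTML variant uses a named entity and an unclosed tag, both of which a strict XML parser rejects:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical fragments for illustration only, not real Tika output.
xhtml = (
    f'<html xmlns="{XHTML_NS}"><body>'
    '<p>caf&#233;<br/>second line</p>'
    '</body></html>'
)
html = '<html><body><p>caf&eacute;<br>second line</p></body></html>'


def parses_as_xml(markup: str) -> bool:
    """Return True if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(xhtml))  # True: numeric character reference, self-closed <br/>
print(parses_as_xml(html))   # False: &eacute; is not defined in XML
```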

vaites commented 3 years ago

Thanks, will take a look. I think the best approach is to add a method to return the raw response and then some decorators to return XML and JSON.

Will keep you informed...

vaites commented 3 years ago

The output of -x and -h (the one used by this library) is almost identical. I tried it with the test samples in the samples folder:

java -jar bin/tika-app-1.24.1.jar --xml samples/sample6.pdf > xml.output
java -jar bin/tika-app-1.24.1.jar --html samples/sample6.pdf > html.output
diff html.output xml.output

The only differences are the encoded characters, so I don't understand the need to add this format. Please help me understand this request...

Req commented 3 years ago

Encoded characters break XML parsers. My use case is that I grab the XML, parse it, query it for pages and grab the content there. This way I can easily get an array where the items correlate to PDF pages.

This could be done with an HTML parser as well, I know, but at least for me XML parsing is much more familiar.
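A rough sketch of that workflow, written in Python for illustration. It assumes the XHTML wraps each PDF page in a `<div class="page">` element inside the XHTML namespace; the `SAMPLE` string and the `pages_from_xhtml` helper are hypothetical stand-ins for the real `--xml` output and whatever code consumes it:

```python
import xml.etree.ElementTree as ET

# Namespace map for XPath queries; "x" is an arbitrary prefix.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical sample standing in for the XHTML of a two-page PDF.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<div class="page"><p>First page text</p></div>
<div class="page"><p>Second page,</p><p>two paragraphs</p></div>
</body>
</html>"""


def pages_from_xhtml(xhtml: str) -> list:
    """Return one whitespace-normalized text string per page div."""
    root = ET.fromstring(xhtml)
    pages = []
    # Assumption: each page is a <div class="page"> element.
    for div in root.iterfind(".//x:div[@class='page']", XHTML_NS):
        text = " ".join(" ".join(div.itertext()).split())
        pages.append(text)
    return pages


pages = pages_from_xhtml(SAMPLE)
print(len(pages))  # 2
print(pages)       # ['First page text', 'Second page, two paragraphs']
```

The list index then corresponds to the page position (index 0 is the first page), provided the page markers are present in the output.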

vaites commented 3 years ago

OK, no problem, I think I can add a Client::getXML() method to return it. The server doesn't seem to have this feature, so I will try to modify the HTML output to produce the same result...

vaites commented 3 years ago

Well, I added the Client::getXHTML() method, but only for the CLI client, because the server does not support it and modifying the HTML output seems difficult to maintain.

Version 1.0.1 will be available soon. I hope it solves your problem...

Req commented 3 years ago

Very cool, thank you!