Problem with non UTF-8 encoded PDFs

sparx82 commented 8 years ago

Hi,

Parsing a PDF doesn't work if the strings inside the PDF are not UTF-8 encoded.

I have a PDF for which the PDF parser returns a string containing special characters in ANSI encoding (that's what Notepad++ reports). The document is not indexed, possibly because the string is expected to be UTF-8 (document/pdf.php, lines 45 and 47). If I convert the string ($body) to UTF-8 with utf8_encode first, it works, but this is not a suitable solution because strings that are already UTF-8 would be encoded a second time. I think $body should be converted to UTF-8 only if it is in a different encoding, before it is added to the body field.

This is what I did, which works with non-UTF-8 strings:

$body = $pdf->getText();

$body = utf8_encode($body);

// Store contents
if ($storeContent) {
   $this->addField(Document\Field::Text('body', $body, 'UTF-8'));
} else {
   $this->addField(Document\Field::UnStored('body', $body, 'UTF-8'));
}
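
To avoid the double-encoding problem mentioned above, the conversion could be guarded by a validity check. The following is only a minimal sketch: toUtf8 is a hypothetical helper (it is not part of search_lucene or the PDF parser), it needs the mbstring extension, and it assumes that anything that is not valid UTF-8 is ISO-8859-1, which may not hold for every PDF.

```php
<?php
// Hypothetical helper, not part of search_lucene or the PDF parser.
// Assumption: input that is not valid UTF-8 is encoded as ISO-8859-1.
function toUtf8(string $text, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // Valid UTF-8 (including plain ASCII) is returned unchanged,
    // so already-encoded strings are not encoded a second time.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Otherwise convert from the assumed fallback encoding.
    return mb_convert_encoding($text, 'UTF-8', $fallbackEncoding);
}

$latin1 = "caf\xE9";     // "café" as ISO-8859-1 bytes
$utf8   = "caf\xC3\xA9"; // "café" as UTF-8 bytes

var_dump(toUtf8($latin1) === $utf8); // converted to UTF-8
var_dump(toUtf8($utf8) === $utf8);   // left untouched
```

In pdf.php this would be used as $body = toUtf8($pdf->getText()); in place of the utf8_encode call. Note that the check is a heuristic: a few ISO-8859-1 byte sequences also happen to be valid UTF-8 and would be left unconverted.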

DeepDiver1975 commented 8 years ago

Please submit a PR so that we can review and test the change. Thanks a lot!

sparx82 commented 8 years ago

I can do that, but I need to figure out what's really wrong first. Just using utf8_encode is definitely not a proper solution, as it only converts ISO-8859-1 to UTF-8. A quick Google search suggests there may not be an easy solution, and as I'm not a character encoding specialist, this may take a while...

sparx82 commented 8 years ago

I did some further investigation, and it showed that this is either a bug in the PDF parser or my PDF does not conform to the standard. The PDF parser should always return a UTF-8 encoded string, but it doesn't do so for my PDF file.

owncloud-archive / search_lucene

Problem with non UTF-8 encoded PDFs #115