Closed: sparx82 closed this issue 8 years ago
Please submit a PR so that we can review and test the change. Thanks a lot!
I can do that, but I need to figure out what's really wrong first. Just using utf8_encode is definitely not a proper solution, as it only converts ISO-8859-1 to UTF-8. A quick Google search suggested that there may not be an easy solution, and as I'm not a character encoding specialist, this may take a while...
I did some further investigation, and it showed that this is either a bug in the PDF parser or my PDF does not conform to the standard. The PDF parser should always return a UTF-8 encoded string, but it doesn't do that with my PDF file.
Hi,
Parsing a PDF doesn't work if the strings inside the PDF are not UTF-8 encoded.
I have a PDF for which the PDF parser returns an ANSI-encoded string (that's what Notepad++ tells me) containing special characters. This document is not indexed, probably because the string is expected to be UTF-8 (document/pdf.php, lines 45 and 47). If I convert the string ($body) to UTF-8 with utf8_encode first, it works, but this is not a suitable solution because strings that are already UTF-8 would be encoded a second time. I think the string ($body) should be converted to UTF-8 only if it is in a different encoding, before it is added to the body field.
This is what I did, which works with non-UTF-8 strings:
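A minimal sketch of that kind of check, not necessarily the exact change that was made. It assumes the mbstring extension is available and that any input which is not valid UTF-8 is ISO-8859-1 (the same assumption utf8_encode makes). The function name `ensureUtf8` and its `$fallbackEncoding` parameter are hypothetical and not part of the project:

```php
<?php
// Hypothetical helper: convert $body to UTF-8 only when it is not already valid UTF-8.
function ensureUtf8(string $body, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // Already valid UTF-8 (this includes plain ASCII): return unchanged,
    // so the string is never encoded twice.
    if (mb_check_encoding($body, 'UTF-8')) {
        return $body;
    }

    // Assumption: anything else is in $fallbackEncoding. The real source
    // encoding cannot be detected reliably from the bytes alone.
    return mb_convert_encoding($body, 'UTF-8', $fallbackEncoding);
}
```

Calling `$body = ensureUtf8($body);` before the string is added to the body field would leave UTF-8 input untouched and convert everything else. If the "ANSI" reported by Notepad++ is really Windows-1252, passing `'Windows-1252'` as the second argument may give better results for characters such as the euro sign.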