smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

phpunit pdf tofu characters detection #741

Open 8ctopus opened 1 month ago

8ctopus commented 1 month ago

I'm trying to design a PHPUnit test that detects tofu characters in a generated PDF. (Tofu characters are the placeholder boxes that appear when none of the fonts included in the PDF supports the language used in its text.)

First, I tried extracting the PDF text. However, the getText method always returns the correct Unicode text, even when tofu characters are visible in the rendered PDF.

Second, I've considered listing the fonts available in the PDF and simply checking that all the required fonts are present.

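use Smalot\PdfParser\Parser as PdfParser;

$parser = new PdfParser();
$document = $parser->parseFile($pdf);

// List every font object declared in the document.
$fonts = $document->getFonts();

foreach ($fonts as $font) {
    // getDetails() returns the font's properties; print them for review.
    print_r($font->getDetails());
}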
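
The comparison step of that font check could look roughly like the sketch below. It is plain PHP and makes no library calls: `findMissingFonts` is a hypothetical helper, and it assumes the font names have already been collected from the font objects as strings (for example from each font's `BaseFont` entry). Subset-embedded fonts carry a prefix of six uppercase letters and a plus sign, so the sketch strips that prefix before comparing.

```php
<?php

declare(strict_types=1);

/**
 * Returns the required font names that are absent from the PDF's font list.
 * Hypothetical helper for illustration, not part of smalot/pdfparser.
 *
 * @param string[] $pdfFontNames  names found in the PDF (assumed input)
 * @param string[] $requiredFonts names the PDF is expected to embed
 * @return string[] required names with no match in the PDF
 */
function findMissingFonts(array $pdfFontNames, array $requiredFonts): array
{
    // Strip the subset tag, e.g. "ABCDEF+NotoSans-Regular" becomes "NotoSans-Regular".
    $normalized = array_map(
        static fn (string $name): string => preg_replace('/^[A-Z]{6}\+/', '', $name),
        $pdfFontNames
    );

    // array_values() reindexes the result so it compares cleanly in assertions.
    return array_values(array_diff($requiredFonts, $normalized));
}

// Example with made-up font names: one required font is present, one is not.
$found = ['ABCDEF+NotoSans-Regular', 'Helvetica'];
$missing = findMissingFonts($found, ['NotoSans-Regular', 'NotoSansCJKjp-Regular']);
// $missing === ['NotoSansCJKjp-Regular']
```

In a PHPUnit test the result could be checked with `$this->assertSame([], $missing)`. This only verifies that the expected fonts are present by name; it does not prove that each font contains a glyph for every character used.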

Would anyone have a better approach to suggest?