Wrong characters: can this be detected?

davidribatto commented 7 months ago

PHP Version: 8.2
PDFParser Version: 2.9

Description:

I want to parse some CVs, and sometimes the extracted text contains wrong characters. I would like to parse the PDF correctly and, if that is not possible because the text contains wrong characters, return an empty string in $text.

PDF input

curriculum_vitae_Victor-Faria.pdf

Expected output & actual output

The result �� ue Oscar--Bider 104b, 1950 Sion �� ;��2�� ;��4��;�� ;��2�� ;��;��; �� ;�� ;�� @�� ;��;��;� ��;��;�� ;�� ;��;�� ;�� ;��;��;��;�� ;�� ;��;� �� ;�� ;�� M�� ; �� ;�� ;�� ;�� ;�� @��;��;��;� ��;��;�� ;�� Cap 3D �� 3

Code

$parser = new \Smalot\PdfParser\Parser();
$text = ''; // default when parsing fails or the PDF has no pages

try {
    $pdf = $parser->parseFile($pdf_temp_path);
    $pages = $pdf->getPages();
    if (!empty($pages)) {
        // Only the first page is extracted.
        $text = $pages[0]->getText();
    }
} catch (\Exception $e) {
    echo 'Error while checking the PDF: ' . $e->getMessage();
}

// Replace escape sequences left in the extracted text with their literal characters.
$text = str_replace(
    array('\\\\', '\(', '\)', '\n', '\r', '\t', '\f', '\ '),
    array('\\', '(', ')', "\n", "\r", "\t", "\f", ' '),
    $text
);
return $text;

Thanks for your help

GreyWyvern commented 7 months ago

Some kind of decoding issue for sure. 2.7.0 has it, as well as 2.9.0.

Maybe another Identity-H problem?

davidribatto commented 7 months ago

Have you tried with the file? Do you get the same result?

Is there a way to detect whether the text contains wrong characters? I tested with a regex, but the results were not conclusive.
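
One possible approach, as a minimal sketch rather than anything built into PdfParser: treat the text as garbled when it is not valid UTF-8, or when too large a share of its characters are U+FFFD replacement characters or control characters. The function name `looksGarbled` and the 5% threshold are hypothetical choices for illustration, and the sketch assumes the mbstring extension is available. It would not catch text that was decoded into valid but wrong letters.

```php
<?php
// Hypothetical helper, not part of smalot/pdfparser.
// Returns true when the text looks like a failed decode.
function looksGarbled(string $text, float $maxBadRatio = 0.05): bool
{
    if ($text === '') {
        return false; // nothing to judge
    }

    // Byte sequences that are not valid UTF-8 count as garbled outright.
    if (!mb_check_encoding($text, 'UTF-8')) {
        return true;
    }

    $total = mb_strlen($text, 'UTF-8');

    // Count U+FFFD replacement characters and control characters
    // other than tab, line feed and carriage return.
    $bad = preg_match_all('/[\x{FFFD}\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', $text);

    // Assumed threshold: more than 5% suspicious characters means garbled.
    return ($bad / $total) > $maxBadRatio;
}
```

With a helper like this, the extraction code could end with `if (looksGarbled($text)) { $text = ''; }` before returning. The threshold probably needs tuning against real CVs, since a single stray character in an otherwise clean page should not discard the whole text.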
