nisaacson / pdf-extract

Node PDF Extract
MIT License

Error extracting data from PDF document: "No current point in closepath" #45

Open DigitalLeaves opened 1 year ago

DigitalLeaves commented 1 year ago

Hello, first of all, thanks a lot for the awesome work on this library. We have been using it for some time and are quite amazed by the work you have done here. Today we ran into this error with an apparently perfectly fine PDF:

  error: 'Syntax Error (30523): No current point in closepath\n' +
    'Syntax Error (30538): No current point in closepath\n' +
    'Syntax Error (30556): No current point in closepath\n' +
    'Syntax Error (30566): No current point in closepath\n',
  pdf_path: '../samples/515317730_121477412.pdf'

This is a searchable/text PDF, so it is being processed with pdfOCR using the following options:

const ocrSearchableOptions = {
    type: 'text', // extract searchable text from PDF
    ocr_flags: ['--psm 1'],
    enc: 'UTF-8',
    mode: 'layout'
};

I can provide the PDF if needed to analyze it. Any help is greatly, greatly appreciated 🙏. Thanks a lot in advance.

DigitalLeaves commented 1 year ago

Any info or help with this would be greatly appreciated 🙏

DigitalLeaves commented 1 year ago

The issue seems to be related to the PDF itself, which may be "broken" as far as pdftotext is concerned. However, other tools and software seem to read it without problems (for example, Node PDF Text).

DigitalLeaves commented 1 year ago

More info: the pdf-text-extract package uses pdftotext too, but it seems to work with these files.
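
Since another pdftotext-based package reads the same files, one possible explanation is that the "Syntax Error (...)" lines are non-fatal diagnostics written to stderr, and that the difference lies in how each wrapper treats stderr output. That is an assumption, not something confirmed in this thread. The sketch below is not pdf-extract's code; `classifyStderr` and its return shape are hypothetical names used only to illustrate separating such lines from other stderr output, using the messages quoted in the report above.

```javascript
// Sketch only. Assumption: lines of the form
// "Syntax Error (<number>): <message>" are non-fatal diagnostics.
// classifyStderr is a hypothetical helper, not part of pdf-extract.
const SYNTAX_ERROR_LINE = /^Syntax Error \((\d+)\): (.+)$/;

function classifyStderr(stderr) {
  const warnings = []; // lines matching the "Syntax Error (...)" pattern
  const other = [];    // anything else, which a caller may want to treat as fatal
  for (const line of stderr.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '') continue; // skip blank lines, e.g. after the last "\n"
    const match = SYNTAX_ERROR_LINE.exec(trimmed);
    if (match) {
      // The number in parentheses is kept as reported; its meaning is not interpreted here.
      warnings.push({ position: Number(match[1]), message: match[2] });
    } else {
      other.push(trimmed);
    }
  }
  return { warnings, other };
}

// The stderr text quoted in the original report.
const sample =
  'Syntax Error (30523): No current point in closepath\n' +
  'Syntax Error (30538): No current point in closepath\n' +
  'Syntax Error (30556): No current point in closepath\n' +
  'Syntax Error (30566): No current point in closepath\n';

const result = classifyStderr(sample);
console.log(result.warnings.length, result.other.length);
```

A wrapper could apply a check like this to the stderr of the pdftotext process and fail only when `other` is non-empty or the process exits with a non-zero code. Whether that is safe for a given PDF would still need to be checked against the text actually extracted.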