jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.31k stars 647 forks

Extract tables randomly moves the bbox 20 pixels away #810

Closed merionum closed 1 year ago

merionum commented 1 year ago

Discussed in https://github.com/jsvine/pdfplumber/discussions/809

Originally posted by **merionum** February 10, 2023

Hi guys! Thanks for the awesome library! I am doing a large parsing task and I noticed that for some of my documents I randomly encounter this problem where the bboxes are just moved away 20 pixels in both dimensions. Could you help me to understand what goes wrong here please?

![image psd(1)](https://user-images.githubusercontent.com/21173351/217958635-0d3e2817-e6f1-4cdc-b915-80aaa9d364a0.png)

jsvine commented 1 year ago

Thanks for the kind words about pdfplumber, @merionum! Strange thing you're seeing. It seems the PDF is malformed. Repairing it fixes the issue:

gs -o no_codes-3-1-repaired.pdf \
  -sDEVICE=pdfwrite \
  -dPDFSETTINGS=/prepress \
  no_codes-3-1.pdf

The following errors were encountered at least once while processing this file:
        error reading a stream

   **** This file had errors that were repaired or ignored.
   **** The file was produced by:
   **** >>>> iLovePDF <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.
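
For a large batch of documents, the same repair step could be scripted before parsing. The following is only a minimal sketch: it assumes Ghostscript is installed and reachable as `gs`, and the helper names `build_gs_repair_command` and `repair_pdf` are hypothetical, not part of pdfplumber. It passes the same flags as the command above.

```python
import shutil
import subprocess
from pathlib import Path


def build_gs_repair_command(src, dst, gs_binary="gs"):
    """Build the Ghostscript argument list that rewrites `src` into `dst`.

    Hypothetical helper: it mirrors the flags used in the command above.
    """
    return [
        gs_binary,
        "-o", str(dst),             # output file (also implies batch mode)
        "-sDEVICE=pdfwrite",        # re-emit the document as a fresh PDF
        "-dPDFSETTINGS=/prepress",  # preset used in the command above
        str(src),
    ]


def repair_pdf(src, dst, gs_binary="gs"):
    """Run Ghostscript on `src` and return the path of the repaired copy.

    Assumes a Ghostscript executable named `gs_binary` is on PATH.
    """
    if shutil.which(gs_binary) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs_binary}")
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_repair_command(src, dst, gs_binary), check=True)
    return Path(dst)
```

The repaired file can then be opened as usual with `pdfplumber.open(...)` and passed to `page.extract_tables()`. Whether this removes the offset for every affected document is not guaranteed; it worked for the file in this issue.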
