jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

KeyError: "N" when I try to extract a table from a split PDF #1204

Closed Readix1 closed 2 months ago

Readix1 commented 2 months ago

Describe the bug

Long PDFs take too long to open, so I use the PyPDF2 library to keep only the pages that contain a table I need. I then open the resulting PDF, which has only a few pages and therefore opens faster. Unfortunately, in some cases this error occurs.

Have you tried repairing the PDF?

I can't install Ghostscript because of permission restrictions, and I would rather not have to install an additional application alongside mine just to run it.

Code to reproduce the problem

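import os

import pdfplumber
from PyPDF2 import PdfReader, PdfWriter


def page_has_table(txt):
    # Placeholder for the *condition to detect table* elided in the report.
    raise NotImplementedError


def pdfSplitter(filePath, OutputPath, start_page=0):
    file_name = filePath.replace("\\", "/").split("/")[-1].replace(".pdf", "")
    output_file_path = os.path.join(OutputPath, f"temp_only_tables_{file_name}.pdf")
    # Skip the work if the split file already exists.
    if os.path.exists(output_file_path):
        return {}

    writer = PdfWriter()
    dict_page = {}  # page index in the split PDF -> page index in the source PDF
    cpt = 0

    with open(filePath, "rb") as pdfFile:
        reader = PdfReader(pdfFile)
        totalPages = len(reader.pages)

        for i in range(start_page, totalPages):
            curPage = reader.pages[i]
            txt = curPage.extract_text()
            if page_has_table(txt):
                writer.add_page(curPage)
                dict_page[cpt] = i
                cpt += 1

        # Write before closing the source file, since the writer may still need it.
        with open(output_file_path, "wb") as outputFile:
            writer.write(outputFile)

    return dict_page


# `file` is the input PDF's file name; it is not defined in the report.
pdfSplitter(file, "Temp/")
pdf = pdfplumber.open(os.path.join("Temp", "temp_only_tables_" + file))
page = pdf.pages
txt_plumber = page[0].extract_table()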

PDF file

I can't share my PDFs, and they are too big to run through pdf-redactor (around 1000 pages).

Expected behavior

The table is extracted without any error.

Actual behavior

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\src\utils\extract.py:17, in extract_table_from_plumber(plumber, page_number)
     15 def extract_table_from_plumber(plumber, page_number):
     16     page = plumber.pages[page_number]
---> 17     table = page.extract_table()
     18     return table

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfplumber\page.py:467, in Page.extract_table(self, table_settings)
    463 def extract_table(
    464     self, table_settings: Optional[T_table_settings] = None
    465 ) -> Optional[List[List[Optional[str]]]]:
    466     tset = TableSettings.resolve(table_settings)
--> 467     table = self.find_table(tset)
    468     if table is None:
    469         return None

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfplumber\page.py:443, in Page.find_table(self, table_settings)
    439 def find_table(
    440     self, table_settings: Optional[T_table_settings] = None
    441 ) -> Optional[Table]:
    442     tset = TableSettings.resolve(table_settings)
--> 443     tables = self.find_tables(tset)
    445     if len(tables) == 0:
    446         return None

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfplumber\page.py:437, in Page.find_tables(self, table_settings)
    433 def find_tables(
    434     self, table_settings: Optional[T_table_settings] = None
    435 ) -> List[Table]:
    436     tset = TableSettings.resolve(table_settings)
--> 437     return TableFinder(self, tset).tables

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfplumber\table.py:569, in TableFinder.__init__(self, page, settings)
    567 self.page = page
    568 self.settings = TableSettings.resolve(settings)
--> 569 self.edges = self.get_edges()
    570 self.intersections = edges_to_intersections(
    571     self.edges,
    572     self.settings.intersection_x_tolerance,
    573     self.settings.intersection_y_tolerance,
    574 )
    575 self.cells = intersections_to_cells(self.intersections)

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfplumber\table.py:620, in TableFinder.get_edges(self)
    608         v_explicit.append(
    609             {
    610                 "x0": desc,
   (...)
    616             }
    617         )
    619 if v_strat == "lines":
--> 620     v_base = utils.filter_edges(self.page.edges, "v")
    621 elif v_strat == "lines_strict":
    622     v_base = utils.filter_edges(self.page.edges, "v", edge_type="line")

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfplumber\container.py:88, in Container.edges(self)
     86 if hasattr(self, "_edges"):
     87     return self._edges
---> 88 line_edges = list(map(utils.line_to_edge, self.lines))
     89 self._edges: T_obj_list = line_edges + self.rect_edges + self.curve_edges
     90 return self._edges

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfplumber\container.py:38, in Container.lines(self)
     36 @property
     37 def lines(self) -> T_obj_list:
---> 38     return self.objects.get("line", [])

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfplumber\page.py:329, in Page.objects(self)
    327 if hasattr(self, "_objects"):
    328     return self._objects
--> 329 self._objects: Dict[str, T_obj_list] = self.parse_objects()
    330 return self._objects

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfplumber\page.py:418, in Page.parse_objects(self)
    416 def parse_objects(self) -> Dict[str, T_obj_list]:
    417     objects: Dict[str, T_obj_list] = {}
--> 418     for obj in self.iter_layout_objects(self.layout._objs):
    419         kind = obj["object_type"]
    420         if kind in ["anno"]:

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfplumber\page.py:275, in Page.layout(self)
    269 device = PDFPageAggregatorWithMarkedContent(
    270     self.pdf.rsrcmgr,
    271     pageno=self.page_number,
    272     laparams=self.pdf.laparams,
    273 )
    274 interpreter = PDFPageInterpreter(self.pdf.rsrcmgr, device)
--> 275 interpreter.process_page(self.page_obj)
    276 self._layout: LTPage = device.get_result()
    277 return self._layout

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfminer\pdfinterp.py:997, in PDFPageInterpreter.process_page(self, page)
    995     ctm = (1, 0, 0, 1, -x0, -y0)
    996 self.device.begin_page(page, ctm)
--> 997 self.render_contents(page.resources, page.contents, ctm=ctm)
    998 self.device.end_page(page)
    999 return

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfminer\pdfinterp.py:1014, in PDFPageInterpreter.render_contents(self, resources, streams, ctm)
   1007 """Render the content streams.
   1008
   1009 This method may be called recursively.
   1010 """
   1011 log.debug(
   1012     "render_contents: resources=%r, streams=%r, ctm=%r", resources, streams, ctm
   1013 )
-> 1014 self.init_resources(resources)
   1015 self.init_state(ctm)
   1016 self.execute(list_value(streams))

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfminer\pdfinterp.py:387, in PDFPageInterpreter.init_resources(self, resources)
    385 elif k == "ColorSpace":
    386     for (csid, spec) in dict_value(v).items():
--> 387         colorspace = get_colorspace(resolve1(spec))
    388         if colorspace is not None:
    389             self.csmap[csid] = colorspace

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfminer\pdfinterp.py:370, in PDFPageInterpreter.init_resources.<locals>.get_colorspace(spec)
    368     name = literal_name(spec)
    369 if name == "ICCBased" and isinstance(spec, list) and 2 <= len(spec):
--> 370     return PDFColorSpace(name, stream_value(spec[1])["N"])
    371 elif name == "DeviceN" and isinstance(spec, list) and 2 <= len(spec):
    372     return PDFColorSpace(name, len(list_value(spec[1])))

File c:\Users\l069395\Documents\Projets\pecto-S.SEAT\pecto_code\.env\lib\site-packages\pdfminer\pdftypes.py:285, in PDFStream.__getitem__(self, name)
    284 def __getitem__(self, name: str) -> Any:
--> 285     return self.attrs[name]

KeyError: 'N'
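
The last two frames show where the lookup fails: for an ICCBased colour space, the code indexes the stream's attribute dictionary with "N" (the number of colour components), and no such key was available for this page. The sketch below only illustrates that failure mode with plain dicts. It does not use pdfminer.six's API, and the `icc_component_count` helper and its fallback to the `Alternate` entry are hypothetical, not something either library is known to do.

```python
# Plain dicts stand in for a PDF stream's attribute dictionary.
# This is NOT pdfminer.six's API, only an illustration of the failing lookup.

# Component counts for the device colour spaces an "Alternate" entry may name.
ALTERNATE_COMPONENTS = {"DeviceGray": 1, "DeviceRGB": 3, "DeviceCMYK": 4}


def icc_component_count(stream_attrs, default=None):
    """Return the number of colour components for an ICCBased stream dict.

    Prefers the "N" entry; falls back to "Alternate" when "N" is missing
    (a hypothetical recovery strategy), and finally to `default`.
    """
    if "N" in stream_attrs:
        return stream_attrs["N"]
    alternate = stream_attrs.get("Alternate")
    return ALTERNATE_COMPONENTS.get(alternate, default)


well_formed = {"N": 3, "Alternate": "DeviceRGB"}
missing_n = {"Alternate": "DeviceCMYK"}  # assumed shape of the problematic dict
empty = {}

# Direct indexing reproduces the kind of failure seen in the traceback.
try:
    missing_n["N"]
except KeyError as exc:
    print("direct lookup failed:", repr(exc))

print(icc_component_count(well_formed))
print(icc_component_count(missing_n))
print(icc_component_count(empty))
```

A defensive lookup like this would have to live inside the parsing library to help here, so it is a way of reading the traceback rather than a fix that can be applied from user code.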

Environment

jsvine commented 2 months ago

Hi @Readix1, and thanks for sharing this example. Based on the (very helpful) stack trace you shared, the error you encountered appears to stem from pdfminer.six, pdfplumber's main dependency. My guess is that it relates to the way PyPDF2 is creating the split files. Unfortunately, this means that there isn't much pdfplumber can do to resolve the issue, so I'm closing it for now.

Readix1 commented 2 months ago

Hi, thank you for your answer. I understand. I'll need to find another way, then.
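
One possible "other way", sketched under assumptions: keep the fast path through the split file, and fall back to the original PDF only for pages where the KeyError occurs. The split-to-original page mapping is the `dict_page` returned by `pdfSplitter` above. The function `extract_with_fallback` and the two extractor callables are hypothetical names introduced for illustration; in practice each would wrap a `page.extract_table()` call on the relevant file.

```python
from typing import Callable, Dict, List, Optional

Table = Optional[List[List[Optional[str]]]]


def extract_with_fallback(
    split_index: int,
    dict_page: Dict[int, int],
    extract_from_split: Callable[[int], Table],
    extract_from_original: Callable[[int], Table],
) -> Table:
    """Try the split PDF first; on KeyError, retry on the original PDF.

    `dict_page` maps a page index in the split file to the corresponding
    page index in the original file.
    """
    try:
        return extract_from_split(split_index)
    except KeyError:
        # Slow path: only pages that fail in the split file hit the original.
        return extract_from_original(dict_page[split_index])
```

This trades speed for robustness on the failing pages only, and it assumes the original file does not trigger the same error.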