jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.
https://borbpdf.com/

AssertionError: A Rectangle must have a non-negative width #182

Closed: DrPlanecraft closed this issue 10 months ago

DrPlanecraft commented 10 months ago

I am trying to load a PDF downloaded from arXiv.

from pprint import pprint
from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.pdf import PDF, SingleColumnLayout, ChunkOfText
from borb.pdf.canvas.line_art.line_art_factory import LineArtFactory
from borb.toolkit import SimpleLineOfTextExtraction, SimpleParagraphExtraction, ImageExtraction

ITE_SSL = "ITE_Study_Status_Letter.pdf"
MP1 = "mergePart_001.pdf"
MP2 = "mergePart_002.pdf"
PAPER = "Arxiv_Paper.pdf"
OUTPUT = 'output.pdf' 
R = "rb"
W = "wb"
LOT_EXTRACTION = SimpleLineOfTextExtraction()
PARA_EXTRACTION = SimpleParagraphExtraction()
with open(PAPER, R) as file:
    original = PDF.loads(file=file,event_listeners=[LOT_EXTRACTION, PARA_EXTRACTION])
    print()

page_count = int(original["XRef"]["Trailer"]["Root"]["Pages"]["Count"])
for index in range(page_count):
    paragraphs = tuple(i.get_text() for i in PARA_EXTRACTION.get_lines_of_text()[index])
    lines = tuple(i.get_text() for i in LOT_EXTRACTION.get_lines_of_text()[index])
    for para, line in zip(paragraphs, lines):
        print("Paragraph:", para)
        print("Line:", line)
        print()
    print()

Expected behaviour: I expect it to exit with no issues after printing out the differences between the SimpleLineOfTextExtraction and the SimpleParagraphExtraction output.

jorisschellekens commented 10 months ago

Found the bug.

in SimpleLineOfTextExtraction (line 78):

sorted(chunks_of_text, key=cmp_to_key(LeftToRightComparator.cmp))

should be

chunks_of_text = sorted(chunks_of_text, key=cmp_to_key(LeftToRightComparator.cmp))

Otherwise it's not really sorting the rendering instructions: sorted() returns a new list and leaves chunks_of_text unchanged, so the x-coordinates can end up out of order, which causes it to generate a Rectangle with a negative width.
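To illustrate the point, here is a minimal, standalone sketch that does not use borb. The dictionary-based chunks, `left_to_right_cmp` and `bounding_width` are hypothetical stand-ins, and the width calculation is only an assumption about how a line's bounding box might be derived from its first and last chunk; the actual borb code may differ.

```python
from functools import cmp_to_key

# Hypothetical stand-in for LeftToRightComparator.cmp: orders chunks by x.
def left_to_right_cmp(a, b):
    return (a["x"] > b["x"]) - (a["x"] < b["x"])

# Assumed width calculation: right edge of the last chunk minus left edge
# of the first chunk. Only meaningful if chunks are ordered left to right.
def bounding_width(chunks):
    return (chunks[-1]["x"] + chunks[-1]["width"]) - chunks[0]["x"]

chunks = [
    {"x": 120, "width": 30},
    {"x": 10, "width": 40},
    {"x": 60, "width": 50},
]

# Buggy pattern: the sorted copy is discarded, so `chunks` keeps its order.
sorted(chunks, key=cmp_to_key(left_to_right_cmp))
buggy_width = bounding_width(chunks)

# Fixed pattern: keep the list that sorted() returns.
ordered = sorted(chunks, key=cmp_to_key(left_to_right_cmp))
fixed_width = bounding_width(ordered)

print("width without reassignment:", buggy_width)
print("width with reassignment:", fixed_width)
```

Calling `chunks.sort(key=cmp_to_key(left_to_right_cmp))` would sort in place instead, which is another way to avoid discarding the result.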

I'm fixing this (and a similar occurrence of sorted) in the next release.

Kind regards, Joris Schellekens

DrPlanecraft commented 10 months ago

Thank you for the reply! I originally closed the issue because I found bugs in the code I provided.