jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.
https://borbpdf.com/

Fastest way to output a very long (>11k lines) paragraph? #126

Closed gettalong closed 2 years ago

gettalong commented 2 years ago

Hi there,

I'm trying to include borb in my line wrapping benchmark at https://hexapdf.gettalong.org/documentation/benchmarks/line_wrapping.html. The benchmark tests the automatic document layout facility of a library, laying out the text of Homer's Odyssey on pages with a height of 1000pt and widths of 400pt, 200pt, 100pt, and 50pt.

With the help of the examples and digging through the code itself, I managed to create a script that executes successfully but runs so long that I interrupted it. I'm not even sure at this point if borb automatically wraps boxes into new pages.

My first try was to add a single Paragraph(text) to the SingleColumnLayout, but when that took too long, I tried adding one Paragraph(line) object for each line of the input file; that still took too long. (Why did I do this? Because reportlab takes very long when presented with a single large paragraph but is fast when presented with individual lines.)

Here is my current code:

from borb.pdf.document.document import Document
from borb.pdf.page.page import Page
from borb.pdf.canvas.layout.page_layout.multi_column_layout import SingleColumnLayout
from borb.pdf.canvas.layout.page_layout.page_layout import PageLayout
from borb.pdf.canvas.layout.text.paragraph import Paragraph
from borb.pdf.pdf import PDF
from borb.pdf.canvas.font.simple_font.true_type_font import TrueTypeFont
import sys

font = 'Times-Roman'
if len(sys.argv) == 5:
    font = TrueTypeFont.true_type_font_from_file(sys.argv[4])

doc = Document()
page = Page(width=int(sys.argv[2]), height=1000)
doc.append_page(page)
layout = SingleColumnLayout(page, horizontal_margin=0, vertical_margin=0)

text = open(sys.argv[1], 'r').read()
L=list(map(str.strip, text.split('\n')))
for para in L:
    if not para:
        para = ':'
    print(para)
    layout.add(Paragraph(para, font=font, font_size=10))

with open(sys.argv[3], "wb") as handle:
    PDF.dumps(handle, doc)

Is slow text output (or performance in general) a known issue? Or am I just using borb in the wrong way?

jorisschellekens commented 2 years ago

I ran the following code:

import datetime
import io
import time
import typing
import unittest
from pathlib import Path

import requests

from borb.io.read.tokenize.high_level_tokenizer import HighLevelTokenizer
from borb.io.read.tokenize.low_level_tokenizer import Token
from borb.pdf import Document, Page, SingleColumnLayout, PageLayout, Paragraph, PDF

unittest.TestLoader.sortTestMethodsUsing = None

class TestTextWrappingPerformance(unittest.TestCase):

    def __init__(self, methodName="runTest"):
        super().__init__(methodName)
        # find output dir
        p: Path = Path(__file__).parent
        while "output" not in [x.stem for x in p.iterdir() if x.is_dir()]:
            p = p.parent
        p = p / "output"
        self.output_dir = Path(p, Path(__file__).stem.replace(".py", ""))
        if not self.output_dir.exists():
            self.output_dir.mkdir()

    def test_layout_odyssey(self):
        text: str = requests.get("https://www.gutenberg.org/files/1727/old/1727.txt").text

        # do this for the first 30Kb
        for i in range(1024, min(len(text), 1024 * 30), 1024):
            # create Document
            doc: Document = Document()

            # create Page
            page: Page = Page()
            doc.add_page(page)

            # create PageLayout
            layout: PageLayout = SingleColumnLayout(page)

            t0: float = time.time()
            lines: typing.List[str] = [x.strip() for x in text[0:i].split("\n")]
            for l in lines:
                if l == "":
                    l = ":"
                layout.add(Paragraph(l))
            t0 = time.time() - t0

            # print
            print("%d %f" % (i, t0))

            # write
            output_file: Path = self.output_dir / ("output_%d.pdf" % i)
            with open(output_file, "wb") as pdf_file_handle:
                PDF.dumps(pdf_file_handle, doc)

And got the following output (input size in bytes, layout time in seconds):

1024 1.440810
2048 2.322943
3072 3.198854
4096 4.423079
5120 4.974145
6144 6.643198
7168 7.606024
8192 8.244912
9216 10.091630
10240 10.234841
11264 11.271525
12288 11.813610
13312 12.456861
14336 13.384772
15360 14.486026
16384 15.536561
17408 16.595588
18432 17.577215
19456 18.679573
20480 19.528780
21504 20.546216
22528 21.519913
23552 22.527853
24576 23.601554
25600 24.617576
26624 25.684642
27648 26.677069
28672 27.648117
29696 28.649084
30720 29.697664
31744 30.793042
32768 31.930113
33792 32.983547
34816 33.866674
35840 35.088346
36864 36.339884

Does that correspond to your findings?

gettalong commented 2 years ago

If I read your code correctly, yes. Given that the file has about 682K bytes, it would take very long to render that file, orders of magnitude slower than the other libraries benchmarked.

jorisschellekens commented 2 years ago

I assume the performance of text wrapping in borb would indeed continue to scale linearly with the input size. Hence 682K bytes ought to take roughly 680 seconds, making borb 37 to 38 times slower than the slowest library you have tried so far.
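The estimate can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it uses the first and last measurements posted above, reads "682K bytes" as 682 * 1024 bytes, and assumes the timings keep growing linearly beyond the measured range. Under those assumptions the result lands close to the 680 seconds mentioned here.

```python
# Two of the measurements posted above: (bytes laid out, seconds taken).
first = (1024, 1.440810)
last = (36864, 36.339884)

# Slope of the straight line through both points, in seconds per byte.
seconds_per_byte = (last[1] - first[1]) / (last[0] - first[0])

# Assumption: "682K bytes" is read as 682 * 1024 bytes.
total_bytes = 682 * 1024

# Assumption: layout time keeps growing linearly beyond the measured range.
estimated_seconds = seconds_per_byte * total_bytes

print(f"{seconds_per_byte * 1024:.3f} s per KiB")
print(f"about {estimated_seconds:.0f} s for the whole file")
```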

I am not concerned about this at this point in time, for two reasons:

I am going to keep the test, and monitor performance. I may even investigate and profile the code. But as I mentioned before, it is currently not my priority.

Kind regards, Joris Schellekens

gettalong commented 2 years ago

Thanks for the explanation!

Linear scaling O(n) would actually be very good, so if you manage to improve the line wrapping algorithm itself, that would certainly help!

Even though one might not add 11k lines in one document, the performance of borb seems to be similar for the case where one would add 100 lines in 100 different documents. So this would also limit the usage of borb in batch processing, say, when creating invoices or salary statements en masse.

All the best!

jorisschellekens commented 2 years ago

Hi there,

I revisited the example code I showed you earlier. Come to think of it, it would make more sense (to improve speed at least) to pre-load the Font. By not specifying a Font in the Paragraph constructor, you are asking Paragraph to use its default (Helvetica). That's all fine and well, but it will keep loading the Font every single time you construct the Paragraph.
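To illustrate the effect, here is a minimal sketch of the pattern using stand-in classes. FakeFont and FakeParagraph are hypothetical and are not part of borb; they only mimic the behaviour described above, where a default font is loaded again for every paragraph unless one is passed in. With borb itself, the pre-loaded font would be passed through the font argument of Paragraph, as the first script in this thread already does.

```python
class FakeFont:
    """Hypothetical stand-in for a font object; not a borb class."""

    loads = 0  # counts how many times a font has been "loaded"

    def __init__(self, name):
        FakeFont.loads += 1  # stands in for the cost of loading the font
        self.name = name


class FakeParagraph:
    """Hypothetical stand-in for a paragraph; not borb's Paragraph."""

    def __init__(self, text, font=None):
        # Without an explicit font, a default one is loaded on every call.
        self.font = font if font is not None else FakeFont("Helvetica")
        self.text = text


lines = ["Tell me, O muse,", "of that ingenious hero", "who travelled far and wide"]

# Variant 1: rely on the default, so the font is loaded once per paragraph.
FakeFont.loads = 0
slow = [FakeParagraph(line) for line in lines]
loads_without_preload = FakeFont.loads

# Variant 2: load the font once and pass the same object to every paragraph.
FakeFont.loads = 0
shared_font = FakeFont("Helvetica")
fast = [FakeParagraph(line, font=shared_font) for line in lines]
loads_with_preload = FakeFont.loads

print(loads_without_preload, loads_with_preload)
```

The number of loads grows with the number of paragraphs in the first variant and stays at one in the second, which is why the saving matters most for inputs with thousands of lines.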

That, combined with the performance gains from the previous work on LayoutElement, gives me the following output:

1024    0.345911
2048    0.756442
3072    1.189494
4096    1.674580
5120    2.080195
6144    2.518857
7168    3.189230
8192    3.789248
9216    3.882730

I hope this is already somewhat more acceptable. It's certainly an improvement compared to the earlier numbers.

Kind regards, Joris Schellekens

gettalong commented 2 years ago

@jorisschellekens Great, that looks much better now, a 3x improvement!
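For reference, the speedup can be computed directly from the two sets of timings posted in this thread. The sketch below picks three input sizes that appear in both lists; the dictionary names are only illustrative. For these rows the factor comes out between about 2.6x and 4.2x, so "3x" is a fair summary.

```python
# Timings in seconds copied from the two result lists in this thread,
# keyed by input size in bytes.
before = {1024: 1.440810, 4096: 4.423079, 9216: 10.091630}
after = {1024: 0.345911, 4096: 1.674580, 9216: 3.882730}

# Speedup factor per input size: old time divided by new time.
speedups = {size: before[size] / after[size] for size in before}

for size, factor in speedups.items():
    print(f"{size:>5} bytes: {factor:.1f}x faster")
```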

gettalong commented 1 year ago

@jorisschellekens Hi there! I recently implemented table support in HexaPDF and thought about including borb in the table benchmark (https://github.com/gettalong/hexapdf/tree/devel/benchmark/table/).

I basically followed the table example from the borb-examples repository (https://github.com/jorisschellekens/borb-examples#321-fixedcolumnwidthtable). However, similar to the line wrapping benchmark, borb is very slow. It takes about 2 seconds to render a PDF with a table that has 3 columns and 10 rows, where the middle column contains an image (always the same one). This is about 20x slower than reportlab and 12x slower than fpdf2.

Just wanted to let you know, in case this is something you want to improve.