Closed gettalong closed 2 years ago
I ran the following code:
import datetime
import io
import time
import typing
import unittest
from pathlib import Path

import requests

from borb.io.read.tokenize.high_level_tokenizer import HighLevelTokenizer
from borb.io.read.tokenize.low_level_tokenizer import Token
from borb.pdf import Document, Page, SingleColumnLayout, PageLayout, Paragraph, PDF

unittest.TestLoader.sortTestMethodsUsing = None


class TestTextWrappingPerformance(unittest.TestCase):
    def __init__(self, methodName="runTest"):
        super().__init__(methodName)
        # find output dir
        p: Path = Path(__file__).parent
        while "output" not in [x.stem for x in p.iterdir() if x.is_dir()]:
            p = p.parent
        p = p / "output"
        self.output_dir = Path(p, Path(__file__).stem.replace(".py", ""))
        if not self.output_dir.exists():
            self.output_dir.mkdir()

    def test_layout_odyssey(self):
        text: str = requests.get("https://www.gutenberg.org/files/1727/old/1727.txt").text
        # do this for the first 30Kb
        for i in range(1024, min(len(text), 1024 * 30), 1024):
            # create Document
            doc: Document = Document()
            # create Page
            page: Page = Page()
            doc.add_page(page)
            # create PageLayout
            layout: PageLayout = SingleColumnLayout(page)
            t0: float = time.time()
            lines: typing.List[str] = [x.strip() for x in text[0:i].split("\n")]
            for l in lines:
                if l == "":
                    l = ":"
                layout.add(Paragraph(l))
            t0 = time.time() - t0
            # print
            print("%d %f" % (i, t0))
            # write
            output_file: Path = self.output_dir / ("output_%d.pdf" % i)
            with open(output_file, "wb") as pdf_file_handle:
                PDF.dumps(pdf_file_handle, doc)
And got the following output (input size in bytes, layout time in seconds):
1024 1.440810
2048 2.322943
3072 3.198854
4096 4.423079
5120 4.974145
6144 6.643198
7168 7.606024
8192 8.244912
9216 10.091630
10240 10.234841
11264 11.271525
12288 11.813610
13312 12.456861
14336 13.384772
15360 14.486026
16384 15.536561
17408 16.595588
18432 17.577215
19456 18.679573
20480 19.528780
21504 20.546216
22528 21.519913
23552 22.527853
24576 23.601554
25600 24.617576
26624 25.684642
27648 26.677069
28672 27.648117
29696 28.649084
30720 29.697664
31744 30.793042
32768 31.930113
33792 32.983547
34816 33.866674
35840 35.088346
36864 36.339884
Does that correspond to your findings?
If I read your code correctly, yes. Given that the file has about 682K bytes, it would take very long to render that file, orders of magnitude slower than the other libraries benchmarked.
I assume the performance of text wrapping in borb would indeed continue to scale linearly with the input size. Hence 682K bytes ought to take roughly 680 seconds, making borb 37 to 38 times slower than the currently slowest library you have tried.
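As a rough check of that estimate, here is a small sketch that fits a straight line to the timings listed above and extrapolates it. It assumes that "682K bytes" means about 682 KiB and that the linear trend holds far beyond the measured range, neither of which has been verified here.

```python
# Timings copied from the output above: seconds[k - 1] is the time
# needed to lay out the first k KiB of the text.
seconds = [
    1.440810, 2.322943, 3.198854, 4.423079, 4.974145, 6.643198,
    7.606024, 8.244912, 10.091630, 10.234841, 11.271525, 11.813610,
    12.456861, 13.384772, 14.486026, 15.536561, 16.595588, 17.577215,
    18.679573, 19.528780, 20.546216, 21.519913, 22.527853, 23.601554,
    24.617576, 25.684642, 26.677069, 27.648117, 28.649084, 29.697664,
    30.793042, 31.930113, 32.983547, 33.866674, 35.088346, 36.339884,
]
kib = list(range(1, len(seconds) + 1))

# Ordinary least-squares fit of: seconds = slope * kib + intercept
n = len(kib)
mean_x = sum(kib) / n
mean_y = sum(seconds) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(kib, seconds)) / sum(
    (x - mean_x) ** 2 for x in kib
)
intercept = mean_y - slope * mean_x

# Assumption: the whole file is about 682 KiB.
full_size_kib = 682
estimate = slope * full_size_kib + intercept

print("slope: %.3f s per KiB" % slope)
print("estimated time for %d KiB: %.0f s" % (full_size_kib, estimate))
```

The fitted slope comes out close to one second per KiB, which is where the figure of roughly 680 seconds comes from.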
I am not concerned about this at this point in time, for two reasons:
[...] borb with speed explicitly in mind. For me, ease of use and number of features have always been the priority. My goal is to make it as easy to create a PDF as it is to create a Microsoft Word document.
I am going to keep the test and monitor performance. I may even investigate and profile the code. But as I mentioned before, it is currently not my priority.
Kind regards, Joris Schellekens
Thanks for the explanation!
Linear scaling O(n) would actually be very good, so if you manage to improve the line wrapping algorithm itself, that would certainly help!
Even though one might not add 11k lines in one document, the performance of borb seems to be similar for the case where one would add 100 lines in 100 different documents. So this would also limit the usage of borb in batch processing, say, when creating invoices or salary statements en masse.
All the best!
Hi there,
I revisited the example code I showed you earlier.
Come to think of it, it would make more sense (to improve speed at least) to pre-load the Font.
By not specifying a Font in the Paragraph constructor, you are asking Paragraph to use its default (Helvetica).
That's all fine and well, but it will keep loading the Font every single time you construct the Paragraph.
That, combined with the performance gains from the previous work on LayoutElement, gives me the following output:
1024 0.345911
2048 0.756442
3072 1.189494
4096 1.674580
5120 2.080195
6144 2.518857
7168 3.189230
8192 3.789248
9216 3.882730
I hope this is already somewhat more acceptable. It's certainly an improvement compared to the earlier numbers.
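The pre-loading idea can be illustrated without borb at all. The sketch below is not borb's API: `load_font` and `make_paragraph` are hypothetical stand-ins for an expensive font load and for a constructor that falls back to a default font when none is passed. It only counts how often the load happens in each case.

```python
load_count = 0  # how many times the (simulated) expensive load has run


def load_font(name):
    """Hypothetical stand-in for an expensive font load."""
    global load_count
    load_count += 1
    # Pretend to build a table of glyph widths for printable ASCII.
    return {"name": name, "widths": {chr(c): 500 for c in range(32, 127)}}


def make_paragraph(text, font=None):
    """Hypothetical constructor: loads the default font if none is given."""
    if font is None:
        font = load_font("Helvetica")
    return (text, font)


lines = ["line %d" % i for i in range(1000)]

# Without pre-loading: every paragraph triggers its own font load.
load_count = 0
paragraphs = [make_paragraph(line) for line in lines]
loads_without_preload = load_count

# With pre-loading: load once, then pass the same object to every paragraph.
load_count = 0
helvetica = load_font("Helvetica")
paragraphs = [make_paragraph(line, font=helvetica) for line in lines]
loads_with_preload = load_count

print(loads_without_preload, loads_with_preload)
```

With borb itself, the equivalent change would presumably be to create the font object once and pass it to every Paragraph, but the exact class and parameter names should be checked against the borb documentation.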
Kind regards, Joris Schellekens
@jorisschellekens Great, that looks much better now, a 3x improvement!
@jorisschellekens Hi there! I recently implemented table support in HexaPDF and thought about including borb in the table benchmark (https://github.com/gettalong/hexapdf/tree/devel/benchmark/table/).
I basically followed the table example from the borb-examples repository (https://github.com/jorisschellekens/borb-examples#321-fixedcolumnwidthtable). However, as with the line wrapping benchmark, borb is very slow. It takes about 2 seconds to render a PDF with a table that has 3 columns and 10 rows, where the middle column contains an image (always the same one). This is about 20x slower than reportlab and 12x slower than fpdf2.
Just wanted to let you know, in case this is something you want to improve.
Hi there,
I'm trying to include borb in my line wrapping benchmark at https://hexapdf.gettalong.org/documentation/benchmarks/line_wrapping.html. The benchmark tests the automatic document layout facility of a library, laying out the text of Homer's Odyssey on pages with a height of 1000pt and widths of 400pt, 200pt, 100pt, and 50pt.
With the help of the examples and digging through the code itself, I managed to create a script that executes successfully but runs so long that I interrupted it. I'm not even sure at this point if borb automatically wraps boxes into new pages.
My first try was to add a single Paragraph(text) to the SingleColumnLayout, but when it took too long, I tried using Paragraph(line) objects, one for each line of the input file; that still took too long. (Why did I do this? Because reportlab takes very long when presented with a single large paragraph but is fast when presented with individual lines.)
Here is my current code:
Is slow text output (or performance in general) a known issue? Or am I just using borb in the wrong way?