jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.
https://borbpdf.com/

`from_pandas_dataframe` is very slow #212

Open kerim371 opened 3 weeks ago

kerim371 commented 3 weeks ago

Hi,

Thank you for the library, it is very easy to use!

I'm trying to write a pandas.DataFrame with n*1000 rows using the following code:

table = TableUtil.from_pandas_dataframe(df, flexible_column_width=False, font_size=Decimal(8), round_to_n_digits=2)

but it is incredibly slow.

I added a print(i) here and noticed that it starts out fairly fast but gets slower and slower; after about 100 iterations it becomes unbearably slow.

Is there a way to increase the speed of table creation?

jorisschellekens commented 3 weeks ago

You can try:

I'm curious about the first option though. I wonder why the time complexity would not behave linearly.

Kind regards, Joris Schellekens

kerim371 commented 3 weeks ago

@jorisschellekens hi,

Thank you for the response.

Splitting the dataframe works and that is the current solution that I use.
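
For anyone landing here, a minimal sketch of the splitting workaround. The helper names chunk_bounds and split_frame are made up for illustration and are not part of borb or pandas; the only pandas feature assumed is row slicing with df.iloc.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) row ranges that cover n_rows in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        # The last range is clipped so it never runs past the final row.
        yield start, min(start + chunk_size, n_rows)


def split_frame(df, chunk_size):
    """Return consecutive row slices of a pandas DataFrame (hypothetical helper)."""
    return [df.iloc[start:stop] for start, stop in chunk_bounds(len(df), chunk_size)]
```

Each slice can then be passed to TableUtil.from_pandas_dataframe exactly as in the first post, giving one small table per chunk. What chunk size works best is something to measure; the thread only says that things get slow after roughly 100 iterations.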

About the profiling, I have figured out that the slowest part is t.add(p). I haven't investigated any further.
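
A rough sketch of how a hot spot like this can be located with the standard library profiler. The build and add_cell functions are stand-ins invented for this example (they only imitate an add that gets slower as the container grows) and say nothing about how borb is implemented; profile_call is likewise a hypothetical helper.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def add_cell(cells, value):
    # Stand-in for an expensive add: walks every existing cell before appending,
    # so the total work grows quadratically with the number of cells.
    _ = sum(1 for _ in cells)
    cells.append(value)


def build(n):
    # Stand-in for the table-building loop.
    cells = []
    for i in range(n):
        add_cell(cells, i)
    return cells


if __name__ == "__main__":
    _, report = profile_call(build, 2000)
    print(report)
```

To profile the real thing, the call to build would be replaced by the TableUtil.from_pandas_dataframe call from the first post; the report then lists which functions underneath t.add(p) account for the time.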

Is it possible to make a list of cells and then add those cells at once? Like:

p_list = [p1, p2, p3,...]
t.add(p_list)

jorisschellekens commented 1 week ago

You can certainly create a list of LayoutElement objects ahead of time, but there is no method to add them all in bulk.
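
In other words, the list still has to be fed in one element at a time. A minimal sketch, assuming only that the container has an add method taking a single element, as in the t.add(p) call quoted above; add_all and RecordingTable are hypothetical names used for illustration.

```python
def add_all(container, elements):
    """Call container.add once per element, in order, and return the container."""
    for element in elements:
        container.add(element)
    return container


class RecordingTable:
    """Stand-in for a table: it only records what was added (not a borb class)."""

    def __init__(self):
        self.items = []

    def add(self, element):
        self.items.append(element)
        return self


if __name__ == "__main__":
    t = add_all(RecordingTable(), ["p1", "p2", "p3"])
    print(len(t.items))
```

This is only a convenience loop: every element still goes through add, so it does not make the individual calls any cheaper.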

kerim371 commented 1 week ago

Thank you for the response. OK, splitting the DataFrame works pretty well. That is enough for now.