Poor performance when creating a big presentation

wandererfrog commented 4 years ago

We have an application creating large pptx's with over 1000 slides and we are using python-pptx library.

The problem we have is that, as the Presentation grows it becomes slower to add Elements and/or charts to it.

from pptx import Presentation
from pptx.chart.data import CategoryChartData
from pptx.enum.chart import XL_CHART_TYPE
from pptx.util import Inches

SLD_LAYOUT_TITLE_AND_CONTENT = 1

prs = Presentation()

slide_layout = prs.slide_layouts[SLD_LAYOUT_TITLE_AND_CONTENT]
for idx in range(2000):
    slide = prs.slides.add_slide(prs.slide_layouts[5])

    chart_data = CategoryChartData()
    chart_data.categories = ['East', 'West', 'Midwest']
    chart_data.add_series('Series 1', (19.2, 21.4, 16.7))

    x, y, cx, cy = Inches(2), Inches(2), Inches(6), Inches(4.5)
    slide.shapes.add_chart(
    XL_CHART_TYPE.COLUMN_CLUSTERED, x, y, cx, cy, chart_data
    )

    print(str(idx))

prs.save('test.pptx')

I wonder if anyone has come across this situation before? It seems that pptx-python has to lookup inside the Presentation thus making it slower per iteration. Or is it the way we are using python to loop and load the variables into memory?

scanny commented 4 years ago

It looks like the bottleneck is here: https://github.com/scanny/python-pptx/blob/master/pptx/opc/package.py#L105-L117 where the part-name for the new chart (and slide) is being generated.

This could be optimized by caching the last part number assigned and simply incrementing it. It would probably need to be a dictionary like: {"slide": 27, "chart": 18, ...} because each part-types has its own sequence, i.e. there can be both a slide42.xml and a chart42.xml but not two chart15.xml parts. This cache could be stored as an instance variable of the Package object which I believe is a singleton for each presentation.

Quicker and dirtier might be to just assign a GUID instead of the number. That wouldn't involve cacheing. It would make your partnames pretty ugly but those only appear to a developer doing diagnostics for the most part. I believe the only actual constraint is that they be unique.

You're probably going to need to patch a fork and use that if you're in any kind of a rush.

MartinPacker commented 4 years ago

Sometimes with these sorts of things inserting backwards works. i.e. from last slide to first, rather than the other way around.

Does it in this case?

Cheers, Martin

Martin Packer

Systems Investigator & Performance Troubleshooter, IBM

+44-7802-245-584

email: martin_packer@uk.ibm.com

Twitter / Facebook IDs: MartinPacker

Blog: https://mainframeperformancetopics.com

Mainframe, Performance, Topics Podcast Series (With Marna Walle): https://anchor.fm/marna-walle

Youtube channel: https://www.youtube.com/channel/UCu_65HaYgksbF6Q8SQ4oOvA

From: Steve Canny notifications@github.com To: scanny/python-pptx python-pptx@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Date: 01/09/2020 19:31 Subject: [EXTERNAL] Re: [scanny/python-pptx] Poor performance when creating a big presentation (#644)

It looks like the bottleneck is here: https://github.com/scanny/python-pptx/blob/master/pptx/opc/package.py#L105-L117 where the part-name for the new chart (and slide) is being generated. This could be optimized by caching the last part number assigned and simply incrementing it. It would probably need to be a dictionary like: {"slide": 27, "chart": 18, ...} because each part-types has its own sequence, i.e. there can be both a slide42.xml and a chart42.xml but not two chart15.xml parts. This cache could be stored as an instance variable of the Package object which I believe is a singleton for each presentation. Quicker and dirtier might be to just assign a GUID instead of the number. That wouldn't involve cacheing. It would make your partnames pretty ugly but those only appear to a developer doing diagnostics for the most part. I believe the only actual constraint is that they be unique. You're probably going to need to patch a fork and use that if you're in any kind of a rush. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number

Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

scanny commented 4 years ago

I don't think so @MartinPacker, this appears to be a clear O(N^2) problem to me. It's usually not a problem (never reported before) because your typical PPTX is well less than 100 slides.

The original code recalculates the next available part "number" each time, which is the most robust way to do it since in general the only reliable state is the contents of the underlying XML. Proxy objects (that manipulate that XML) come and go and in general are not stateful.

The exception to the coming and going characteristic is the Package object which is the same the entire time the PPTX file is in memory. So I'd consider it safe enough to cache the next available partname there where it can simply be incremented each time.

scanny / python-pptx

Poor performance when creating a big presentation #644