scanny / python-pptx

Create Open XML PowerPoint documents in Python
MIT License

Delete slides #956

Open luchux opened 3 months ago

luchux commented 3 months ago

I managed to keep only the slides from from_page to to_page (i.e. from_slide_number, to_slide_number) and remove the rest. It works. The problem I'm facing is that, although I remove the relationships, the trimmed PPTX file is the same size in MB as the original. I can't find a way to reclaim the space used by the unlinked elements. If anybody could give me a hint, it would be appreciated!

def _keep_slides_from_to(presentation, from_page, to_page):
    """Keep slides at 0-based positions from_page..to_page (inclusive); remove the rest."""
    # Note: _sldIdLst and _rels are private python-pptx attributes
    xml_slides = presentation.slides._sldIdLst
    slides = list(xml_slides)  # snapshot, so removals don't disturb iteration
    rels = presentation.part.rels

    # Slide-id elements whose position falls outside the range to keep
    slides_to_remove = [
        slide_id
        for pos, slide_id in enumerate(slides)
        if pos < from_page or pos > to_page
    ]

    for slide_id in slides_to_remove:
        # Unlink the slide from the presentation's slide list
        xml_slides.remove(slide_id)
        # Remove the corresponding relationship
        rels._rels.pop(slide_id.rId)

    return presentation

MartinPacker commented 3 months ago

It strikes me that the code hasn't actually removed any data, other than some of the XML. Essentially you've orphaned some parts.

I don't believe there's a general API for removing unwanted parts such as graphics from what is, after all, a zip file.

That is something I'd like to see - within python-pptx.
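
Since a .pptx file is a zip archive, one way to see where the megabytes are going is to list its members by size. The following is a minimal sketch using only the standard library; the function name `largest_parts` and the file name in the usage comment are made up for illustration and are not part of python-pptx.

```python
import zipfile


def largest_parts(pptx_file, top=10):
    """Return (name, compressed_size) pairs for the biggest zip members."""
    with zipfile.ZipFile(pptx_file) as zf:
        sizes = [(info.filename, info.compress_size) for info in zf.infolist()]
    # Largest first; ties broken by name so the order is deterministic
    return sorted(sizes, key=lambda item: (-item[1], item[0]))[:top]


# Hypothetical usage:
# for name, size in largest_parts("trimmed.pptx"):
#     print(f"{size:>12}  {name}")
```

If the large entries turn out to be under ppt/media/, the size is probably coming from images or other media rather than from the slide XML itself.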

scanny commented 3 months ago

@MartinPacker if you remove a relationship I believe you'll find that the orphaned part is not saved. Also, if you don't have or retain a reference to the part instance then it will be garbage collected.

So there's no real reason to somehow destroy an orphaned part; that should take care of itself.
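
One way to check this on a concrete file is to compare the member names of the original and the trimmed package after saving. This is only a sketch using the standard library; `parts_dropped` and the file names in the usage comment are hypothetical, and it assumes part names are not renumbered on save, which may not hold for every tool.

```python
import zipfile


def parts_dropped(original_file, trimmed_file):
    """Return names of zip members present in the original but not in the trimmed file."""
    with zipfile.ZipFile(original_file) as zf:
        before = set(zf.namelist())
    with zipfile.ZipFile(trimmed_file) as zf:
        after = set(zf.namelist())
    return sorted(before - after)


# Hypothetical usage:
# print(parts_dropped("original.pptx", "trimmed.pptx"))
```

If the removed slides show up in the result but the large media files do not, something else in the package is presumably still related to those media parts.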

One possible problem though is when a part is the target of relationships from more than one other part. I vaguely remember this being the case in some instances, like maybe a slide-layout being referenced by a slide-master and also by the slides that use it. So you would need to remove all those relationships to actually orphan the slide-layout part.

Other parts, images in particular, can be referenced by multiple parts on purpose. If you rubber-stamped copies of an image, maybe a logo, onto say 20 different slides for visual effect, that image should only be stored once, even though there are 20 relationships to it from other parts (slides or slide-layouts maybe in this case).
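
To see which parts are shared in this way, the relationship files inside a saved package can be inspected directly. Below is a rough sketch that counts, for each part, how many internal relationships point at it. It uses only the standard library and reads the `.rels` members of the zip; the name `inbound_relationship_counts` is made up for illustration and is not a python-pptx API.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

REL_TAG = "{http://schemas.openxmlformats.org/package/2006/relationships}Relationship"


def inbound_relationship_counts(pptx_file):
    """Count, per target part, how many internal relationships point at it."""
    counts = Counter()
    with zipfile.ZipFile(pptx_file) as zf:
        for name in zf.namelist():
            if not name.endswith(".rels"):
                continue
            # "ppt/slides/_rels/slide1.xml.rels" belongs to a part in "ppt/slides"
            base_dir = posixpath.dirname(posixpath.dirname(name))
            root = ET.fromstring(zf.read(name))
            for rel in root.iter(REL_TAG):
                if rel.get("TargetMode") == "External":
                    continue  # e.g. hyperlinks; these do not target a package part
                target = rel.get("Target")
                if target.startswith("/"):
                    resolved = target.lstrip("/")
                else:
                    resolved = posixpath.normpath(posixpath.join(base_dir, target))
                counts[resolved] += 1
    return counts


# Hypothetical usage:
# for part, n in inbound_relationship_counts("trimmed.pptx").most_common(10):
#     print(n, part)
```

A part with a count above 1 stays in the saved file until every relationship that targets it has been removed.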

MartinPacker commented 3 months ago

Thank you @scanny; I wasn't aware of that.