scanny / python-pptx

Create Open XML PowerPoint documents in Python
MIT License
2.39k stars 518 forks source link

ValueError: value must be in range 256 to 2147483647 inclusive, got 2147483648 - when adding slide #972

Closed sasensi closed 2 months ago

sasensi commented 4 months ago

Hello and first of all, thank you for the great work on this library which helps me a ton ! I'm using this library in order to write presentations based on existing presentations that user upload. So from time to time, I discoverer surprising edge cases like this one.

Here's the most reduced case that I could produce:

input_file = Path(file).parent / "_debug.pptx" presentation = Presentation(input_file) layout = presentation.slide_layouts[0] for _i in range(4): presentation.slides.add_slide(layout)

- here is the error:

Traceback (most recent call last): File "H:\workspaces\xd2sketch\slidespeak-fastapi\backend\app\new_pptx_writer_reproduce.py", line 9, in presentation.slides.add_slide(layout) File "H:\workspaces\xd2sketch\slidespeak-fastapi\venv\Lib\site-packages\pptx\slide.py", line 283, in add_slide self._sldIdLst.add_sldId(rId) File "H:\workspaces\xd2sketch\slidespeak-fastapi\venv\Lib\site-packages\pptx\oxml\presentation.py", line 56, in add_sldId return self._add_sldId(id=self._next_id, rId=rId) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "H:\workspaces\xd2sketch\slidespeak-fastapi\venv\Lib\site-packages\pptx\oxml\xmlchemy.py", line 303, in _add_child setattr(child, key, value) File "H:\workspaces\xd2sketch\slidespeak-fastapi\venv\Lib\site-packages\pptx\oxml\xmlchemy.py", line 268, in set_attr_value str_value = self._simple_type.to_xml(value) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "H:\workspaces\xd2sketch\slidespeak-fastapi\venv\Lib\site-packages\pptx\oxml\simpletypes.py", line 26, in to_xml cls.validate(value) File "H:\workspaces\xd2sketch\slidespeak-fastapi\venv\Lib\site-packages\pptx\oxml\simpletypes.py", line 605, in validate cls.validate_int_in_range(value, 256, 2147483647) File "H:\workspaces\xd2sketch\slidespeak-fastapi\venv\Lib\site-packages\pptx\oxml\simpletypes.py", line 56, in validate_int_in_range raise ValueError( ValueError: value must be in range 256 to 2147483647 inclusive, got 2147483648


---

I investigated a bit and the issue happens here: https://github.com/scanny/python-pptx/blob/d043334b984736a7a2ade3fb6f9adcdd97b3e8f5/pptx/oxml/presentation.py#L67

What happens is that the value of `id_str_lst` in this case is `['2147483644']` and this serves as a base for next id calculation.
So ultimately, the next id ends up being over the maximal accepted value and produces the error.

I also saw that this part of the code was last modified here: https://github.com/scanny/python-pptx/commit/1fca0922891ee72dda72197a456ff37b494e2a58

So I tried monkey patching the method using the previous implementation and this fixes the issue:
```python
def _next_id_patched(self) -> int:
    id_str_lst = self.xpath("./p:sldId/@id")
    used_ids = [int(id_str) for id_str in id_str_lst]
    for n in range(256, 258 + len(used_ids)):
        if n not in used_ids:
            return n

CT_SlideIdList._next_id = property(_next_id_patched)

I'm wondering if there is maybe a way to mix both solutions to support even more cases like this one ? Or do you have any other idea about how to fix this (I can eventually help with the PR).

scanny commented 4 months ago

Hmm, interesting.

So this is slide-ids, right, which means there will usually be a modest number, say less that 100, and as many as 1000 would be extremely rare. So a linear-time algorithm is likely to be fast enough.

The max(sldIds) + 1 approach reduces the chance of using a deleted slide id. I vaguely remember there being a problem with that but can't remember it specifically. Perhaps someone was deleting slides and then a new slide was being assigned the same id and that caused some sort of problem.

So the max(sldIds) + 1 approach is probably preferable, but is fallible as we see here. So there could be a test and fallback to the next_available_id_gt_256 algorithm and that should solve this problem at least.

I'd say something like this would be a good approach and more efficient than the original:

used_ids = sorted(int(id_str) for id_str in self.xpath("./p:sldId/@id"))
id_count = len(used_ids)
for idx, n in enumerate(range(256, 256+id_count)):
    if used_ids[idx] != n:
            return n
return 256+id_count
case 1: all slide-ids are used in order
---------------------------------------
used_ids = [256, 257, 258, 259]
      ns = [256, 257, 258, 259]
matches all, returns 256+4=260

case 2: there is a gap in slide-ids
-----------------------------------
used_ids = [256, 258, 259, 260]
      ns = [256, 257, 258, 259]
mismatch on used_ids[1], returns n=257

It's not a lot more efficient, but at least is O(NlogN) instead of O(N^2).

MartinPacker commented 4 months ago

Is the sort going to happen frequently, @scanny? In which case maintaining a sorted list might be cheaper.

scanny commented 4 months ago

It could potentially happen frequently, if you were adding thousands of slides for example.

The problem with cacheing values like that is that the XML is the actual authoritative source and the two could easily become out-of-sync. So there's no alternative to reading in the list every time I'm afraid. For example where would you store the cached list? A class has module lifetime so if you loaded more than one presentation on a single run it would be wrong. And an instance like Slide or even Presentation is just a proxy. You can easily have more than one Slide instance pointing to the same slide element subtree in the XML. So there's no reliable location for this sort of thing other than the XML.

Also, the fall-back algorithm (which is what requires sorting) would only be engaged when the max()+1 algorithm failed, which has taken these ten years to show up for the first time. So I don't expect we'd get any complaints :)

sasensi commented 4 months ago

Thank you for your insightful answers, seems like a combination of the 2 approaches is the ultimate solution 😄 So do you think that this can be fixed as part of the library in a soon future or should I better go with a monkey patch for now ? As mentioned initially, I can also work on a PR if it helps ?

scanny commented 4 months ago

I think a monkey-patch is a great first step. The code would be the same as in a PR and it would be a cut-and-paste solution anyone who was facing this problem could add to their own code for an immediate solution.

A PR is probably not worth the trouble I expect. Although if you wanted to add the tests for it that might be worthwhile. Generally all a PR does for me is give me some idea of an approach that appears to work as a starting point for the actual implementation. To maintain the industrial-grade quality of the package any change needs to go in with full tests and documentation where required, and PRs generally lack those broader software engineering aspects.

sasensi commented 4 months ago

Ok, thank you, so I went with this monkey patch that tries the current implementation and fallback to the old one (your improved version):

from pptx.oxml import CT_SlideIdList
from pptx.oxml.simpletypes import ST_SlideId

def _next_id_patched(self) -> int | None:  # noqa: ANN001
    id_str_lst = self.xpath("./p:sldId/@id")
    next_id = max([255] + [int(id_str) for id_str in id_str_lst]) + 1
    if ST_SlideId.validate(next_id):
        return next_id
    used_ids = sorted(int(id_str) for id_str in id_str_lst)
    id_count = len(used_ids)
    for idx, n in enumerate(range(256, 256 + id_count)):
        if used_ids[idx] != n:
            return n
    return 256 + id_count

CT_SlideIdList._next_id = property(_next_id_patched)  # noqa: SLF001

Do you prefer to keep the issue opened or closed ?

scanny commented 4 months ago

Let's keep it open @sasensi. We might add this in a later release and seems maybe easier for folks to find an open issue than a closed one if they're seeing this problem :)

ankitjpec commented 3 months ago

having the same issue - can you share where did you update this code as i cant seem to get this error fixed.

sasensi commented 3 months ago

having the same issue - can you share where did you update this code as i cant seem to get this error fixed.

In my case, I made sure to run the code posted here: https://github.com/scanny/python-pptx/issues/972#issuecomment-2109434288 to patch the library before I use it.
And then the error was gone.

ankitjpec commented 3 months ago

that works ! thanks.

scanny commented 3 months ago

Any forensics on this? Like an idea how these particular presentations get slides with large ids like this?

I'm vaguely suspecting some plugin that assigns ids starting from the maximum number downward as a simple-minded way to avoid collisions with existing slides.

Can you describe the provenance of PPTX files where you've noticed this behavior?

scanny commented 3 months ago

I'm advancing this to a shortlist item given it appears it's not as much of a fluke as originally suspected :)

ankitjpec commented 3 months ago

Any forensics on this? Like an idea how these particular presentations get slides with large ids like this?

I'm vaguely suspecting some plugin that assigns ids starting from the maximum number downward as a simple-minded way to avoid collisions with existing slides.

Can you describe the provenance of PPTX files where you've noticed this behavior?

i spent an hr trying to read the ID's and found related reason is that if you have a lot of master slides then an old ppt can have indexes that go beyond the max limit. manual copy paste of slides from 1 ppt to another doesnt refresh the slide ID and you really need VBA code to reset the slide ID numbers. even if you only have 5 slides in the deck but 10 master slides the indexing gets screwed up. make sense?

alainpannetier commented 3 weeks ago

Any forensics on this? Like an idea how these particular presentations get slides with large ids like this? I'm vaguely suspecting some plugin that assigns ids starting from the maximum number downward as a simple-minded way to avoid collisions with existing slides. Can you describe the provenance of PPTX files where you've noticed this behavior?

i spent an hr trying to read the ID's and found related reason is that if you have a lot of master slides then an old ppt can have indexes that go beyond the max limit. manual copy paste of slides from 1 ppt to another doesnt refresh the slide ID and you really need VBA code to reset the slide ID numbers. even if you only have 5 slides in the deck but 10 master slides the indexing gets screwed up. make sense?

Yes. Patching the template sldId works and you don't have to patch the library. Incidentally, I now understand why Microsoft Powerpoint also refuses to add/duplicate slides in some cases (no message from MS ppt). Thx all.

scanny commented 3 weeks ago

@alainpannetier this problem should be fixed in the latest version v1.0.2. Let me know if you're still seeing a problem. I can see why you might still want to update the slide-ids though :)

Interesting observation about PowerPoint sometimes not allowing more slides to be added.

alainpannetier commented 3 weeks ago

Thanks. @scanny Here is a couple of quick and dirty scripts to check and to reset (rebase) the Slide-Ids from 256 onwards in a given pptx presentation.

pptx-slideid-hacks.zip

run:

#!/usr/bin/env python3

import argparse
import traceback
from io import BytesIO
from zipfile import ZipFile, ZIP_DEFLATED
import lxml.etree as ET

xslt_src = """<xsl:stylesheet
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    >

    <xsl:output method="xml" />
    <xsl:strip-space elements="*"/>

    <xsl:key name="mapping" match="/p:presentation/p:sldIdLst/p:sldId" use="@id"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="/p:presentation/p:sldIdLst/p:sldId">
        <p:sldId id="{256 + count(key('mapping', @id)/preceding-sibling::*)}" r:id="{@r:id}"/>
    </xsl:template>

    <!-- reuse same new ids in sections -->
    <xsl:template match="*[local-name()='sldId' and not (name()=p:sldId)]">
        <p:sldId id="{256 + count(key('mapping', @id)/preceding-sibling::*)}"/>
    </xsl:template>

</xsl:stylesheet>"""

def parse_arguments():
    parser = argparse.ArgumentParser(description='Renumbers slide Ids of a pptx Powerpoint presentation')
    parser.add_argument('-i', '--input',  help='Powerpoint input file',  dest='input',  required=True)
    parser.add_argument('-o', '--output', help='Powerpoint output file', dest='output', required=True)

    return parser.parse_args()

def read_zip(file_name):
    bio = BytesIO(open(file_name, 'rb').read())
    archive = ZipFile(bio, 'r')
    content = {n: archive.read(n) for n in archive.namelist()}
    archive.close()

    return content

def write_zip(file_name, content):
    bio = BytesIO()
    archive = ZipFile(bio, 'w', ZIP_DEFLATED)
    for name, data in content.items():
        archive.writestr(name, data)
    archive.close()
    open(file_name, 'wb').write(bio.getvalue())

def reset_slide_ids(content):
    input_dom = ET.fromstring(content['ppt/presentation.xml'])
    transformer = ET.XSLT(ET.fromstring(xslt_src))
    content['ppt/presentation.xml'] = ET.tostring(transformer(input_dom), encoding='utf-8')

    return content

def main():
    try:
        args = parse_arguments()
    except Exception:
        print("Error parsing command line")
        traceback.print_exc()
        return

    try:
        write_zip(args.output, reset_slide_ids(read_zip(args.input)))
    except Exception:
        print("Error rewriting zipfile")
        traceback.print_exc()
        return

if __name__ == "__main__":

    main()