plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
Apache License 2.0

lxml triggers: <cyfunction _Attrib.get at ...> returned a result with an exception set #649

Open mbollmann opened 11 months ago

mbollmann commented 11 months ago

Describe the bug

I am trying to profile one of my own libraries with Scalene, but I reproducibly run into an exception that I don't understand:

Error in program being profiled:
 <cyfunction _Attrib.get at 0x7f88c605c670> returned a result with an exception set
TypeError: 'NoneType' object cannot be interpreted as an integer

The above exception was the direct cause of the following exception:

SystemError: <class 'bytearray'> returned a result with an exception set

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bollmann/.cache/pypoetry/virtualenvs/acl-anthology-py-Csl04KZN-py3.11/lib64/python3.11/site-packages/scalene/scalene_profiler.py", line 1857, in profile_code
    exec(code, the_globals, the_locals)
  File "/home/bollmann/repos/acl-anthology-py/mytest_scalene.py", line 2, in <module>
    Anthology(datadir="tests/toy_anthology").people.build()
  File "/home/bollmann/repos/acl-anthology-py/acl_anthology/people/index.py", line 150, in build
    for volume in collection:
  File "/home/bollmann/repos/acl-anthology-py/acl_anthology/collections/collection.py", line 48, in __iter__
    self.load()
  File "/home/bollmann/repos/acl-anthology-py/acl_anthology/collections/collection.py", line 92, in load
    current_volume._add_paper_from_xml(element)
  File "/home/bollmann/repos/acl-anthology-py/acl_anthology/collections/volume.py", line 165, in _add_paper_from_xml
    paper = Paper.from_xml(self, element)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bollmann/repos/acl-anthology-py/acl_anthology/collections/paper.py", line 183, in from_xml
    if (ingest_date := paper.attrib.get("ingest-date")) is not None:
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SystemError: <cyfunction _Attrib.get at 0x7f88c605c670> returned a result with an exception set

In the line that triggers the exception, paper is an lxml.etree._Element. Without Scalene, the program runs without errors.
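For readers unfamiliar with chained tracebacks, the report above reads from innermost to outermost: the TypeError comes first and each "direct cause" line adds a wrapper. The following is a minimal sketch of how such a chain can be walked through `__cause__`. It does not reproduce the bug. `cause_chain` and `simulated_failure` are hypothetical helpers, and the messages are shortened stand-ins for the ones in the report.

```python
def cause_chain(exc):
    """Return exc followed by every exception reachable through __cause__."""
    chain = []
    while exc is not None:
        chain.append(exc)
        exc = exc.__cause__
    return chain


def simulated_failure():
    """Hypothetical stand-in that only mimics the shape of the reported chain."""
    try:
        try:
            raise TypeError("'NoneType' object cannot be interpreted as an integer")
        except TypeError as inner:
            # "raise ... from ..." sets __cause__, which produces the
            # "direct cause" wording seen in the traceback.
            raise SystemError("bytearray returned a result with an exception set") from inner
    except SystemError as middle:
        raise SystemError("_Attrib.get returned a result with an exception set") from middle


try:
    simulated_failure()
except SystemError as err:
    # Outermost exception first, root cause last.
    names = [type(e).__name__ for e in cause_chain(err)]
    print(names)
```

In the real report the chain has the same shape: the outermost SystemError is the one raised at the profiled line, and the TypeError at the root is the likely starting point for debugging.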

To Reproduce

See the comment below for a minimal working example.

Steps to reproduce the behavior:

  1. Clone https://github.com/mbollmann/acl-anthology-py.git

  2. Save the following script as test.py:

    from acl_anthology import Anthology
    Anthology(datadir="tests/toy_anthology").people.build()

  3. Run poetry run scalene test.py

  4. Most of the time, this triggers the exception. (Occasionally I don't see the exception on the terminal, but the profile data on the web GUI looks just as incomplete as when the exception comes up.)

~I've tried to get a MWE that triggers the exception, but wasn't successful yet. I can try more to isolate the XML parsing parts of my library to see if I can find one.~ See comment below for a minimal working example

Expected behavior

The program should run under Scalene without raising an exception, just as it does when run without Scalene.


mbollmann commented 11 months ago

Minimal working example:

from lxml import etree

filename = "/home/bollmann/repos/acl-anthology-py/tests/toy_anthology/xml/2022.acl.xml"

for event, element in etree.iterparse(filename):
    if element.tag == "paper":
        pass

Inspecting an attribute of the returned element seems to be crucial for triggering an exception.

Instructions

  1. Download this XML file: https://github.com/mbollmann/acl-anthology-py/blob/main/tests/toy_anthology/xml/2022.acl.xml (I suspect any sufficiently long XML file works, but not every XML file triggers the exception.)
  2. Run the above script with Scalene (adapting the filename string to your machine).
  3. I get this:
Error in program being profiled:
 <class 'StopIteration'> returned a result with an exception set
TypeError: 'NoneType' object cannot be interpreted as an integer

The above exception was the direct cause of the following exception:

SystemError: <class 'bytearray'> returned a result with an exception set

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bollmann/.cache/pypoetry/virtualenvs/acl-anthology-py-Csl04KZN-py3.11/lib64/python3.11/site-packages/scalene/scalene_profiler.py", line 1857, in profile_code
    exec(code, the_globals, the_locals)
  File "/home/bollmann/repos/acl-anthology-py/mytest_scalene.py", line 6, in <module>
    for event, element in etree.iterparse(filename):
  File "src/lxml/iterparse.pxi", line 187, in lxml.etree.iterparse.__next__
  File "src/lxml/saxparser.pxi", line 275, in lxml.etree._ParseEventsIterator.__next__
SystemError: <class 'StopIteration'> returned a result with an exception set

Note that this exception is triggered from a different place than what I reported in the OP; the exact line that triggers an exception varies depending on how I modify my code, but it's always "[something] returned a result with an exception set".
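A self-contained variant of this reproduction could generate its own XML file instead of relying on a download. The sketch below is an assumption-laden starting point, not a confirmed reproducer: whether a synthetic file of this size triggers the exception under Scalene is untested. The helper names, the element layout, and the default of 20000 papers are all hypothetical. If lxml is not installed, the code falls back to the standard library parser so that it still runs, but in that case it does not exercise lxml at all.

```python
import os
import tempfile
import xml.etree.ElementTree as ET  # stdlib; used to write the synthetic file


def write_synthetic_xml(path, n_papers=20000):
    """Write a hypothetical <volume> containing n_papers <paper> elements."""
    root = ET.Element("volume")
    for i in range(n_papers):
        paper = ET.SubElement(
            root, "paper", {"id": str(i), "ingest-date": "2022-01-01"}
        )
        ET.SubElement(paper, "title").text = f"Paper {i}"
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def count_papers(path):
    """Iterate over the file as in the example above, touching tag and attrib."""
    try:
        from lxml import etree  # the library involved in this report
    except ImportError:
        etree = ET  # fallback so the sketch runs; does not exercise lxml
    count = 0
    for event, element in etree.iterparse(path):
        if element.tag == "paper" and element.attrib.get("ingest-date") is not None:
            count += 1
    return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        xml_path = os.path.join(tmp, "synthetic.xml")
        write_synthetic_xml(xml_path)
        print(count_papers(xml_path))
```

Saved as a script and run with Scalene in the same way as the example above, this would show whether the failure depends on the specific input file or only on its length.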

Library versions

This was tested with lxml==4.9.3. The full environment is in https://github.com/mbollmann/acl-anthology-py/blob/b82f8220f066a95608ee5ddff34f902da95bd455/poetry.lock