plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
Apache License 2.0

Revised regex for magics removal #875

Closed: emeryberger closed this 3 weeks ago

emeryberger commented 3 weeks ago

The previous regex, meant to strip Jupyter notebook 'magics', could in some edge cases rewrite ordinary Python code, causing parsing to fail with an IndentationError. This PR updates the regex to prevent that from happening.

emeryberger commented 3 weeks ago

Fixes https://github.com/plasma-umass/scalene/issues/671.

emeryberger commented 3 weeks ago

For posterity: this code (from Python 3.12's socket.py) leads to the above-mentioned problem.

import _socket

class socket(_socket.socket):

    def __repr__(self):
        """Wrap __repr__() to reveal the real class name and socket
        address(es).
        """
        closed = getattr(self, '_closed', False)
        s = "<%s.%s%s fd=%i, family=%s, type=%s, proto=%i" \
            % (self.__class__.__module__,
               self.__class__.__qualname__,
               " [closed]" if closed else "",
               self.fileno(),
               self.family,
               self.type,
               self.proto)

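The culprit is the interaction between the backslash line continuation and the % that begins the following line: the old regex mistook that leading % for a magic. As a result, part of this code was incorrectly transformed to the following (note that the % has been replaced by a #, which comments out the rest of that line and leaves the lines after it unexpectedly indented):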

        s = "<%s.%s%s fd=%i, family=%s, type=%s, proto=%i" \
            # (self.__class__.__module__,

With the new regex, the % is correctly left unchanged.
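
The failure mode can be reproduced with a short sketch. Both patterns below are illustrative assumptions, not the regexes from this PR or from Scalene's source: `NAIVE_MAGIC` treats any line-leading `%` as a magic, while `STRICTER_MAGIC` additionally requires an identifier character right after the `%`. `Demo` is a pared-down, hypothetical stand-in for the `socket.py` excerpt above.

```python
import ast
import re

# Pared-down stand-in for the socket.py excerpt: a backslash continuation
# followed by a line whose first non-blank character is '%'.
SOURCE = '''\
class Demo:
    def __repr__(self):
        closed = False
        s = "<%s%s" \\
            % (self.__class__.__name__,
               " [closed]" if closed else "")
        return s
'''

# Hypothetical patterns for illustration only (NOT Scalene's actual regexes).
# Naive: any '%' at the start of a line (after indentation) is a magic.
NAIVE_MAGIC = re.compile(r"^([ \t]*)%", re.MULTILINE)
# Stricter: one or two '%' that are immediately followed by a name character.
STRICTER_MAGIC = re.compile(r"^([ \t]*)%{1,2}(?=[A-Za-z_])", re.MULTILINE)


def strip_magics(source: str, pattern: re.Pattern) -> str:
    """Turn lines matched by `pattern` into comments, keeping indentation."""
    return pattern.sub(r"\1#", source)


def parses(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        ast.parse(source)
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False
    return True


naive_result = strip_magics(SOURCE, NAIVE_MAGIC)
stricter_result = strip_magics(SOURCE, STRICTER_MAGIC)

print(parses(naive_result))     # the '%' operator was turned into a comment
print(parses(stricter_result))  # the continuation line is left untouched
```

Under these assumptions, the naive pattern produces source that `ast.parse` rejects, while the stricter one leaves the continuation line intact and still comments out a genuine magic such as `%timeit`. A pattern like this would still misfire on a continuation line whose `%` is directly followed by a name, so it is a heuristic rather than a complete fix.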