replace tokenize with ast to parse the source code

arnaud-ma commented 2 months ago

Instead of parsing by hand token by token, ast returns a python object that just needs to be read. We can still make mistakes when reading this object, but no more inevitable edge cases from parsing the source code.

Closes #97 Closes #130

swansonk14 commented 2 weeks ago

Hi @arnaud-ma,

Thank you so much for your PR! We really like your ast approach to resolving the multiline assignment issue.

After studying this PR, we realized that some of the ideas in this PR could be applied with fewer changes. This is implemented in: https://github.com/swansonk14/typed-argument-parser/pull/148.

Best, Jesse and Kyle

arnaud-ma commented 2 weeks ago

Hi, thanks for the update! The idea behind this PR was to completely replace the use of tokenize with ast. This solves all possible bugs with code parsing. I also think (to test) that there should be a speed up in terms of performance. Moreover in the new PR, the source code is read twice (with inspect.getsource): once for tokenize and once for ast. And it is this reading that is (by far) the most expensive in performance. But I understand that what i proposed was too much change at once

martinjm97 commented 2 weeks ago

Oh, very interesting! I'd be interested in knowing whether AST resolves: https://github.com/swansonk14/typed-argument-parser/issues/45. Either you/(Kyle and I) can compare the performance of code using only ast to the current code. If it resolves performance issues, then I think that probably justifies a larger code change. Sorry we missed this aspect of your PR. Thanks again for the contribution.

arnaud-ma commented 2 weeks ago

Given this code where NewTap is from my PR and Tap is from before #148:

import timeit

from tap import Tap
from tap.tap import NewTap
from tap.utils import _old_get_class_variables, get_class_variables

class SimpleArgumentParser(Tap):
    name: str  # Your name
    language: str = "Python"  # Programming language
    package: str = "Tap"  # Package name
    stars: int  # Number of stars
    max_stars: int = 5  # Maximum stars

class NewSimpleArgumentParser(NewTap):
    name: str  # Your name
    language: str = "Python"  # Programming language
    package: str = "Tap"  # Package name
    stars: int  # Number of stars
    max_stars: int = 5  # Maximum stars

if __name__ == "__main__":
    number = 1000
    number_cls = 100

    t_new_cls = timeit.timeit(
        "NewSimpleArgumentParser()",
        globals=globals(),
        number=number_cls,
    )

    t_old_cls = timeit.timeit(
        "SimpleArgumentParser()",
        globals=globals(),
        number=number_cls,
    )

    t_new = timeit.timeit(
        "get_class_variables(SimpleArgumentParser)",
        globals=globals(),
        number=number,
    )
    t_old = timeit.timeit(
        "_old_get_class_variables(SimpleArgumentParser)",
        globals=globals(),
        number=number,
    )

    increase = (t_old - t_new) / t_old
    increase_cls = (t_old_cls - t_new_cls) / t_old_cls
    print(f"time new get_class_variables: {t_new / number * 1000:.4f} ms per call")
    print(f"time old get_class_variables: {t_old / number_cls * 1000:.4f} ms per call")
    print(f"new is {increase * 100:.2f}% faster")

    print(f"time new class init: {t_new_cls / number * 1000:.4f} ms per call")
    print(f"time old class init: {t_old_cls / number_cls * 1000:.4f} ms per call")
    print(f"new is {increase_cls * 100:.2f}% faster")

I get:

time new get_class_variables: 0.3262 ms per call
time old get_class_variables: 0.5686 ms per call
new is 42.63% faster
time new class init: 17.0593 ms per call
time old class init: 21.7422 ms per call
new is 21.54% faster

It's better, but it won't solve #45 since it is inspect.getsource that takes the majority of the remaining time

arnaud-ma commented 2 weeks ago

I made a specific PR (#149) related to performance, you can ignore this one

martinjm97 commented 2 weeks ago

Perfect! Thank you for the update! Kyle and I will discuss it next time we meet. I approved running the tests.

swansonk14 / typed-argument-parser

replace tokenize with ast to parse the source code #144