scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License
2.57k stars 465 forks

importing dateparser is too slow #1051

Open anarcat opened 2 years ago

anarcat commented 2 years ago

hi!

first, thanks for this awesome project, it's really useful and powerful and i am grateful to not have to write this stuff myself. :)

i open this issue because there seems to be an inherent performance cost paid whenever we even load the dateparser library:

anarcat@curie:undertime(main)$ multitime -n 10 -s 0 -q python3 -c "import dateparser"
===> multitime results
1: -q python3 -c "import dateparser"
            Mean        Std.Dev.    Min         Median      Max
real        0.328       0.008       0.319       0.326       0.350       
user        0.313       0.009       0.299       0.315       0.331       
sys         0.013       0.008       0.000       0.014       0.028       

compare with similar libraries:

anarcat@curie:undertime(main)$ multitime -n 10 -s 0 -q python3 -c "import parsedatetime"
===> multitime results
1: -q python3 -c "import parsedatetime"
            Mean        Std.Dev.    Min         Median      Max
real        0.072       0.008       0.069       0.070       0.096       
user        0.065       0.011       0.050       0.062       0.095       
sys         0.008       0.005       0.000       0.008       0.019       
anarcat@curie:undertime(main)$ multitime -n 10 -s 0 -q python3 -c "import arrow"
===> multitime results
1: -q python3 -c "import arrow"
            Mean        Std.Dev.    Min         Median      Max
real        0.064       0.006       0.061       0.062       0.081       
user        0.055       0.006       0.042       0.054       0.064       
sys         0.009       0.006       0.000       0.010       0.019       

a quick profile suggests it spends an inordinate amount of time compiling regular expressions:

anarcat@curie:undertime(main)$ python3 -m cProfile -s cumulative <(echo "import dateparser") | head -50
         627961 function calls (598845 primitive calls) in 0.558 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        9    0.000    0.000    0.593    0.066 __init__.py:1(<module>)
     60/1    0.000    0.000    0.558    0.558 {built-in method builtins.exec}
        1    0.000    0.000    0.558    0.558 63:1(<module>)
     73/1    0.000    0.000    0.558    0.558 <frozen importlib._bootstrap>:1002(_find_and_load)
     73/1    0.000    0.000    0.558    0.558 <frozen importlib._bootstrap>:967(_find_and_load_unlocked)
     70/1    0.000    0.000    0.557    0.557 <frozen importlib._bootstrap>:659(_load_unlocked)
     58/1    0.000    0.000    0.557    0.557 <frozen importlib._bootstrap_external>:784(exec_module)
     93/1    0.000    0.000    0.557    0.557 <frozen importlib._bootstrap>:220(_call_with_frames_removed)
        1    0.000    0.000    0.556    0.556 date.py:1(<module>)
        1    0.000    0.000    0.519    0.519 date_parser.py:1(<module>)
        1    0.000    0.000    0.500    0.500 timezone_parser.py:1(<module>)
     1901    0.034    0.000    0.485    0.000 regex.py:451(_compile)
      795    0.004    0.000    0.477    0.001 regex.py:349(compile)
      770    0.002    0.000    0.357    0.000 timezone_parser.py:56(build_tz_offsets)
      769    0.006    0.000    0.345    0.000 timezone_parser.py:58(get_offset)
 2255/755    0.008    0.000    0.159    0.000 _regex_core.py:382(_parse_pattern)

basically, it seems we're spending a lot of time compiling regular expressions. individually those don't matter much (percall ≈ 1ms), but we seem to be doing hundreds of them. i think it might be related to the timezone_parser.py file (build_tz_offsets?), but i stopped digging there.
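as a rough sanity check of that hypothesis, here's a standalone sketch (not dateparser's actual patterns) showing that compiling a few hundred distinct regexes at import time adds up to measurable wall-clock time:

```python
import re
import time

# compile a few hundred distinct patterns, roughly what an
# import-time loop over timezone offsets would do
patterns = [r"\b(UTC[+-]%02d:00)\b" % i for i in range(300)]

start = time.perf_counter()
compiled = [re.compile(p) for p in patterns]
elapsed = time.perf_counter() - start

print("compiled %d patterns in %.3fs" % (len(compiled), elapsed))
```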

the exact source is a little beside the point: shouldn't just importing the module be safe enough, performance-wise? i know we load a default parser, but that's not what's eating us here, but rather a bunch of globals in timezone_parser.py... it seems to me those could be lazily loaded, at least?
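for illustration, a minimal sketch of what "lazily loaded" could look like here (hypothetical names, not dateparser's actual code): the module-level global becomes a cached function, so the regexes are only compiled on first use rather than at import:

```python
import re

# built on first access instead of at import time, unlike the
# module-level globals in timezone_parser.py
_tz_offsets = None

def get_tz_offsets():
    """Build and cache the offset regexes on first call."""
    global _tz_offsets
    if _tz_offsets is None:
        # stand-in for the real build_tz_offsets() work
        _tz_offsets = [
            (name, re.compile(r"\b%s\b" % re.escape(name)))
            for name in ("UTC", "EST", "PST")
        ]
    return _tz_offsets
```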

anarcat commented 2 years ago

oh and in case you're wondering why this matters to me: i wrote a tool called undertime that gives you different times in different zones, as a one-shot commandline tool. most of its runtime is spent building those regexes it doesn't use. :)

i'm now lazily loading dateparser itself, but the user can definitely "feel" it when execution hits that corner case.
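the lazy-loading workaround is just moving the import inside the function that needs it, so the cost is paid only on the code path that actually parses a date. a sketch of the pattern (using a stdlib module as a stand-in for `import dateparser`; undertime's real code may differ):

```python
def render(data):
    # defer the costly import until this code path actually runs;
    # `json` here stands in for `import dateparser` in undertime
    import json
    return json.dumps(data)
```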

mlissner commented 1 year ago

We use a lot of data objects in our libraries that usually load from JSON, and moving them from load-on-import to lazy-load has been helpful. It's not too hard, and it's been reliable for us.
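one common shape for that pattern (a sketch, not mlissner's actual code) is to hide the parse behind a cached function, so the JSON is read once, on first use, instead of at import:

```python
import functools
import json

# stand-in for a large data file that used to be parsed at import time
_RAW = '{"tz_aliases": {"EST": "-05:00", "PST": "-08:00"}}'

@functools.lru_cache(maxsize=None)
def get_data():
    """Parse the JSON once, on first call, and cache the result."""
    return json.loads(_RAW)
```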