scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License
2.55k stars 465 forks source link

Quadratic complexity in search_dates #496

Closed lopuhin closed 4 years ago

lopuhin commented 5 years ago

Using current master (289fcc4ca936069b2c75bb5378b4136633b3edf9) we see quadratic complexity with respect to the number of date-like numbers in the input:

%time _ = search_dates(' '.join(str(2000 + i) for i in range(20)))
Wall time: 907 ms
%time _ = search_dates(' '.join(str(2000 + i) for i in range(40)))
Wall time: 3.4 s
%time _ = search_dates(' '.join(str(2000 + i) for i in range(80)))
Wall time: 14.4 s

Timings with languages=['en']) are the same.

https://github.com/scrapinghub/dateparser/issues/439 might be related

noviluni commented 4 years ago

Hi @lopuhin, I'm trying to reproduce this with the last version (dateparser==0.7.4) and I can't (furthermore, it takes less time to finish than older versions). It seems that it was fixed when we added some performance improvements in the last version.

Could you confirm that it's also fixed for you?

Thanks in advance :slightly_smiling_face:

lopuhin commented 4 years ago

Thanks @noviluni I can confirm that growth is linear now and constant is much smaller:

In [1]: from dateparser.search import search_dates

In [2]: %time _ = search_dates(' '.join(str(2000 + i) for i in range(20)))
CPU times: user 55.7 ms, sys: 3.99 ms, total: 59.7 ms
Wall time: 59.4 ms

In [3]: %time _ = search_dates(' '.join(str(2000 + i) for i in range(20)))
CPU times: user 31.1 ms, sys: 3.94 ms, total: 35 ms
Wall time: 34.6 ms

In [4]: %time _ = search_dates(' '.join(str(2000 + i) for i in range(40)))
CPU times: user 67.1 ms, sys: 122 µs, total: 67.2 ms
Wall time: 66.8 ms

In [5]: %time _ = search_dates(' '.join(str(2000 + i) for i in range(80)))
CPU times: user 135 ms, sys: 18 µs, total: 135 ms
Wall time: 135 ms

In [6]: %time _ = search_dates(' '.join(str(2000 + i) for i in range(160)))
CPU times: user 246 ms, sys: 16.1 ms, total: 262 ms
Wall time: 262 ms

Thank you! Closing