scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

Huge performance problem: `DateDataParser` stored multiple duplicate `previous_locales` #457

Closed bzamecnik closed 4 years ago

bzamecnik commented 6 years ago

We were surprised that dateparser.parse() gets slower over time (in a long-running process, ~1 day), from hundreds of ms to 3 seconds per call! One possible cause is the following:

DateDataParser.get_date_data() stores some previously used locales:

if self.try_previous_locales:
    self.previous_locales.insert(0, locale)

However, we observed that dateparser._default_parser.previous_locales contained many duplicate instances of the same set of locales, e.g. 2189 entries covering only 63 unique locales.

The problem with the code above is that self.previous_locales should behave like a set (possibly ordered), but there's no check whether locale is already in the list.
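A minimal guard would be to skip the insert when the locale is already tracked. This is only an illustrative sketch (the helper name and standalone form are mine, not the actual dateparser patch):

```python
# Sketch of the proposed one-line check: only insert a locale at the
# front of the cache if it is not already present. In dateparser this
# would guard the `self.previous_locales.insert(0, locale)` call.
def remember_locale(previous_locales, locale):
    """Track a locale most-recently-used-first, without duplicates."""
    if locale not in previous_locales:
        previous_locales.insert(0, locale)
    return previous_locales

locales = []
for loc in ["en", "fr", "en", "de", "en"]:
    remember_locale(locales, loc)
print(locales)  # -> ['de', 'fr', 'en'] (no duplicates)
```

Without the `if`, the list above would grow to five entries, and in a long-running process to thousands, which matches the 2189-entries-for-63-locales observation.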

Anyway, even a fresh run of dateparser.parse() is pretty slow (~400 ms / item). However, with only languages=['en'] it's really fast (~0.25 ms / item).

Package version: latest 0.7.0, Python 2.7.

lopuhin commented 6 years ago

Hey @bzamecnik thanks for the bug report 👍 I think this is an important issue for us as well (although I didn't try to reproduce it yet). I wanted to clarify just one thing:

Anyway, even a fresh run of dateparser.parse() is pretty slow (~400 ms / item). However, with only languages=['en'] it's really fast (~0.25 ms / item).

Does this also happen after some time, or right from the start? If it happens right away, then this PR https://github.com/scrapinghub/dateparser/pull/428 has some performance improvements that are already in master but not released to PyPI yet.

bzamecnik commented 6 years ago

Thanks for the quick reply. In one of the smaller runs with 500 strings, it happened right away. When we restricted languages to just ['en'] it was fast. The results of the mentioned PR look good. Should I make a small PR for this, or could you add the one-line check and run measurements yourself within PR #428?

lopuhin commented 6 years ago

@bzamecnik if you can make a PR, that would be awesome 👍

bzamecnik commented 6 years ago

OK, I can prepare that in the evening hopefully.

htInEdin commented 5 years ago

Just to confirm that combining the #428 master with a minimal one-line patch per @bzamecnik's suggestion (attached) helped cut a 6 1/2 day run, involving millions of dates, down to 25 minutes!

date.txt

lopuhin commented 5 years ago

Thanks, great stuff @htInEdin! I hope you and @bzamecnik don't mind if we take @bzamecnik's patch and make a PR with it? Or you can do it yourself if you wish (it could also be further improved to avoid searching a list and inserting at the beginning).
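For what it's worth, both the linear search and the front insertion can be avoided with an insertion-ordered mapping used as an MRU set. This is a sketch of one possible approach (the `RecentLocales` class is hypothetical, not dateparser code, and it assumes Python 3's `OrderedDict.move_to_end`):

```python
from collections import OrderedDict


class RecentLocales:
    """Deduplicated most-recently-used-first locale cache (illustrative sketch)."""

    def __init__(self):
        self._locales = OrderedDict()

    def add(self, locale):
        # Ensure the key exists, then move it to the front in O(1),
        # instead of an O(n) membership test plus an O(n) insert(0, ...).
        self._locales[locale] = None
        self._locales.move_to_end(locale, last=False)

    def __iter__(self):
        return iter(self._locales)


recent = RecentLocales()
for loc in ["en", "fr", "en", "de"]:
    recent.add(loc)
print(list(recent))  # -> ['de', 'en', 'fr']
```

Iteration still yields the most recently used locale first, which is the property `previous_locales` relies on.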

htInEdin commented 5 years ago


No problem, please use @bzamecnik's or adapt/improve mine as you see fit.

Thanks for a great, and now faster, library!


jfelectron commented 5 years ago

What's the status of this? dateparser is great, but unusable on large datasets with mixed locales.

lopuhin commented 5 years ago

@jfelectron that particular patch is not integrated and has no PR at the moment.

In general master has many other speed improvements, soon to be released: https://github.com/scrapinghub/dateparser/issues/494

jfelectron commented 5 years ago

@lopuhin OK, thanks.

iamtodor commented 5 years ago

UPD: Found a hack.

date_field_datetime = dateparser.parse(
    date_field,
    date_formats=[
        "%Y-%m-%dT%H:%M:%S.",
        "%Y-%m-%dT%H:%M:%S.%f",
        "%Y-%m-%dT%H:%M:%S.%fZ",
    ],
)

Hello, gents. I have the same issue with the latest version. I checked traces with https://console.cloud.google.com/traces and there is a bad surprise; here is a screenshot: https://imgur.com/a/lsMHUzm. Somehow every 9th iteration of the loop takes a huge amount of time. Can I do something to prevent this behavior?

Eveneko commented 4 years ago

Hi, I'm interested in this idea. But the requirements and goals of this GSoC project are a little unclear to me. Can I get more specific information? Or does it all depend on what I think and what I want to achieve?

Gallaecio commented 4 years ago

Yes, this is a very specific task that can be part of the http://gsoc2020.scrapinghub.com/ideas#dateparser-performance idea, but the GSoC proposal should include additional performance-driven changes, and probably thread-safety, as mentioned on the idea page.

If you want to work on a GSoC proposal for this, feel free to create a separate issue where we can list the different performance improvements that could be handled as part of GSoC.

noviluni commented 4 years ago

Hi @Gallaecio, the original issue has been fixed in the latest version, 0.7.4 (PR: https://github.com/scrapinghub/dateparser/pull/625). I think we should close this issue and keep the GSoC questions regarding performance in separate issues.

Let me know what you think and feel free to open this issue again.