Closed: bzamecnik closed this issue 4 years ago.
Hey @bzamecnik thanks for the bug report 👍 I think this is an important issue for us as well (although I didn't try to reproduce it yet). I wanted to clarify just one thing:
> Anyway, even when we run dateparser.parse() it's pretty slow (~400 ms / item). However, with only languages=['en'] it's really fast (~0.25 ms / item).
Does this also happen after some time, or right from the start? If it happens right away, then this PR https://github.com/scrapinghub/dateparser/pull/428 has some performance improvements that are already in master but not released to PyPI yet.
Thanks for the quick reply. In one of the smaller runs with 500 strings, it happened right away. When we restricted languages to just ['en'], it was fast. The results of the mentioned PR look good. Should I make a small PR for this, or can you add the one-line check and run measurements yourself within PR #428?
@bzamecnik if you can make a PR, that would be awesome 👍
OK, I can prepare that in the evening hopefully.
Just to confirm: combining the #428 master with a minimal one-line patch from @bzamecnik (attached) cut a 6½-day run, involving millions of dates, down to 25 minutes!
Thanks, great stuff @htInEdin! I hope you and @bzamecnik don't mind if we take @bzamecnik's patch and make a PR with that? Or you can do it yourself if you wish (also, it could maybe be further improved to avoid having to search in a list and insert at the beginning).
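The "avoid having to search in a list and insert at the beginning" idea can be sketched with the stdlib. This is a hypothetical illustration, not dateparser's actual code: an `OrderedDict` gives O(1) membership tests and O(1) move-to-front, versus the O(n) `remove()` plus O(n) `insert(0, ...)` that a plain list needs.

```python
from collections import OrderedDict

class MRULocales:
    """Illustrative most-recently-used locale ordering (names are made up)."""

    def __init__(self):
        self._locales = OrderedDict()  # keys kept in most-recently-used order

    def promote(self, locale):
        # Adding an existing key does not duplicate it; move_to_end with
        # last=False shifts it to the front in O(1).
        self._locales[locale] = True
        self._locales.move_to_end(locale, last=False)

    def __iter__(self):
        return iter(self._locales)

mru = MRULocales()
for loc in ("en", "de", "en", "fr", "de"):
    mru.promote(loc)
print(list(mru))  # ['de', 'fr', 'en']
```

On Python 3.7+ a plain `dict` preserves insertion order too, but only `OrderedDict` offers `move_to_end`.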
Konstantin Lopuhin writes:
> Thanks, great stuff @htInEdin! I hope you and @bzamecnik don't mind if we take @bzamecnik's patch and make a PR with that? Or you can do it yourself if you wish (also, it could maybe be further improved to avoid having to search in a list and insert at the beginning).
No problem, please use @bzamecnik's or adapt/improve mine as you see fit.
Thanks for a great, and now faster, library!
ht -- Henry S. Thompson, School of Informatics, University of Edinburgh 10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ [mail from me always has a .sig like this -- mail without it is forged spam]
What's the status of this? dateparser is great, but unusable on large datasets with mixed locales.
@jfelectron that particular patch is not integrated and has no PR at the moment.
In general master has many other speed improvements, soon to be released: https://github.com/scrapinghub/dateparser/issues/494
@lopuhin OK, thanks.
UPD: found a workaround: passing explicit date_formats.

```python
date_field_datetime = dateparser.parse(
    date_field,
    date_formats=[
        "%Y-%m-%dT%H:%M:%S",
        "%Y-%m-%dT%H:%M:%S.%f",
        "%Y-%m-%dT%H:%M:%S.%fZ",
    ],
)
```
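When the timestamps really do come in a handful of fixed formats, the same trick can be reproduced with the stdlib alone, which skips dateparser's locale detection entirely. A minimal sketch (the function name and format list are illustrative, not part of dateparser's API):

```python
from datetime import datetime

# Candidate formats, tried in order; adjust to your data.
FORMATS = [
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S.%fZ",
]

def parse_fixed(value):
    """Return a datetime for the first matching format, or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # leftover characters or a mismatch; try the next format
    return None

print(parse_fixed("2019-05-01T12:30:45.123456Z"))  # 2019-05-01 12:30:45.123456
```

`strptime` raises `ValueError` both on a mismatch and on unconverted trailing data, so the loop naturally falls through to the more specific formats.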
Hello, gents. I have the same issue with the latest version. I checked traces with https://console.cloud.google.com/traces and there is a bad surprise.
Here is a screenshot: https://imgur.com/a/lsMHUzm. Somehow every 9th iteration of the loop takes a huge amount of time.
Can I do something to prevent such behavior?
Hi, I'm interested in this idea. But the requirements and goals of this project on GSoC are a little bit specific. Can I get more specific information? Or does it all depend on what I think and what I want to achieve?
Yes, this is a very specific task that can be part of the http://gsoc2020.scrapinghub.com/ideas#dateparser-performance idea, but the GSoC idea should include additional performance-driven changes, and probably thread-safety, as mentioned in the idea page.
If you want to work on a GSoC proposal for this, feel free to create a separate issue where we can list different performance improvements that could be handled as part of GSoC.
Hi @Gallaecio, the original issue has been fixed in the latest release, 0.7.4 (PR: https://github.com/scrapinghub/dateparser/pull/625). I think we should close this issue and track the GSoC performance questions in separate issues.
Let me know what you think, and feel free to reopen this issue.
We were surprised that dateparser.parse() gets slower over time (in a long-running process, ~1 day), from hundreds of ms to 3 seconds per call! One possible cause: DateDataParser.get_date_data() stores some previously used locales. However, we observed that dateparser._default_parser.previous_locales contained many duplicate instances of the same set of locales, e.g. 2189 instances of 63 unique locales. The problem is that self.previous_locales should behave like a (possibly ordered) set, but there is no check whether locale is already inside.

Anyway, even when we run dateparser.parse() it's pretty slow (~400 ms / item). However, with only languages=['en'] it's really fast (~0.25 ms / item).

Package version: latest 0.7.0, Python 2.7.
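The attached one-line patch itself is not reproduced in this thread. As a hedged sketch only (class and method names here are illustrative, not dateparser's internals), the duplicate-accumulation problem and the kind of membership check described above look like this:

```python
class LocaleCache:
    """Illustrative most-recently-used locale list."""

    def __init__(self):
        self.previous_locales = []  # most recently used first

    def promote_buggy(self, locale):
        # No membership check: the same locale is re-inserted on every call,
        # so the list (and every later scan over it) grows without bound.
        self.previous_locales.insert(0, locale)

    def promote_fixed(self, locale):
        # The style of one-line fix described: move the locale to the front
        # instead of duplicating it.
        if locale in self.previous_locales:
            self.previous_locales.remove(locale)
        self.previous_locales.insert(0, locale)

buggy, fixed = LocaleCache(), LocaleCache()
for _ in range(1000):
    for loc in ("en", "de", "fr"):
        buggy.promote_buggy(loc)
        fixed.promote_fixed(loc)

print(len(buggy.previous_locales))  # 3000 entries, mostly duplicates
print(len(fixed.previous_locales))  # 3, one per unique locale
```

This also explains the observed slowdown curve: each parse call scans `previous_locales` front to back, so the per-call cost grows linearly with the number of accumulated duplicates.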