scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

Wrong prioritization of languages #770

Open ivanprado opened 4 years ago

ivanprado commented 4 years ago

I think there is something wrong with how dateparser prioritizes languages: adding 'en', even in the last position, breaks the parsing of dates that were parsed correctly when English was not in the list.

import dateparser
dateparser.parse("11/12", languages=['en'])
Out[3]: datetime.datetime(2020, 11, 12, 0, 0)

This is right

dateparser.parse("11/12", languages=['es'])
Out[4]: datetime.datetime(2020, 12, 11, 0, 0)

This is also right, because the standard in Spain is DD/MM. But now, if we add English to the languages list in the last position...

dateparser.parse("11/12", languages=['es', 'en'])
Out[5]: datetime.datetime(2020, 11, 12, 0, 0)

The date is parsed as in English, even though Spanish comes first in the list of languages. This is unexpected; I would have expected Spanish to take priority.

noviluni commented 4 years ago

Hi @ivanprado!

The order currently used is the one defined in dateparser/dateparser/data/languages_info.py. FYI, this order is being questioned in this issue (https://github.com/scrapinghub/dateparser/issues/714) and will probably change.

However, I agree that people using the languages parameter could expect it to respect the order they pass in. So we should probably change this behavior or document it.

This could be addressed by adding a setting (called, for example, USE_GIVEN_LANGUAGE_ORDER) and possibly making it True by default. The logic to implement this is practically finished, as we have the use_given_order property when creating the DateDataParser object, and we should just add a small portion of code and tests to allow this setting to work.
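To make the difference concrete, here is a minimal sketch of the two ordering behaviours that such a setting would switch between. It is not dateparser's code: `DEFAULT_PRIORITY` and `order_languages` are hypothetical names, and the priority list is an invented stand-in for the real order in languages_info.py.

```python
# Hypothetical stand-in for the library's built-in priority list; the real
# order lives in languages_info.py and is not reproduced here.
DEFAULT_PRIORITY = ["en", "ru", "es", "fr"]


def order_languages(requested, use_given_order=False):
    """Return the order in which the requested languages would be tried."""
    # Drop duplicates while keeping the caller's order.
    unique = list(dict.fromkeys(requested))
    if use_given_order:
        return unique
    # Otherwise sort by position in the built-in priority list; unknown
    # languages go last, and ties keep the caller's order (stable sort).
    rank = {lang: i for i, lang in enumerate(DEFAULT_PRIORITY)}
    return sorted(unique, key=lambda lang: rank.get(lang, len(rank)))


print(order_languages(["es", "en"]))                        # built-in priority
print(order_languages(["es", "en"], use_given_order=True))  # caller's order
```

With the assumed priority list, the first call puts 'en' ahead of 'es' even though it was passed last, which is the behaviour reported in this issue; the second call keeps 'es' first.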

I will tag this as good_first_issue and I expect this to be solved before the end of October (or should I say "Hacktoberfest")? :slightly_smiling_face:

Thank you for your comment! :smile:

ivanprado commented 4 years ago

Thank you @noviluni, this sounds great :smile: .

Just to give you more context: the languages provided to dateparser can come from a language detector model run over the page (this is my case). This gives you a list of languages ordered by probability: the first one is the most probable, then the second one, etc.

The idea is to pass them to dateparser as a language hint, so the order matters in this case. USE_GIVEN_LANGUAGE_ORDER will probably help in the case I'm describing. :+1:
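The use case can be sketched with the standard library alone. This is only an illustration of order-respecting parsing, not how dateparser works internally: `DATE_FORMATS` and `parse_with_hints` are hypothetical names, and the detector scores are made up.

```python
from datetime import datetime

# Assumed mapping from language code to the day/month order of numeric dates.
DATE_FORMATS = {"en": "%m/%d", "es": "%d/%m"}


def parse_with_hints(text, scored_languages, year):
    """Try languages from most to least probable; return the first match."""
    ranked = sorted(scored_languages, key=scored_languages.get, reverse=True)
    for lang in ranked:
        fmt = DATE_FORMATS.get(lang)
        if fmt is None:
            continue  # no known numeric format for this language
        try:
            # Append the year so that strptime validates a complete date.
            parsed = datetime.strptime(f"{text}/{year}", f"{fmt}/%Y")
        except ValueError:
            continue  # not a valid date in this order, try the next language
        return parsed, lang
    return None, None


# Made-up detector output: Spanish is the most probable language.
scores = {"es": 0.8, "en": 0.2}
print(parse_with_hints("11/12", scores, 2020))  # ambiguous, Spanish order wins
print(parse_with_hints("12/25", scores, 2020))  # invalid as DD/MM, falls back
```

The most probable language decides ambiguous inputs, and less probable ones only act as fallbacks when the input is not a valid date in the preferred order.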

mirceachira commented 3 years ago

Hi @noviluni @ivanprado ,

I opened a PR for this. Please take a look and let me know what you think of the implementation; if the approach looks good, I will fix it up and add some tests as well.

@noviluni, I believe the default should be False, so that we keep using the list of most common languages by default.

serhii73 commented 1 year ago

We can close this issue because it was fixed in https://github.com/scrapinghub/dateparser/pull/805 and https://github.com/scrapinghub/dateparser/issues/845

In [1]: import dateparser

In [2]: dateparser.parse("11/12", languages=['en'])
Out[2]: datetime.datetime(2022, 11, 12, 0, 0)

In [3]: dateparser.parse("11/12", languages=['es'])
Out[3]: datetime.datetime(2022, 12, 11, 0, 0)

In [4]: dateparser.parse("11/12", languages=['es', 'en'])
Out[4]: datetime.datetime(2022, 11, 12, 0, 0)

In [5]: dateparser.parse("11/12", languages=['en', 'es'])
Out[5]: datetime.datetime(2022, 11, 12, 0, 0)

In [6]: dateparser.__version__
Out[6]: '1.1.4'

Gallaecio commented 1 year ago

@serhii73 Your output shows that it is still not fixed. Out[4] will match Out[3] once the issue is fixed.

serhii73 commented 1 year ago

Yes, you're right @Gallaecio