Open ivanprado opened 4 years ago
Hi @ivanprado!
The order currently used is the one defined in dateparser/dateparser/data/languages_info.py
(FYI, this order is being questioned in https://github.com/scrapinghub/dateparser/issues/714 and will probably change).
However, I agree that people using the languages parameter could expect it to respect the given order. So we should probably change this behavior or document it.
This could be addressed by adding a setting (called, for example, USE_GIVEN_LANGUAGE_ORDER) and possibly making it True by default. The logic to implement this is practically finished: the DateDataParser object already has the use_given_order property when it is created, so we only need to add a small amount of code and tests to make the setting work.
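A minimal sketch of how such a setting could switch between the two orderings. This is not dateparser's actual code: resolve_language_order and DEFAULT_LANGUAGE_ORDER are hypothetical names, and the ranking shown is only an illustrative stand-in for the one in languages_info.py.

```python
from typing import Iterable, List, Sequence

# Illustrative stand-in for the built-in ranking; the real list lives in
# dateparser/dateparser/data/languages_info.py and may differ.
DEFAULT_LANGUAGE_ORDER: Sequence[str] = ("en", "ru", "es", "tr", "fa")


def resolve_language_order(
    languages: Iterable[str],
    use_given_order: bool = False,
    default_order: Sequence[str] = DEFAULT_LANGUAGE_ORDER,
) -> List[str]:
    """Return the order in which languages would be tried (hypothetical helper)."""
    # Drop duplicates while keeping the first occurrence of each code.
    unique = list(dict.fromkeys(languages))
    if use_given_order:
        # Respect the caller's order exactly.
        return unique
    rank = {code: position for position, code in enumerate(default_order)}
    # sorted() is stable, so unranked languages keep their relative order at the end.
    return sorted(unique, key=lambda code: rank.get(code, len(rank)))


print(resolve_language_order(["es", "en"]))                        # ['en', 'es']
print(resolve_language_order(["es", "en"], use_given_order=True))  # ['es', 'en']
```

With the flag off, 'en' jumps ahead of 'es' no matter where the caller put it, which is the behavior reported in this issue.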
I will tag this as good_first_issue, and I expect it to be solved before the end of October (or should I say "Hacktoberfest"?) :slightly_smiling_face:
Thank you for your comment! :smile:
Thank you @noviluni, this sounds great :smile: .
Just to give you more context: the languages provided to dateparser can come from a language detection model run over the page (this is my case). That gives you a list of languages ordered by probability: the first one is the most probable, then the second one, and so on.
So the idea is to pass them to dateparser as a language hint, and the order matters in this case. USE_GIVEN_LANGUAGE_ORDER will probably help in the case I'm describing. :+1:
Hi @noviluni @ivanprado ,
I opened a PR for this. Please take a look and let me know what you think of the implementation; if the approach is fine, I will fix any problems and add some tests for it as well.
I believe the default should be False so that we keep making use of the list of most common languages, @noviluni.
We can close this issue because it was fixed in https://github.com/scrapinghub/dateparser/pull/805 and https://github.com/scrapinghub/dateparser/issues/845
In [1]: import dateparser
In [2]: dateparser.parse("11/12", languages=['en'])
Out[2]: datetime.datetime(2022, 11, 12, 0, 0)
In [3]: dateparser.parse("11/12", languages=['es'])
Out[3]: datetime.datetime(2022, 12, 11, 0, 0)
In [4]: dateparser.parse("11/12", languages=['es', 'en'])
Out[4]: datetime.datetime(2022, 11, 12, 0, 0)
In [5]: dateparser.parse("11/12", languages=['en', 'es'])
Out[5]: datetime.datetime(2022, 11, 12, 0, 0)
In [6]: dateparser.__version__
Out[6]: '1.1.4'
@serhii73 Your output shows that it is still not fixed. Out[4] will match Out[3] once the issue is fixed.
Yes, you're right @Gallaecio
I think there is something wrong in dateparser's prioritization of languages: adding 'en', even in the last position, hurts the extraction of dates that were extracted properly when English was not in the list.
This is right
This is also right, because the standard in Spain is DD/MM. But now if we add English to the languages list in the last position...
It gets parsed as in English, even though Spanish is first in the list of languages. This is unexpected to me; I would have expected Spanish to be prioritized.
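The expected behavior can be illustrated with a toy parser that tries each language's day/month convention in the order given. This is only a sketch of the expectation, not how dateparser works internally: parse_numeric_date and the DATE_ORDER table are hypothetical, and the year is passed in explicitly to keep the example deterministic.

```python
from datetime import datetime
from typing import Optional, Sequence

# Illustrative day/month conventions for the two languages discussed here.
DATE_ORDER = {"en": "MDY", "es": "DMY"}


def parse_numeric_date(text: str, languages: Sequence[str], year: int) -> Optional[datetime]:
    """Parse 'A/B' using the convention of the first language that yields a valid date."""
    first, second = (int(part) for part in text.split("/"))
    for language in languages:
        order = DATE_ORDER.get(language)
        if order == "DMY":
            day, month = first, second
        elif order == "MDY":
            month, day = first, second
        else:
            continue  # Unknown language: skip it.
        try:
            return datetime(year, month, day)
        except ValueError:
            continue  # Not a valid date under this convention; try the next language.
    return None


print(parse_numeric_date("11/12", ["es", "en"], 2022))  # 2022-12-11 00:00:00
print(parse_numeric_date("11/12", ["en", "es"], 2022))  # 2022-11-12 00:00:00
```

Under this model, adding English after Spanish changes nothing for "11/12"; English is only consulted when the Spanish reading fails.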