Open ivopisarovic opened 2 years ago
import dateparser

print(dateparser.parse('23.12.', languages=['cs']))
# 2021-12-21 23:12:00 -> INCORRECT, should be 2021-12-23 00:00:00
+1 that this is worth fixing. Are there any locales where dots could be used as hour/minute separators?
The international standard ISO 8601:2004 uses colons to separate hours and minutes. However, the Czech national standard ČSN 01 6910 allows several forms. Both dots and colons may be used as hour/minute separators, with or without leading zeros:
0.00, 7.30, 23.55
0:00, 7:30, 23:55
00:00, 07:30, 23:55 (international ISO 8601:2004)
I think that the key difference is the trailing dot, not the separator. A date always has a dot at the end (23.12. or 23. 12.), but a time never has a trailing dot (23.12 or 23:12). The reason is that days and months in dates are always ordinal numbers in Czech, and ordinal numbers always have a trailing dot (10.), similar to 10th in English.
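To make the trailing-dot rule concrete, here is a minimal sketch of how it could be expressed. The helper `classify_cs` and both regexes are hypothetical illustrations and not part of dateparser; the sketch only covers bare day/month and hour/minute strings.

```python
import re

# Hypothetical helper, not part of dateparser: apply the Czech
# trailing-dot rule to a short numeric string.
DATE_RE = re.compile(r"^\s*(\d{1,2})\.\s*(\d{1,2})\.\s*$")  # 23.12. or 23. 12.
TIME_RE = re.compile(r"^\s*(\d{1,2})[.:](\d{2})\s*$")       # 23.12 or 23:12


def classify_cs(text):
    """Return ('date', day, month), ('time', hour, minute), or None."""
    m = DATE_RE.match(text)
    if m:
        day, month = int(m.group(1)), int(m.group(2))
        if 1 <= day <= 31 and 1 <= month <= 12:
            return ("date", day, month)
        return None
    m = TIME_RE.match(text)
    if m:
        hour, minute = int(m.group(1)), int(m.group(2))
        if 0 <= hour <= 23 and 0 <= minute <= 59:
            return ("time", hour, minute)
    return None


print(classify_cs("23.12."))  # trailing dot, treated as a date
print(classify_cs("23.12"))   # no trailing dot, treated as a time
```

The date pattern is tried first, so a string is only treated as a time when it lacks the trailing dot.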
I have checked other locales. It seems that Kazakh uses 13.01 without the trailing dot 🤦‍♂️ You will always find an exception if you want :D
https://unicode-org.github.io/cldr-staging/charts/37/verify/dates/kk.html
I think it would be better to fix this only for Czech to avoid breaking other locales.
I have just found another problem. The date order is inconsistent between the parse and search methods. The correct date order in Czech is DMY, as set in the language data files. However, search does not take it into account.
>>> from dateparser import parse
>>> from dateparser.search import search_dates
>>> search_dates('12. 1.', languages=['cs'])
[('12. 1', datetime.datetime(2022, 12, 1, 0, 0))]  # incorrect
>>> parse('12. 1.', languages=['cs'])
datetime.datetime(2022, 1, 12, 0, 0)  # correct
I stumbled across a similar issue with language de. However, the exact example from above does not fail there; see below.
The original example works for de:
>>> dateparser.parse('12. 1.', languages=['de'])
datetime.datetime(2022, 1, 12, 0, 0)
01.01. fails, as do all other strings where the numbers could be interpreted as hour/minute:
>>> dateparser.parse('01.01.', languages=['de'])
datetime.datetime(2022, 1, 11, 1, 1)
Omitting leading 0 works:
>>> dateparser.parse('1.1.', languages=['de'])
datetime.datetime(2022, 1, 1, 0, 0)
PS: Fixing this via date_formats does not work for me because PREFER_DATES_FROM: past is then not taken into account. Hence this issue blocks me from upgrading to dateparser >= 1.0.0.
Hi, I have found out that a date expressed as dd.mm. is incorrectly interpreted as a time.
In Czech, we use the dot (.) to separate days, months and years. We use the colon (:) to separate hours, minutes and seconds. I would expect the separators to be defined somewhere in dateparser/data/date_translation_data/cs.py.
If you confirm that this is an issue, I am eager to dig into the library and prepare a PR.
Hotfix: Adding a custom date format solves this issue. However, it does not work for search_dates(), as a custom date_formats attribute is not supported there.
Hotfix2: Adding a space between day and month using a regex solves the problem even for search_dates().
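A rough sketch of what the regex-based hotfix could look like. The helper name `space_day_month` and the exact pattern are assumptions for illustration, not the code actually used. The lookahead is an extra precaution added here so that full dates such as 23.12.2021 are left untouched.

```python
import re

# Hypothetical pre-processing helper (not part of dateparser): insert a
# space after the day's dot in "dd.mm." so the pair no longer looks like
# "hh.mm". The (?!\d) lookahead skips full dates such as 23.12.2021.
_DAY_MONTH = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(?!\d)")


def space_day_month(text):
    """Rewrite 'dd.mm.' as 'dd. mm.' wherever it occurs in text."""
    return _DAY_MONTH.sub(r"\1. \2.", text)


print(space_day_month("Sejdeme se 23.12. v 18.30"))
```

The normalized string can then be passed to parse() or search_dates() with languages=['cs']. Note that a time written with a dot at the very end of a sentence would also be rewritten, so this remains a workaround rather than a fix.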