scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

CS: Incorrect use of date as time #1029

Open ivopisarovic opened 2 years ago

ivopisarovic commented 2 years ago

Hi, I have found out that a date expressed as dd.mm. is incorrectly interpreted as a time.

In Czech, we use the dot . to separate days, months and years, and the colon : to separate hours, minutes and seconds. I would expect these separators to be defined somewhere in dateparser/data/date_translation_data/cs.py.

If you confirm that this is an issue, I am eager to dig into the library and prepare a PR.

print(dateparser.parse('23. 12.', languages=['cs']))
# 2021-12-23 00:00:00 -> OK

print(dateparser.parse('23.12.', languages=['cs']))
# 2021-12-21 23:12:00 -> INCORRECT, should be 2021-12-23 00:00:00

print(dateparser.parse('23:12', languages=['cs']))
# 2021-12-21 23:12:00 -> OK

Hotfix: Adding a custom date format solves this issue. However, it does not work for search_dates(), because a custom date_formats argument is not supported there.

print(dateparser.parse('23.12.', languages=['cs'], date_formats=['%d.%m.']))

Hotfix2: Adding a space between day and month using a regex solves the problem even for search_dates().

utterance = re.sub(r'\b([1-3][0-9]|[1-9])\.(1[0-2]|[1-9])(\.|\b)', r'\1. \2.', utterance)

lopuhin commented 2 years ago

print(dateparser.parse('23.12.', languages=['cs']))
# 2021-12-21 23:12:00 -> INCORRECT, should be 2021-12-23 00:00:00

+1 that this is worth fixing. Are there any locales where dots could be used as hour/minute separators?

ivopisarovic commented 2 years ago

The international standard ISO 8601:2004 uses colons to separate hours and minutes. However, the Czech national standard ČSN 01 6910 allows several forms: both dots and colons may be used as hour/minute separators, with or without leading zeros.

I think that the key difference is the trailing dot, not the separator. A date always has a dot at the end (23.12. or 23. 12.), but a time never has a trailing dot (23.12 or 23:12). The reason is that days and months in dates are always ordinal numbers in Czech, and ordinal numbers always take a trailing dot (10.), similar to 10th in English.
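
As a rough illustration of the trailing-dot rule described above, here is a minimal sketch using only the standard library. classify_cs_token is a hypothetical helper, not part of dateparser, and the value ranges it checks are assumptions made for the example.

```python
import re

# Hypothetical helper, not a dateparser API: classify a short numeric token
# by the trailing-dot rule (date = trailing dot, time = no trailing dot).
_DAY_MONTH = re.compile(r'^(\d{1,2})\.\s?(\d{1,2})\.$')
_HOUR_MINUTE = re.compile(r'^(\d{1,2})[.:](\d{2})$')


def classify_cs_token(token):
    """Return 'date', 'time', or None for a token such as '23.12.' or '23.12'."""
    token = token.strip()
    match = _DAY_MONTH.match(token)
    if match and 1 <= int(match.group(1)) <= 31 and 1 <= int(match.group(2)) <= 12:
        return 'date'
    match = _HOUR_MINUTE.match(token)
    if match and int(match.group(1)) <= 23 and int(match.group(2)) <= 59:
        return 'time'
    return None
```

With this rule, '23.12.' and '23. 12.' are treated as dates, while '23.12' and '23:12' are treated as times.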

ivopisarovic commented 2 years ago

I have checked other locales. It seems that Kazakh uses 13.01 without the trailing dots 🤦‍♂️ You will always find an exception if you want :D

https://unicode-org.github.io/cldr-staging/charts/37/verify/dates/kk.html

I think it would be better to fix this only for Czech to avoid breaking other locales.

ivopisarovic commented 2 years ago

I have just found another problem. The date order is inconsistent between parse() and search_dates(). The correct date order in Czech is DMY, as set in the language data files. However, search_dates() does not take it into account.

search_dates('12. 1.', languages=['cs'])
[('12. 1', datetime.datetime(2022, 12, 1, 0, 0))]  # incorrect
parse('12. 1.', languages=['cs'])
datetime.datetime(2022, 1, 12, 0, 0)  # correct

scriptator commented 2 years ago

I stumbled across a similar issue with the language de. However, the exact example from above does not reproduce it there. See the cases below.

Original example works for de:

>>> dateparser.parse('12. 1.', languages=['de'])
datetime.datetime(2022, 1, 12, 0, 0)

01.01. fails (as do all other strings where the numbers could be interpreted as hour/minute):

>>> dateparser.parse('01.01.', languages=['de'])
datetime.datetime(2022, 1, 11, 1, 1)

Omitting leading 0 works:

>>> dateparser.parse('1.1.', languages=['de'])
datetime.datetime(2022, 1, 1, 0, 0)
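
In the spirit of the regex hotfix earlier in this thread, the following is a minimal sketch of a pre-processing step that also covers leading zeros. space_out_day_month is a hypothetical helper, not part of dateparser. It only rewrites the text, so whether the spaced form is then parsed as a date should be checked against the installed dateparser version.

```python
import re

# Hypothetical pre-processing step, not a dateparser API.
# Matches d.m. / dd.mm. (leading zeros allowed) with a trailing dot that is
# not followed by another digit, so '01.01.2022' and '23.12' are left alone.
_DD_MM_DOT = re.compile(r'\b(0?[1-9]|[12][0-9]|3[01])\.(0?[1-9]|1[0-2])\.(?!\d)')


def space_out_day_month(text):
    """Insert a space after the day so '01.01.' becomes '01. 01.'."""
    return _DD_MM_DOT.sub(r'\1. \2.', text)
```

The result can then be passed to dateparser.parse() or search_dates() in place of the original string.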

PS: Fixing this via date_formats does not work for me, because then PREFER_DATES_FROM: past is not taken into account. Hence this issue blocks me from upgrading to dateparser >= 1.0.0.