scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

German date with leading zeros not working properly when no year given #902

Open SiKreuz opened 3 years ago

SiKreuz commented 3 years ago

Reproduction

I'm using this call to parse a German date without a year:

import dateparser

dateparser.parse('13.01.', languages=['de'])

What I get back is a datetime object with the current date and time, with no reference to the date string I passed. If I leave out the leading zero, it works. If I add a year, it works. So the following calls are parsed properly:

dateparser.parse('13.1.', languages=['de'])
dateparser.parse('13.01.2001', languages=['de'])

The MM/DD format also works fine for me, but DD.MM does not.

System

Gallaecio commented 3 years ago

With master:

>>> dateparser.parse('13.01.', languages=['de'])
datetime.datetime(2021, 4, 5, 13, 1)

So it seems it is picked as time, rather than date, which does not seem too wrong.
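To illustrate why a string such as '13.01.' is ambiguous, here is a minimal sketch that checks which readings a two-number, dot-separated string admits. The function name `plausible_readings` is hypothetical, and this is not how dateparser decides internally. It only shows that both the HH.MM and the DD.MM interpretations are valid for this input:

```python
def plausible_readings(text):
    """Return which readings ('time', 'date') a 'NN.NN' string admits.

    Illustrative only: this is NOT dateparser's internal logic.
    """
    # Drop surrounding whitespace and any trailing dots, then split on the dot.
    parts = text.strip().rstrip(".").split(".")
    if len(parts) != 2 or not all(p.isdecimal() for p in parts):
        return []
    first, second = int(parts[0]), int(parts[1])
    readings = []
    if 0 <= first <= 23 and 0 <= second <= 59:
        readings.append("time")  # HH.MM
    if 1 <= first <= 31 and 1 <= second <= 12:
        readings.append("date")  # DD.MM (month lengths are not checked)
    return readings


print(plausible_readings("13.01."))  # both readings are possible
print(plausible_readings("25.12."))  # only a date: 25 is not a valid hour
```

Any parser that accepts a dot as a time separator has to pick one of the two readings for inputs in the overlap, which is what happens here.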

SiKreuz commented 3 years ago

Oh, I didn't see this. Well, usually you don't write the time with dots in German, which is why this is confusing. But OK, this is fine for me. Is there any way to force parsing as a date rather than as a time?

Gallaecio commented 3 years ago

I wonder if one of the settings can help here. Otherwise, maybe we should have a way to indicate that only a time or a date is expected, as opposed to a date and time.

SiKreuz commented 3 years ago

Well, I tried some settings, but couldn't get my desired behavior. So it would be really nice to have the option you mentioned.

noviluni commented 3 years ago

Hi @SiKreuz

There's a workaround to get the desired result, but before that:

"Well, usually you don't write the time with dots in German,"

This surprises me, as this feature (hours with a period as separator) was introduced because of German :thinking:. Check this:

https://github.com/scrapinghub/dateparser/issues/643


I also saw this in Spanish (and it seems that it's also used in Finnish), so I also think it's acceptable. However, I also heard from other people (like Russians) that it's not correct at all and when they have "XX.YY" it always refers to a date, so maybe we have to rethink this approach. Adding a new setting, perhaps?

In case anyone wants to work on this, you can check here how this was implemented: https://github.com/scrapinghub/dateparser/pull/741/files


Workaround:

If you are trying to fix this in your particular project, you can use the date_formats parameter:

>>> dateparser.parse('13.01', languages=['de'], date_formats=['%d.%m'])
datetime.datetime(2021, 1, 13, 0, 0)

However, if you also want to be able to parse "13.01." (with a trailing dot), you should add a format for it too:

>>> dateparser.parse('13.01.', languages=['de'], date_formats=['%d.%m', '%d.%m.'])
datetime.datetime(2021, 1, 13, 0, 0)

Let me know if it works for you :slightly_smiling_face:

SiKreuz commented 3 years ago

Hi @noviluni

The usual way to write the time in German is with a colon. Of course, you can also write it with a dot and everyone will understand, but the colon is more common.

Obviously I can do it with your workaround. But my software uses multiple languages, and it would be nice not to have to switch the date format manually according to the current language. An option to define whether I want to parse a date or a time would be awesome. For now I'll go with the workaround. Thank you!
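Until such an option exists, one language-independent way to prefer the date reading is to check for the "DD.MM" shape before handing the string to the parser. The following is a minimal sketch under that assumption. The helper `parse_day_month` and its `year` and `fallback` parameters are hypothetical and not part of dateparser:

```python
import re
from datetime import date, datetime

# Matches "13.01", "13.01." and "13.1." (day.month with an optional trailing dot).
_DAY_MONTH = re.compile(r"^\s*(\d{1,2})\.(\d{1,2})\.?\s*$")


def parse_day_month(text, year=None, fallback=None):
    """Hypothetical helper: read 'DD.MM[.]' as a date, else defer to `fallback`.

    `year` defaults to the current year; `fallback` is any callable that takes
    the original string (for example a wrapper around the regular parser).
    """
    match = _DAY_MONTH.match(text)
    if match:
        day, month = int(match.group(1)), int(match.group(2))
        try:
            return datetime(year or date.today().year, month, day)
        except ValueError:
            pass  # e.g. "13.45" is not a valid day.month; let the fallback decide
    return fallback(text) if fallback is not None else None


print(parse_day_month("13.01.", year=2021))  # 2021-01-13 00:00:00
```

Passing something like `fallback=lambda s: dateparser.parse(s, languages=['de'])` would leave every other input to dateparser, so strings such as "13.45" would still be read as a time there.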