scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

Parsing DMY, then try YMD - not possible #412

Open elgehelge opened 6 years ago

elgehelge commented 6 years ago

It seems to be impossible to get the following behaviour:

parse("01-02-03") == datetime.datetime(2003, 2, 1, 0, 0)
parse("2003-02-01") == datetime.datetime(2003, 2, 1, 0, 0)

The problem is that whenever the first line evaluates to true, the second line produces a date where day and month are swapped (like so: datetime.datetime(2003, 1, 2, 0, 0)).

This is the preferred way of parsing dates in all northern European countries, btw.

elgehelge commented 6 years ago

I think the core of the problem is this behaviour:

>>> dateparser.parse("2003-02-01", settings={'DATE_ORDER': 'DMY'})
datetime.datetime(2003, 1, 2, 0, 0)

asadurski commented 6 years ago

Thank you for reporting that, @elgehelge. I have reproduced the problem. As a temporary workaround, you can use the date_formats argument, as described in http://dateparser.readthedocs.io/en/latest/index.html#usage:

parse("2003-02-01", date_formats=["%Y-%m-%d"])

The choice of MDY as the default format has been discussed before. I'll bring back the core argument (backed by general observations, not data): the default date settings on most web servers are still English (United States).

elgehelge commented 6 years ago

The reason why I would like to use the dateparser library is because I don't know the date format. Your workaround will fail on the first example ("01-02-03").

asadurski commented 6 years ago

The date_formats list can be extended, so:

parse("01-02-03", date_formats=["%Y-%m-%d", "%d-%m-%y"])

But I understand that if you have a multitude of formats, a setting that doesn't work is a serious limitation.

elgehelge commented 6 years ago

Let me just give you a little more insight into the problem. I figured out that pandas would actually work for my use case:

import pandas
parse = lambda string: pandas.to_datetime(string, dayfirst=True).to_pydatetime()

However, as it turns out, Sweden writes dates differently: they prefer YMD over DMY. Neither pandas nor dateutil was able to handle this use case using dayfirst and yearfirst. So I guess what I really need is a monthinmiddle setting. Put differently, dayfirst and yearfirst are a "broken" interface in my opinion, so don't go down that road.
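
To make the monthinmiddle idea concrete, here is a minimal sketch using only the standard library. It is not how dateparser, pandas, or dateutil work; the function name parse_month_in_middle and the pivot_year parameter are made up for illustration. It assumes purely numeric dates with three parts, reads the middle part as the month, and treats a four-digit first part as the year.

```python
import re
from datetime import datetime

# Three numeric groups separated by '-', '/' or '.'.
_NUMERIC_DATE = re.compile(r"^\s*(\d{1,4})[-/.](\d{1,2})[-/.](\d{1,4})\s*$")


def parse_month_in_middle(text, pivot_year=2000):
    """Parse D-M-Y or Y-M-D numeric dates, always reading the middle field as the month.

    Hypothetical helper for illustration; not part of any library.
    """
    match = _NUMERIC_DATE.match(text)
    if match is None:
        raise ValueError(f"not a numeric three-part date: {text!r}")
    first, month, last = match.groups()

    if len(first) == 4:
        # A four-digit first field can only be a year: Y-M-D.
        year, day = int(first), int(last)
    else:
        # Otherwise assume D-M-Y; this is an assumption, not a certainty.
        day, year = int(first), int(last)
        if len(last) <= 2:
            # Two-digit years are shifted by pivot_year (an illustrative choice).
            year += pivot_year
    # datetime() raises ValueError for impossible dates such as month 13.
    return datetime(year, int(month), day)
```

With this sketch, both "01-02-03" and "2003-02-01" come out as 1 February 2003. A two-digit Swedish-style "03-02-01" would still be read as DMY, so the short form stays ambiguous.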

crusaderky commented 3 years ago

Any update on this issue? It seems major to me.

I need to mass parse UK dates, which are DMY (and there's no locale), but I can't if that breaks ISO dates. People with a date format set in their locale flat out can't read ISO dates without disabling the locale first:

>>> dateparser.parse('le 2000-01-02')
datetime.datetime(2000, 2, 1, 0, 0)

fish-face commented 2 years ago

I also have this issue and can confirm it is still present in 1.1.0. When you don't know the format, which is often the case, a monthmiddle setting seems much more relevant than specifying a date order. One alternative would be to accept a list of orders to try. Another is that, if DATE_ORDER is specified, parsing should fail when that order cannot be applied, so that users of the library can detect this and implement their own fallback.
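
A rough sketch of the "list of orders to try" alternative, written against the standard library rather than dateparser. The ORDER_FORMATS table and parse_with_orders are hypothetical names, and only hyphen-separated numeric dates are covered. Each order is applied strictly, and a ValueError is raised when none fits, so the caller can add their own fallback.

```python
from datetime import datetime

# Hypothetical mapping from an order name to strptime formats; not part of dateparser.
ORDER_FORMATS = {
    "DMY": ("%d-%m-%Y", "%d-%m-%y"),
    "YMD": ("%Y-%m-%d", "%y-%m-%d"),
    "MDY": ("%m-%d-%Y", "%m-%d-%y"),
}


def parse_with_orders(text, orders):
    """Return the first strict parse of `text` under the given orders, else raise ValueError."""
    for order in orders:
        for fmt in ORDER_FORMATS[order]:
            try:
                # strptime rejects input that does not match the format exactly.
                return datetime.strptime(text, fmt)
            except ValueError:
                continue
    raise ValueError(f"{text!r} matches none of the orders {list(orders)}")
```

Calling parse_with_orders(text, ["DMY", "YMD"]) gives 1 February 2003 for both "01-02-03" and "2003-02-01", because the ISO string fails the DMY formats and falls through to YMD. Passing only ["MDY"] for "2003-02-01" raises ValueError instead of silently swapping fields.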