scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License
2.5k stars 466 forks source link

Fix date_parser with prefer_month_of_year wrong results #1224

Closed benbuzbee closed 2 months ago

benbuzbee commented 2 months ago

Fix two problems

  1. Parser would use current month even if prefer_month_of_year was not current when relative_base was not none

  2. Parser would use current month to derive 'what is the last day of this month' - for example, with prefer_month=last and prefer_day=last, but current_month=april, it would return december 30th, because it would use april to find that the last day was the 30th, when it should have used December (#1217)

Additionally, add a test to test_date_parser that uses prefer_month

Fixes #1217

benbuzbee commented 2 months ago

Tagging folks here who were involved in the original PR https://github.com/scrapinghub/dateparser/pull/1146

@adnan-awan @Gallaecio @serhii73 could I have your comments and reviews please

Gallaecio commented 2 months ago

Could you have a look at the 2 failing tests?

benbuzbee commented 2 months ago

Certainly @Gallaecio - sorry I missed that locally - tox / all tests do not run well on Windows where I developed this so I just ran select tests and expected to check all of them in CI

In the first case (test_dates_parse_utc_offset_does_not_throw), it is parsing "0:4", French, with settings

                "PREFER_DATES_FROM": "past",
                "PREFER_DAY_OF_MONTH": "first",
                "PREFER_LOCALE_DATE_ORDER": True,
                "PREFER_MONTH_OF_YEAR": "current",
                "RELATIVE_BASE": datetime(
                    year=1970, month=1, day=1, hour=0, minute=0, second=0
                ),

it used to expect to get expected_date=datetime(1969, 12, 31, 14, 4) but after my change it gets datetime(1969, 1, 31, 14, 4) I would argue that with PREFER_MONTH_OF_YEAR set to "Current", and "Current" being January 1st 1970, that datetime(1969, 1, 31, 14, 4) is a better result However with this particular set of configuration, I am not exactly 100% sure what to expect. These settings were generated by a fuzzer so perhaps they don't really make a ton of sense together anyway; rather than change the settings (and thus deviate from what the parser caught) I have opted to update the test expectation to accept January. CC @ennamarie19 who made that change in case I missed something.

The next one is even more perplexing, it is searching a German string for dates and asserting that when it finds the word "Die" in the string, it should be parsed as datetime.datetime(1999, 12, 28, 0, 0) Similarly, my change makes this datetime.datetime(1999, 1, 28, 0, 0) instead. I don't speak German, but as far as I can tell "Die" just means "The" so I have no idea why it is even matching it. In my opinion, this could be a bug with the search identifying a non-date word, and so I can't really guess as to what a sensible result would be. For the sake of simplicity, I also just updated this test to accept January. CC @eszakharova who added this in https://github.com/scrapinghub/dateparser/commit/7e7e22417e71e103225a2fdf05e68a6e67db041c