scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License
2.56k stars 465 forks source link

Fixing current_period logic for weekdays #1118

Closed EtorixDev closed 1 year ago

EtorixDev commented 1 year ago

When the current_period setting is used the current period for weekdays should be treated as the current week.

Test Date: 2023-01-09 00:00:00 (Monday) Test Input: dateparser.parse("Thursday", settings={'PREFER_DATES_FROM': 'current_period'}) Current Output: 2023-01-05 00:00:00 (Previous Week) Desired Output: 2023-01-12 00:00:00 (Current Week)

Gallaecio commented 1 year ago

Could you extend tests to cover the change?

EtorixDev commented 1 year ago

Could you extend tests to cover the change?

I apologize, I'm quite new to using github. Could you expand on what more I need to do, and how/where to do it?

Gallaecio commented 1 year ago

Please, have a look at https://dateparser.readthedocs.io/en/latest/contributing.html#get-started, which covers running tests locally with tox.

Once you have run the tests locally, you will see if your change introduced unexpected changes. If not, the idea is that you extend the test code in the tests/ folder with new expectations that verify that your changes here work as intended, and will cause an error is we accidentally revert or break your change here.

EtorixDev commented 1 year ago

I have looked at the contributing guidelines, but they seem a bit sparse on explanations. Perhaps that's just my lack of knowledge at play though.

Tox fails due to fasttext failing. I've never dealt with tox before, but I know I do have fasttext installed, so I'm not sure why this is occurring. Also, I'm not sure which test file I should be editing, how to format my tests, or how how to actually do the testing with the test file.

I just wanted to propose a small logic change, but perhaps I've stepped into territory out of my league at the moment. I'm content with maintaining my change on my fork for personal use for now. If someone else would like to take over if they believe my changes are worth implementing, please do. If I should move this to an issue as suggested feedback instead, let me know. Thanks.

EtorixDev commented 9 months ago

Please, have a look at https://dateparser.readthedocs.io/en/latest/contributing.html#get-started, which covers running tests locally with tox.

Once you have run the tests locally, you will see if your change introduced unexpected changes. If not, the idea is that you extend the test code in the tests/ folder with new expectations that verify that your changes here work as intended, and will cause an error is we accidentally revert or break your change here.

@Gallaecio

Hello, I've since gotten more familiar with the technology being used here and can run tox, so I'd like to look into this issue again.

You mention "if your change introduced unexpected changes" so I'd like clarification on how to proceed with my changes before making a new PR.

On a fresh clone of the repo with no changes made I get 9 failed tests. With my changes proposed here I get 30 failed, so 21 caused by my changes. Looking just at those, it makes sense they fail. That would be expected with how my change modifies the functionality in these 21 cases.

Example:

param("Friday", datetime(2012, 11, 9)),
settings={"NORMALIZE": False, "RELATIVE_BASE": datetime(2012, 11, 13)}
FAILED tests/test_date_parser.py::TestDateParser::test_dates_parsing_004_Friday - AssertionError: datetime.datetime(2012, 11, 9, 0, 0) != datetime.datetime(2012, 11, 16, 0, 0)

2012, 11, 13 is Tuesday November 13, 2012.

The current implementation of current_period treats the current period as the past 7 days, so the current implementation goes back 4 days from Tuesday to the previous Friday. This is almost identical to how past works.

My updated implementation of current_period treats the current week (as defined in parser.py as ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']) as the current period. So it goes forward 3 days from Tuesday to the current Friday.

. . . Important to note that since I made this PR I modified the logic. I've also restructured my fork to use branches properly. Here's my current version of the code I'm looking to change: https://github.com/EtorixDev/dateparser/blob/current_period/dateparser/parser.py#L507

Gallaecio commented 9 months ago

This is almost identical to how past works.

Almost or identical? If identical, it sounds like you are fixing a bug. If almost, it sound like we need to have extra keys to cover a behavior that is neither past nor current.

EtorixDev commented 9 months ago

Almost or identical? If identical, it sounds like you are fixing a bug. If almost, it sound like we need to have extra keys to cover a behavior that is neither past nor current.

Almost identical. Here's a concrete example.

import dateparser
import datetime

settings_one={
    'PREFER_DATES_FROM': 'past', 
    'RELATIVE_BASE': datetime.datetime(2024, 2, 7)
}
settings_two={
    'PREFER_DATES_FROM': 'current_period', 
    'RELATIVE_BASE': datetime.datetime(2024, 2, 7)
}

# RELATIVE_BASE is Wednesday February 7th, 2024
print("PREFER_DATES_FROM: past")
print(dateparser.parse('Monday', settings=settings_one))
print(dateparser.parse('Tuesday', settings=settings_one))
print(dateparser.parse('Wednesday', settings=settings_one))
print(dateparser.parse('Thursday', settings=settings_one))
print(dateparser.parse('Friday', settings=settings_one))
print(dateparser.parse('Saturday', settings=settings_one))
print(dateparser.parse('Sunday', settings=settings_one))
print()
print("PREFER_DATES_FROM: current_period")
print(dateparser.parse('Monday', settings=settings_two))
print(dateparser.parse('Tuesday', settings=settings_two))
print(dateparser.parse('Wednesday', settings=settings_two))
print(dateparser.parse('Thursday', settings=settings_two))
print(dateparser.parse('Friday', settings=settings_two))
print(dateparser.parse('Saturday', settings=settings_two))
print(dateparser.parse('Sunday', settings=settings_two))

This outputs the following:

PREFER_DATES_FROM: past
2024-02-05 00:00:00
2024-02-06 00:00:00
2024-01-31 00:00:00
2024-02-01 00:00:00
2024-02-02 00:00:00
2024-02-03 00:00:00
2024-02-04 00:00:00

PREFER_DATES_FROM: current_period
2024-02-05 00:00:00
2024-02-06 00:00:00
2024-02-07 00:00:00
2024-02-01 00:00:00
2024-02-02 00:00:00
2024-02-03 00:00:00
2024-02-04 00:00:00

As you can see, the results are identical except for the day of the relative base.

past parses as the 31st to the 6th.

current_period parses as the 1st to the 7th.

Currently past operates as "the 7 days before the relative base" and current_period operates as "the 6 days before the relative base and itself." past is exclusive and current_period is inclusive.

My argument is that current_period should instead parse as the 5th to the 11th, aka the current week of Monday to Sunday.

Now you bring up the point this could instead be implemented as a new key rather than changing the current_period behavior. I'm not sure if that is the better route here, but to me it just doesn't make sense how current_period currently operates based on its name.

Gallaecio commented 9 months ago

You say you don’t get why current_period works that way… Well, I could understand if current_period meant “current month”. But why does “past” give you a date in the future (6th)? That seems like a bug to me.

To me it seems we need to look into past as potentially bugged, and consider maybe splitting current_period into multiple current_<period> options (current_year, current_week…).

I am brain dumping out loud here. The point is, I agree that output seems wrong, I am just not sure what the right fixes should be.

EtorixDev commented 9 months ago

It outputs the 6th because the relative base is the 7th. The middle of the week, Wednesday.

EtorixDev commented 9 months ago

I've just tested it with months as per your reply above and the results are interesting. Whereas for days of the week past is exclusive, for months past is inclusive.

import dateparser
import datetime

settings_one={
    'PREFER_DATES_FROM': 'past', 
    'RELATIVE_BASE': datetime.datetime(2024, 3, 1)
}
settings_two={
    'PREFER_DATES_FROM': 'current_period', 
    'RELATIVE_BASE': datetime.datetime(2024, 3, 1)
}

# RELATIVE_BASE is Friday March 1st, 2024
print("PREFER_DATES_FROM: past")
print(dateparser.parse('January', settings=settings_one))
print(dateparser.parse('February', settings=settings_one))
print(dateparser.parse('March', settings=settings_one))
print(dateparser.parse('April', settings=settings_one))
print(dateparser.parse('May', settings=settings_one))
print(dateparser.parse('June', settings=settings_one))
print(dateparser.parse('July', settings=settings_one))
print()
print("PREFER_DATES_FROM: current_period")
print(dateparser.parse('January', settings=settings_two))
print(dateparser.parse('February', settings=settings_two))
print(dateparser.parse('March', settings=settings_two))
print(dateparser.parse('April', settings=settings_two))
print(dateparser.parse('May', settings=settings_two))
print(dateparser.parse('June', settings=settings_two))
print(dateparser.parse('July', settings=settings_two))
PREFER_DATES_FROM: past
2024-01-01 00:00:00
2024-02-01 00:00:00
2024-03-01 00:00:00
2023-04-01 00:00:00
2023-05-01 00:00:00
2023-06-01 00:00:00
2023-07-01 00:00:00

PREFER_DATES_FROM: current_period
2024-01-01 00:00:00
2024-02-01 00:00:00
2024-03-01 00:00:00
2024-04-01 00:00:00
2024-05-01 00:00:00
2024-06-01 00:00:00
2024-07-01 00:00:00

Here, the relative base is March of 2024. Per how it works with days, it would be expected that "March" parses as March of 2023, but it is instead inclusive and parses as March of 2024.