scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

dateparser not able to parse things like next tuesday. #573

Open saroup opened 5 years ago

saroup commented 5 years ago

It parses "Tuesday" to the Tuesday of the current week, but when the input is "next Tuesday" it returns None.

tannercollin commented 5 years ago

I'm having the same issue except it's returning Tuesday of the previous week:

>>> parse('now').strftime('%a %Y-%m-%d')
'Mon 2019-10-21'
>>> parse('tuesday').strftime('%a %Y-%m-%d')
'Tue 2019-10-15'
>>> parse('next tuesday').strftime('%a %Y-%m-%d')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'strftime'

GargiVyas31 commented 4 years ago

Hi, I am Gargi Vyas. I am a GSoC 2020 candidate and would like to work on this bug.

tannercollin commented 4 years ago

That'd be awesome, Gargi. Thanks!

aditya-hari commented 4 years ago

@Gallaecio, @noviluni I have looked through the code and I think I understand the gist of it at this point. What would be the recommended way to tackle this? I would appreciate some suggestions to get started. A separate function in FreshnessDateDataParser, maybe?

Gallaecio commented 4 years ago

@aditya-hari Go ahead and propose an approach in a pull request. It’s easier to discuss over code :slightly_smiling_face:

aditya-hari commented 4 years ago

@Gallaecio I haven't really come up with anything concrete in code yet, can't open a pull request.

Things like 'next tuesday' aren't identified with any locale, so some changes have to be made to the locale info to sort that out. I am not entirely sure how to do that, though.

I thought about just changing the date_string to something standard like "in x days/months" but that will obviously only work for English if implemented in that way.
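A minimal sketch of that rewriting idea, using only the standard library. The helper name `rewrite_next_weekday` is hypothetical and not part of dateparser, and the sketch assumes "next <weekday>" means the first such weekday strictly after the base date. As noted above, it only handles English.

```python
import re
from datetime import date

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


# Hypothetical helper, not part of dateparser.
def rewrite_next_weekday(date_string, base):
    """Rewrite 'next <weekday>' as 'in N day(s)' relative to `base` (English only)."""
    pattern = r"\bnext\s+(%s)\b" % "|".join(WEEKDAYS)
    match = re.search(pattern, date_string, re.IGNORECASE)
    if match is None:
        return date_string  # nothing to rewrite
    target = WEEKDAYS.index(match.group(1).lower())
    # Days until the first matching weekday strictly after `base` (1..7).
    days = (target - base.weekday() - 1) % 7 + 1
    unit = "day" if days == 1 else "days"
    replacement = "in %d %s" % (days, unit)
    return date_string[:match.start()] + replacement + date_string[match.end():]


# 2019-10-21 was a Monday, so the following Tuesday is one day away.
print(rewrite_next_weekday("next tuesday", date(2019, 10, 21)))  # in 1 day
```

The rewritten string could then be handed to the existing relative-date parsing, which is why this only helps for languages where such a rewrite rule has been written.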

Gallaecio commented 4 years ago

@aditya-hari I suggest you start from FreshnessDateDataParser.parse, go through what the code does keeping the target strings in mind (e.g. “Next Tuesday”), and make the required changes as you go. I see for example that ago and in are hardcoded in some parts, I guess you will need to add next there.

You could add a test for “Next Tuesday”, extend FreshnessDateDataParser.parse as needed until it is parsed successfully, and then make sure no other tests are broken after your changes.

aditya-hari commented 4 years ago

@Gallaecio

Sorry it is taking me this long. I have something sort of working and will hopefully open a PR soon. However, the way I am doing this won't be able to handle the "after 15 days" situation mentioned in #635.

Gallaecio commented 4 years ago

Not a problem. It’s OK to just fix “Next ” for now, we can improve things later with additional, separate changes.

jgtimestuff commented 4 years ago

There are a lot of time-related translations available in the unicode-cldr XML or JSON files that could definitely be used to augment dateparser.py so that it handles all sorts of variations like 'Next Tuesday'. Of course, I'd also like to see something that covers 'Next Weekend' or 'on the weekend'... but it doesn't look like that's been defined yet.

Anyway, what would it take to pull in the cldr datefields for each language and incorporate them?

https://github.com/unicode-cldr/cldr-dates-full/blob/master/main/ru/dateFields.json

jgtimestuff commented 4 years ago

Okay, just saw this issue from 2 years ago: "cldr_language_data | move data directory | 2 years ago". So, is it just that the cldr_language_data needs updating to include more variations of 'next'?

My apologies... it seems there is a script to do just this already in the code:

487 CLDR script update - https://github.com/scrapinghub/dateparser/compare/cldr-script-update

Is the current dictionary up to date or is it just that the existing code isn't calling things like 'next' that already exist in the code?

jgtimestuff commented 4 years ago

In freshness_date_parser, I think we need to add something from calendar to get the right day of the week?

    td = relativedelta(**kwargs)

For 'next' + day of week, do the relativedelta arguments need to add a day first, and then check the calendar for the next matching weekday?

>>> import calendar
>>> import datetime
>>> from dateutil import relativedelta as rld
>>> today = datetime.datetime.now()  # happens to be a Friday
>>> today + rld.relativedelta(weekday=calendar.FRIDAY)
datetime.datetime(2020, 3, 20, 8, 55, 7, 615746)  [today, instead of next Friday]

So we have to add a day to today, then look for the next Friday:

>>> today + rld.relativedelta(days=+1)
datetime.datetime(2020, 3, 21, 8, 55, 7, 615746)
>>> today = today + rld.relativedelta(days=+1)
>>> today + rld.relativedelta(weekday=calendar.TUESDAY)
datetime.datetime(2020, 3, 24, 8, 55, 7, 615746)  [next Tuesday]
>>> today + rld.relativedelta(weekday=calendar.FRIDAY)
datetime.datetime(2020, 3, 27, 8, 55, 7, 615746)  [next Friday]
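For comparison, the same "skip today, then roll forward to the weekday" step can be sketched with the standard library alone. The helper name `next_weekday` is hypothetical; it is not part of dateparser or dateutil, and it assumes "next" means the first matching weekday strictly after the base date.

```python
import calendar
from datetime import datetime, timedelta


# Hypothetical helper, not part of dateparser or dateutil.
def next_weekday(base, weekday):
    """Return the first `weekday` (0=Monday .. 6=Sunday) strictly after `base`."""
    start = base + timedelta(days=1)         # skip today
    ahead = (weekday - start.weekday()) % 7  # 0..6 days counted from tomorrow
    return start + timedelta(days=ahead)


friday = datetime(2020, 3, 20, 8, 55, 7)  # a Friday, as in the session above
print(next_weekday(friday, calendar.FRIDAY))   # 2020-03-27 08:55:07
print(next_weekday(friday, calendar.TUESDAY))  # 2020-03-24 08:55:07
```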

Gallaecio commented 4 years ago

https://dateparser.readthedocs.io/en/latest/contributing.html#guidelines-for-editing-translation-data might shed some light

jgtimestuff commented 4 years ago

Thanks. After reading through that link, it seems to be about extending linguistic terms beyond what is provided by the CLDR JSON files. It looks to me like the JSON files from CLDR were last imported into dateparser in 2018, and they seem to have far fewer options for relative terms (in English as well as in all other languages) than what is currently available. Updating them might help fix the 'next weekday' issue...

Supplementing that data with 'weekend', though, would definitely fall under extending the terms, as the files don't seem to cover terms like 'weekend' or perhaps even 'fortnight' (used by Aussies), etc.

jgtimestuff commented 4 years ago

Ahh wait, now I see that this script looks at the CLDR but only chooses a subset of the available relative terms to transition to dateparser.

https://github.com/scrapinghub/dateparser/blob/master/scripts/get_cldr_data.py

Gallaecio commented 4 years ago

So we might want to extend that subset as needed by changing the download script, and re-running.
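As a rough illustration of what "extending that subset" could look like, the sketch below collects relative phrases from a small, hand-written excerpt. The key layout (`main` / language / `dates` / `fields`, with `relative-type-N` keys) is an assumption about how CLDR's dateFields.json is organized and should be checked against the real file. The helper `relative_terms` is hypothetical and is not taken from scripts/get_cldr_data.py.

```python
# Illustrative excerpt only; the key names are assumed to follow the layout
# of CLDR's dateFields.json and should be verified against the real data.
sample = {
    "main": {"en": {"dates": {"fields": {
        "tue": {
            "relative-type--1": "last Tuesday",
            "relative-type-0": "this Tuesday",
            "relative-type-1": "next Tuesday",
        },
        "day": {
            "relative-type--1": "yesterday",
            "relative-type-0": "today",
            "relative-type-1": "tomorrow",
        },
    }}}}
}


# Hypothetical helper, not taken from scripts/get_cldr_data.py.
def relative_terms(data, language):
    """Map each phrase to (field, offset), e.g. 'next tuesday' -> ('tue', 1)."""
    prefix = "relative-type-"
    terms = {}
    fields = data["main"][language]["dates"]["fields"]
    for field, entries in fields.items():
        for key, phrase in entries.items():
            if key.startswith(prefix):
                # The text after the prefix is the signed offset: -1, 0 or 1.
                terms[phrase.lower()] = (field, int(key[len(prefix):]))
    return terms


print(relative_terms(sample, "en")["next tuesday"])  # ('tue', 1)
```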

thinow commented 3 years ago

Hi everyone, thanks for looking at this issue. There have been no updates for more than a year. Is there any workaround we could use?

anarcat commented 2 years ago

my workaround is to load both dateparser and parsedatetime and use the latter when the former fails. :)
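A minimal sketch of that fallback pattern. The helper `parse_with_fallback` and the two stand-in parsers are hypothetical; they exist only so the example runs without third-party packages. It assumes each parser signals failure by returning None, which is what this issue reports for dateparser.

```python
from datetime import datetime


# Hypothetical helper; the parser callables are supplied by the caller.
def parse_with_fallback(text, parsers):
    """Try each parser in order; return the first non-None result, else None."""
    for parser in parsers:
        try:
            result = parser(text)
        except Exception:
            result = None  # treat a crashing parser like a failed parse
        if result is not None:
            return result
    return None


# Stand-in parsers so the sketch runs on its own.
def primary(text):
    return None  # mimics the None reported in this issue for 'next tuesday'


def secondary(text):
    return datetime(2019, 10, 22) if text == "next tuesday" else None


print(parse_with_fallback("next tuesday", [primary, secondary]))  # 2019-10-22 00:00:00
```

In real use the first callable would be dateparser's `parse`, and the second a small wrapper around whichever other library is preferred, adapted so that it also returns a datetime or None.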

tannercollin commented 2 years ago

This is my workaround:

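import re
from datetime import datetime, timedelta

# `time` is the input string, e.g. 'next tuesday at noon'.
# `daystr` is a formatting helper defined elsewhere (not shown here).
days_long = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']
for day in days_long:
    print('trying to find:', day)
    if day in time:
        print('found', day)
        # Walk forward from tomorrow until the weekday name matches.
        delta = 1
        while day not in (datetime.now() + timedelta(days=delta)).strftime('%A').lower():
            delta += 1
            print('delta:', delta)
            if delta > 14:
                raise RuntimeError('weekday not found')  # just to make sure
        if re.findall(r'\d|noon|midnight', time):
            # The input also carries a time of day, so substitute an ISO date.
            date = (datetime.now() + timedelta(days=delta)).strftime('%Y-%m-%d')
        else:
            date = daystr(datetime.now() + timedelta(days=delta))
        print('date:', date)
        time = time.replace(day, date).replace('next', '')
        print('time:', time)
        break  # only first match
    else:
        print('not found')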

It's janky, but it works.