scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

Is there a way to ignore numbers unless they are epochs? #713

Closed johntmyers closed 4 years ago

johntmyers commented 4 years ago

Currently just about any number is parsed as a date. Is there a way to ignore bare numbers? The strict setting seems to have no effect.

Example:

In [65]: search_dates("hello 4000")
Out[65]: [('4000', datetime.datetime(4000, 6, 18, 0, 0))]

noviluni commented 4 years ago

Hi @johntmyers !

What about using the PARSERS setting?

Example:

>>> search_dates("hello 4000", settings={'PARSERS': ['timestamp']})

>>> search_dates("hello 1592498315", settings={'PARSERS': ['timestamp']})
[('1592498315', datetime.datetime(2020, 6, 18, 18, 38, 35))]

Let me know if this works for you. :slightly_smiling_face:
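
For reference, the timestamp result above can be cross-checked with the standard library alone. The sketch below converts the same epoch value explicitly in UTC. The 18:38:35 shown in the thread is two hours later than the UTC value, which presumably reflects the local timezone of the machine that ran it (an assumption, since the thread does not say).

```python
from datetime import datetime, timezone

# Epoch value taken from the example above.
epoch = 1592498315

# Convert explicitly in UTC so the result does not depend on the local timezone.
as_utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
print(as_utc.isoformat())  # 2020-06-18T16:38:35+00:00
```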

johntmyers commented 4 years ago

Yes, but unfortunately I lose other matching. I am testing with ["timestamp", "absolute-time", "base-formats"], and it appears that absolute-time is the parser that matches bare numbers. It also handles other inputs that I still want parsed, so I can't simply drop it.

noviluni commented 4 years ago

Hi @johntmyers!

I spent some time looking into this, and it is a bug.

This:

>>> dateparser.parse("4000", settings={'PARSERS': ['absolute-time'], 'STRICT_PARSING': True})
datetime.datetime(1900, 1, 1, 4, 0)

shouldn't return anything, but it does.

The reason is that the absolute-time parser tries several strategies and, if none of them works, falls back to parsing the input as a date written without spaces. That fallback does not check the STRICT_PARSING setting, so it returns a result from the format '%H%M%S'.
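
The '%H%M%S' match can be reproduced with the standard library, independent of dateparser. This is a minimal sketch of that one format only, not dateparser's actual code path: `strptime` accepts a single-digit hour, so "4000" is read as hour 4, minute 00, second 0, and the missing date fields default to 1900-01-01.

```python
from datetime import datetime

# "4000" cannot start with a two-digit hour (40 > 23), so %H consumes "4",
# %M consumes "00", and %S consumes the remaining "0".
parsed = datetime.strptime("4000", "%H%M%S")
print(parsed)  # 1900-01-01 04:00:00
```

This matches the datetime.datetime(1900, 1, 1, 4, 0) result shown above.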

I have just opened a new draft PR (https://github.com/scrapinghub/dateparser/pull/715) addressing this, and the fix will be released in the next version. One of our goals for the upcoming version is to check that all the settings work properly, as there seem to be some edge cases where they are not applied.

I'm sorry, but I don't know of any temporary workaround.

Thank you for your feedback.

johntmyers commented 4 years ago

Glad you found it! Thanks for addressing it so quickly!

noviluni commented 4 years ago

Hi @johntmyers, we have split the old absolute-time parser into two parsers, absolute-time and no-spaces-time, and disabled the second one by default. As a result, in the next version you will be able to handle this case correctly by setting STRICT_PARSING to True.

Example:


>>> search_dates('hallo 4000')
[('4000', datetime.datetime(4000, 9, 21, 0, 0))]

>>> search_dates('hallo 4000', settings={'STRICT_PARSING': True})