scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

False positives when searching dates #582

Open g-kozulis opened 4 years ago

g-kozulis commented 4 years ago

OS: Windows 10.0.17763.805
dateparser version: 0.7.2

When using the search_dates() function, some combinations of digits and punctuation marks that don't resemble any date format I've ever seen get picked up as dates.

To reproduce run the following code and replace <false positive> with any one of the following:

from dateparser.search import search_dates

search_dates("The following isn't a correct date <false positive>")

murray-minito commented 4 years ago

Same here on OSX 10.15 with version 0.7.2.

Here are some examples of results that should not be dates:

search_dates(text, languages=['en'], settings={'STRICT_PARSING': True, 'PREFER_DATES_FROM': 'past', 'DATE_ORDER': 'DMY'}, add_detected_language=True)

-- Clearly wrong

('32° 34’S', datetime.datetime(2013, 10, 16, 23, 59, 7), 'en')
('123°', datetime.datetime(1900, 1, 1, 1, 2, 3), 'en')
('6005', datetime.datetime(2000, 6, 5, 0, 0), 'en')
('000', datetime.datetime(1900, 1, 1, 0, 0), 'en')
('of 629', datetime.datetime(1900, 1, 1, 6, 2, 9), 'en')
('>21', datetime.datetime(1900, 1, 1, 2, 1), 'en')

-- I can kind of see where it is getting this, but I think it is wrong to do it

('3533', datetime.datetime(2033, 5, 3, 0, 0), 'en')

-- I have lots of numbers in these docs. It should not pick them up and 'make' a date from them

('538400', datetime.datetime(8400, 3, 5, 0, 0), 'en')
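Until this is addressed in the library, one possible workaround is to post-filter the tuples that search_dates() returns. The sketch below is only a heuristic and not part of dateparser: the helper names `looks_like_date` and `drop_false_positives` are hypothetical, and the rule (keep a fragment only if it contains a word of three or more letters, or two digits joined by a common date/time separator) is an assumption that will also discard some genuine matches, such as a bare year.

```python
import re
from datetime import datetime

# Heuristic (an assumption, not dateparser behaviour): a plausible date
# fragment contains either a word of 3+ letters (month or weekday name,
# "today", ...) or two digits joined by a common date/time separator.
_WORD = re.compile(r"[^\W\d_]{3,}")
_SEPARATED_DIGITS = re.compile(r"\d[/.:\-]\d")


def looks_like_date(fragment):
    """Return True if the matched text plausibly denotes a date."""
    return bool(_WORD.search(fragment) or _SEPARATED_DIGITS.search(fragment))


def drop_false_positives(matches):
    """Filter search_dates()-style tuples whose first item is the matched text."""
    return [match for match in matches if looks_like_date(match[0])]


# Fragments reported in this thread, in the shape returned when
# add_detected_language=True is passed.
reported = [
    ("32° 34’S", datetime(2013, 10, 16, 23, 59, 7), "en"),
    ("123°", datetime(1900, 1, 1, 1, 2, 3), "en"),
    ("000", datetime(1900, 1, 1, 0, 0), "en"),
    ("of 629", datetime(1900, 1, 1, 6, 2, 9), "en"),
    (">21", datetime(1900, 1, 1, 2, 1), "en"),
    ("3533", datetime(2033, 5, 3, 0, 0), "en"),
    ("538400", datetime(8400, 3, 5, 0, 0), "en"),
]

filtered = drop_false_positives(reported)
```

In practice the list passed to `drop_false_positives` would be the return value of search_dates(), taking care to handle the case where it returns None. The threshold of three letters is what rejects 'of 629'; tightening or loosening it trades false positives against missed dates.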

noviluni commented 4 years ago

FYI some cases will be fixed in the next version (after merging this: https://github.com/scrapinghub/dateparser/pull/786)

gavishpoddar commented 3 years ago

It seems #786 has been merged. Can you please close this issue?

Gallaecio commented 3 years ago

Does https://github.com/scrapinghub/dateparser/pull/786 fix all cases reported here? Otherwise, it makes sense to keep this open.