scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

Strange parser error: search_dates parses "2010 Year" to a date with year of 4033 #1193

Open leeprevost opened 8 months ago

leeprevost commented 8 months ago

Very strange issue.

dateparser.__version__
'1.1.8'

import datetime
from dateparser.search import search_dates

settings = {
    'RELATIVE_BASE': datetime.datetime(2023, 7, 31, 0, 0),
    'PREFER_DAY_OF_MONTH': 'first',
    'PREFER_DATES_FROM': 'future',
    'REQUIRE_PARTS': ['year', 'month'],
    'DATE_ORDER': 'YMD',
}
s = 'Closing Yield, 2010 Year Treasury notes On Dec 31, 2023'
search_dates(s, settings=settings)

Result:

[('2010 Year', datetime.datetime(4033, 7, 31, 0, 0)),
 ('On Dec 31, 2023', datetime.datetime(2023, 12, 31, 0, 0))]

(impossible year 4033 from the first part of the parse)

I also posted this question on Stack Overflow.

Gallaecio commented 8 months ago

This is because “year” is interpreted the same as “years”, and “2010 years” is interpreted as “2010 years later”.
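To make the arithmetic concrete, here is a minimal sketch that reproduces the reported year using only the standard library. It does not call dateparser and is not how dateparser computes the value internally. It assumes only that "2010 years later" is counted from the `RELATIVE_BASE` given in the report. The variable names are illustrative.

```python
import datetime

# RELATIVE_BASE from the settings in the original report.
relative_base = datetime.datetime(2023, 7, 31, 0, 0)

# "2010 Year" read as a relative offset of 2010 years into the future.
years_offset = 2010

# Shift the base date forward by that many years (illustration only).
shifted = relative_base.replace(year=relative_base.year + years_offset)

print(shifted)  # 4033-07-31 00:00:00
```

This matches the first tuple in the reported output: 2023 + 2010 = 4033, with the month and day taken from the base date.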

Maybe we could make it so that if it is year, singular, it only works like that for “1 year”, and otherwise it gets translated to “year 2010” for example. But it may not be trivial to address.

leeprevost commented 8 months ago

OK, thank you. I can work around this now that I know what the rules are. Could you point me to the source so that I can see the ruleset? And is that user configurable?
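One possible workaround, sketched here under stated assumptions, is to post-filter the list returned by `search_dates` and drop any match whose text is just a number followed by "year" or "years". The helper name `drop_bare_year_counts` and the regular expression are hypothetical and not part of dateparser. The sample input is copied from the output reported above, so the sketch runs without dateparser installed.

```python
import datetime
import re

# Hypothetical pattern: text that is only a number followed by "year"/"years",
# e.g. "2010 Year". Not taken from dateparser's own rules.
BARE_YEAR_COUNT = re.compile(r"^\s*\d+\s+years?\s*$", re.IGNORECASE)


def drop_bare_year_counts(matches):
    """Drop (text, datetime) pairs whose text is a bare '<number> year(s)'.

    Accepts None as well as a list, so the return value of search_dates
    can be passed straight in even when nothing was found.
    """
    return [
        (text, parsed)
        for text, parsed in (matches or [])
        if not BARE_YEAR_COUNT.match(text)
    ]


# Sample data copied from the result shown in the report.
results = [
    ("2010 Year", datetime.datetime(4033, 7, 31, 0, 0)),
    ("On Dec 31, 2023", datetime.datetime(2023, 12, 31, 0, 0)),
]

filtered = drop_bare_year_counts(results)
print(filtered)
```

The filter also discards legitimate relative phrases such as "2 years", so it only suits inputs where a bare year count is never meant as a date.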

Gallaecio commented 8 months ago

The code base is relatively complex, and I don’t think this case is user configurable at the moment.

leeprevost commented 8 months ago

OK - I thought I saw a definitions page with the regex sequences it was using to parse. But, if not easy, I'll work around this. Want me to close this out?

Gallaecio commented 8 months ago

> Want me to close this out?

No, I think this is a valid issue, and we want to eventually address it.