scrapinghub / dateparser

python parser for human readable dates
BSD 3-Clause "New" or "Revised" License

dateparser can't handle cases where the same word has different meanings #676

Open noviluni opened 4 years ago

noviluni commented 4 years ago

In some languages, the same word can have different meanings. This is causing issues like this one: https://github.com/scrapinghub/dateparser/issues/337

I'm creating this issue to track this and try to find a solution.

As far as I have seen, the translation is performed before the date is parsed, so we can't use the rest of the string as context to select the right meaning. On top of that, those words are stored in a Python dictionary of "word: meaning" entries, so an entry overwrites any earlier entry for the same word with a different meaning.
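To illustrate the overwriting problem, here is a minimal sketch. The pairs and the helper names (`build_flat`, `build_multi`) are made up for this example and are not dateparser's real data or code; the assumption is that "mar" can abbreviate both "marzo" and "martedì" in Italian.

```python
# Hypothetical (word, meaning) pairs; NOT dateparser's real translation data.
ITALIAN_PAIRS = [
    ("mar", "march"),    # assumed abbreviation of "marzo"
    ("mar", "tuesday"),  # assumed abbreviation of "martedì"
]


def build_flat(pairs):
    """Build a plain word -> meaning dict; a later pair overwrites an earlier one."""
    table = {}
    for word, meaning in pairs:
        table[word] = meaning
    return table


def build_multi(pairs):
    """Build a word -> list-of-meanings dict that keeps every meaning."""
    table = {}
    for word, meaning in pairs:
        table.setdefault(word, []).append(meaning)
    return table


flat = build_flat(ITALIAN_PAIRS)
multi = build_multi(ITALIAN_PAIRS)
print(flat)   # only one meaning of "mar" survives
print(multi)  # both meanings of "mar" are kept
```

Keeping a list per word does not solve the ambiguity by itself, but it at least preserves the information needed to choose a meaning later.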

Using regex simplifications could fix some use cases, but it is not a valid approach for most of them.

In some cases (like the word "mar" in Italian), we could detect when an element appears twice (for example, two months), check whether any of them comes from a duplicated key, and try to find the real meaning, but this would probably be hard to implement.
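A rough sketch of that idea, under stated assumptions: the `MEANINGS` table, the categories and the `resolve` function are hypothetical and only meant to show the "no category twice" check, not how dateparser works internally.

```python
# Hypothetical data: each meaning is a (category, english_word) pair.
MEANINGS = {
    "mar": [("month", "march"), ("weekday", "tuesday")],
    "marzo": [("month", "march")],
}


def resolve(tokens):
    """Pick one meaning per token so that no category is used twice."""
    used = set()
    result = [None] * len(tokens)

    # First pass: unknown tokens pass through, unambiguous ones claim a category.
    for i, tok in enumerate(tokens):
        options = MEANINGS.get(tok)
        if options is None:
            result[i] = tok  # e.g. numbers are left untouched
        elif len(options) == 1:
            category, word = options[0]
            used.add(category)
            result[i] = word

    # Second pass: ambiguous tokens take the first category that is still free.
    for i, tok in enumerate(tokens):
        if result[i] is None:
            for category, word in MEANINGS[tok]:
                if category not in used:
                    used.add(category)
                    result[i] = word
                    break
            else:
                # Every category is taken: fall back to the first meaning.
                result[i] = MEANINGS[tok][0][1]
    return result


print(resolve(["mar", "4", "marzo", "2014"]))
print(resolve(["4", "mar", "2014"]))
```

This greedy version is order-dependent: when two ambiguous tokens appear in the same string (for example "mar" twice), it simply assigns categories left to right, which is part of why a robust implementation would be hard.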

If you have any ideas, please don't hesitate to comment on this issue.


List of cases:

And there are a lot of other cases where the same word is used with different meanings:

noviluni commented 3 years ago

There is a new case: in French "sept" means "September" but also "seven" and it gets confused.

Original issue: https://github.com/scrapinghub/dateparser/issues/819

Gallaecio commented 3 years ago

One crude approach I can think of is to translate in all possible ways: create one copy of the translated string for each combination of possible interpretations, and yield the result of the first one that parses.

However, instead of passing the first translation through all parsers and then trying the second translation, it may make sense to pass all translations through the first parser, then all translations through the second parser, etc.
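A minimal sketch of both orderings, assuming a toy setup: the `MEANINGS` table, the two parsers and all function names below are invented for illustration and do not correspond to dateparser's actual parsers.

```python
from datetime import date
from itertools import product

# Hypothetical table: French "sept" read as the number 7 or as September.
MEANINGS = {"sept": ["7", "september"]}
MONTHS = {"september": 9}


def candidate_translations(tokens):
    """Yield one translated string per combination of interpretations."""
    options = [MEANINGS.get(tok, [tok]) for tok in tokens]
    for combo in product(*options):
        yield " ".join(combo)


def _safe_date(year, month, day):
    try:
        return date(year, month, day)
    except ValueError:
        return None


def parse_month_name(text):
    """Toy parser for '<day> <month name> <year>'."""
    parts = text.split()
    if len(parts) == 3 and parts[0].isdigit() and parts[1] in MONTHS and parts[2].isdigit():
        return _safe_date(int(parts[2]), MONTHS[parts[1]], int(parts[0]))
    return None


def parse_numeric(text):
    """Toy parser for '<day> <month number> <year>'."""
    parts = text.split()
    if len(parts) == 3 and all(p.isdigit() for p in parts):
        return _safe_date(int(parts[2]), int(parts[1]), int(parts[0]))
    return None


PARSERS = [parse_month_name, parse_numeric]


def first_match_translation_major(tokens):
    """Try every parser on the first translation, then on the second, etc."""
    for text in candidate_translations(tokens):
        for parser in PARSERS:
            result = parser(text)
            if result is not None:
                return result
    return None


def first_match_parser_major(tokens):
    """Try the first parser on every translation, then the second parser, etc."""
    translations = list(candidate_translations(tokens))
    for parser in PARSERS:
        for text in translations:
            result = parser(text)
            if result is not None:
                return result
    return None


tokens = ["12", "sept", "2020"]
print(first_match_translation_major(tokens))
print(first_match_parser_major(tokens))
```

In this toy setup the two orderings disagree: the translation-major loop accepts "12 7 2020" through the numeric parser, while the parser-major loop lets the month-name parser see every translation first and returns the September date. Which ordering is preferable in practice depends on how the real parsers are prioritised.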

Hopefully, setting a language and a date order will make it possible to obtain the expected result in most cases.