scrapinghub / number-parser

Parse numbers written in natural language
BSD 3-Clause "New" or "Revised" License

Support ordinals #6

Open noviluni opened 4 years ago

noviluni commented 4 years ago

I'm opening this ticket to track the ordinals feature.

From my understanding, what we should achieve is:

>>> parse('first')
'1st'

>>> parse('second')
'2nd'

>>> parse('third')
'3rd'

>>> parse('twenty-third')
'23rd'

>>> parse('thirtieth')
'30th'

However, since we support other words in the sentence, we should probably take care of some ambiguous words. I would take special care with "second". I think it should be translated to "2nd" only when it's not preceded by:

Of course, this logic would probably need to be applied only to some languages, so it shouldn't live in the main logic but in a language-specific section.
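
As a rough sketch of the output format in the examples above (not the library's implementation; `to_ordinal_string` is a hypothetical helper name), a numeric value can be turned into the suffixed English string like this:

```python
def to_ordinal_string(n: int) -> str:
    """Format a non-negative integer as an English ordinal such as '23rd'."""
    if n < 0:
        raise ValueError("expected a non-negative integer")
    # Numbers whose last two digits are 10-20 always take 'th'
    # (11th, 12th, 13th, 111th, ...), even if they end in 1, 2 or 3.
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


for value in (1, 2, 3, 23, 30):
    print(value, "->", to_ordinal_string(value))
```

This covers only the English suffix rule; recognising the words themselves, and deciding when "second" is an ordinal at all, is a separate step.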

arnavkapoor commented 4 years ago

Hi @noviluni, I have begun working on support for ordinal numbers. I believe the best approach is to create a structure similar to the one used for cardinal numbers. Another direction was to somehow extend the cardinal numbers to handle ordinals too (storing only an additional suffix, for example "th" for English). However, ordinal and cardinal numbers differ substantially in other languages, for example in Spanish:

22 - veintidós
22nd - vigésimo segundo

So I plan to update the data files with the structure proposed below. I am also thinking of adding tokens for negative and decimal numbers for future features (for English, negative_tokens might be 'minus' and 'negative', and decimal_tokens would be 'point' and 'dot').

{
    "CARDINAL_NUMBERS": {
        "UNIT_NUMBERS": {},
        "DIRECT_NUMBERS": {},
        "TENS": {},
        "HUNDREDS": {},
        "BIG_POWERS_OF_TEN": {}
    },
    "ORDINAL_NUMBERS": {
        "UNIT_NUMBERS": {},
        "DIRECT_NUMBERS": {},
        "TENS": {},
        "HUNDREDS": {},
        "BIG_POWERS_OF_TEN": {}
    },
    "SKIP_TOKENS": [],
    "NEGATIVE_TOKENS": [],
    "DECIMAL_TOKENS": [],
    "LONG_SCALE": false
}
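
To illustrate how a separate ORDINAL_NUMBERS section could be used, here is a minimal sketch. It assumes the proposed layout, with made-up and heavily trimmed sample data for English and Spanish; `ordinal_value` and `_flatten` are hypothetical names, not existing library functions. It only sums token values, so it handles numbers below one hundred and does not validate word order.

```python
# Hypothetical, trimmed sample data following the proposed layout.
ENGLISH = {
    "CARDINAL_NUMBERS": {
        "UNIT_NUMBERS": {"one": 1, "two": 2, "three": 3},
        "TENS": {"twenty": 20, "thirty": 30},
    },
    "ORDINAL_NUMBERS": {
        "UNIT_NUMBERS": {"first": 1, "second": 2, "third": 3},
        "TENS": {"twentieth": 20, "thirtieth": 30},
    },
    "SKIP_TOKENS": ["and"],
}

SPANISH = {
    "CARDINAL_NUMBERS": {
        "DIRECT_NUMBERS": {"veintidós": 22},
    },
    "ORDINAL_NUMBERS": {
        "UNIT_NUMBERS": {"primero": 1, "segundo": 2, "tercero": 3},
        "TENS": {"décimo": 10, "vigésimo": 20},
    },
    "SKIP_TOKENS": [],
}


def _flatten(groups):
    """Merge the per-category word tables into a single word -> value dict."""
    flat = {}
    for words in groups.values():
        flat.update(words)
    return flat


def ordinal_value(text, data):
    """Return the integer value of an ordinal phrase, or None if it is not one."""
    cardinals = _flatten(data["CARDINAL_NUMBERS"])
    ordinals = _flatten(data["ORDINAL_NUMBERS"])
    tokens = [
        token
        for token in text.lower().replace("-", " ").split()
        if token not in data["SKIP_TOKENS"]
    ]
    # The phrase counts as an ordinal only if its last word is an ordinal word.
    if not tokens or tokens[-1] not in ordinals:
        return None
    total = 0
    for token in tokens:
        if token in ordinals:
            total += ordinals[token]
        elif token in cardinals:
            total += cardinals[token]
        else:
            return None  # unknown word: not a pure number phrase
    return total


print(ordinal_value("twenty-third", ENGLISH))
print(ordinal_value("vigésimo segundo", SPANISH))
```

English mixes a cardinal word with a final ordinal word ("twenty" + "third"), while Spanish uses ordinal words throughout ("vigésimo" + "segundo"), which is why a suffix-only extension of the cardinal tables would not be enough.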

noviluni commented 4 years ago

Hi @arnavkapoor! It looks good! However, I'm not 100% sure about adding negative and decimal tokens right now for two reasons:

Does this make sense?

About the naming, it's ok :). Maybe we could change CARDINAL_NUMBERS to just NUMBERS, but it's up to you.

arnavkapoor commented 4 years ago

Currently, ordinal number support exists only for English: https://github.com/arnavkapoor/number-parser/pull/31#pullrequestreview-461492622. Changes are needed to incorporate other languages. One option is to update _apply_cardinal_conversion, mentioned here, for other languages: https://github.com/arnavkapoor/number-parser/pull/31#issuecomment-669913867. The other is to create the same structure for ordinal numbers as for cardinal numbers. The merged PR adding ordinal number support for English is #35.