scrapinghub / number-parser

Parse numbers written in natural language
BSD 3-Clause "New" or "Revised" License

dangling "and" is swallowed #75

Closed ludgerheck closed 2 years ago

ludgerheck commented 2 years ago

When I enter "hundredandfive thirtyone and some other text", I get ['105', '31', 'some', 'other', 'text']. It seems as if the dangling "and" is swallowed.

dhananjaypai08 commented 2 years ago

Right now, while parsing, the input string is tokenized, and the tokens come out as ['hundredandfive', ' ', 'thirtyone', ' ', 'and', ' ', 'some', ' ', 'other', ' ', 'text']. So 'hundredandfive' and 'thirtyone' will not be parsed, because spaces are currently expected between the words that make up a number. E.g. fiftynine -> fifty nine = 59, hundredandfifty -> hundred and fifty = 150.
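The token list quoted above can be reproduced with a plain regular-expression split that keeps the separators. This is only a minimal sketch: that number-parser tokenizes in exactly this way internally is an assumption, and `tokenize` is a name chosen here for illustration.

```python
import re

def tokenize(text):
    # Split on every non-word character and keep each separator as its own
    # token (the capturing group makes re.split return the separators too).
    return re.split(r"(\W)", text)

tokens = tokenize("hundredandfive thirtyone and some other text")
print(tokens)
```

Because the split only happens at non-word characters, a compound such as 'thirtyone' stays a single token and is never seen as 'thirty' followed by 'one'.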

expected input string: parse('hundred and five thirty one and some other text')

@Gallaecio can we come up with some way to make the spaces between the words of a single number optional (e.g. thirtyone -> 31)?

ludgerheck commented 2 years ago

I am sorry! I should not have left out the intermediate steps. I wrote a tokenizer that splits up the compounds, so what actually gets to number-parser is "hundred and five thirty one and some other text" -> ['105', '31', 'some', 'other', 'text']. And yes, the "and" is swallowed.

Incidentally, this is what I did:

    import re
    import number_parser

    # numwords and canonise_en are members of a class (hence `self`);
    # the class statement itself was not part of the post
    numwords = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety', 'hundred', 'thousand', 'million', 'billion', 'trillion']

    def canonise_en(self, words):
        # replace '-' with ' ' and split into single words
        wordlist = words.replace('-', ' ').split(" ")
        newwordlist = []
        for word in wordlist:
            newword = word
            for key in self.numwords:
                pattern = "({})(?!t?(een|y))".format(key)
                newword = re.sub(pattern, r" \1", newword)
            # this is to check that we haven't overdone our splitting: we only split
            # if the complete word can then be parsed
            # as in stentorian or fourwheeldrive, they will not be split, but
            # four-wheel-drive will (and lose its - :( )
            ret = number_parser.parse(newword)
            m = re.search(r"\D", ret)
            newwordlist.append(word if m else newword.strip())
        # recombine the results to a single string, remove all unnecessary
        # blanks and return
        words = ' '.join(newwordlist).strip()
        words = re.sub(r"\s\s+", " ", words)
        return words

This will yield

    'seventeenhundredandeightynine' -> ' seventeen hundred and eighty  nine ' -> ['1789']
    'hundredandfive thirtyone and some other text' -> 'hundred and five thirty one and some other text' -> ['105', '31', 'some', 'other', 'text']
    'and twohundredandtwentyfivethousandeighthundredandthirtynine and and' -> 'and two hundred and twenty five thousand eight hundred and thirty nine and and' -> ['and', '', '225839']
    'and twohundredandtwentyfivethousandeighthundredandthirtynine no and' -> 'and two hundred and twenty five thousand eight hundred and thirty nine no and' -> ['and', '', '225839', 'no', 'and']
    'two hundred and fifty five and' -> ' two hundred and fifty five and' -> ['255']
    'and now: one two three' -> 'and now: one two three' -> ['and', 'now:', '1', '2', '3']
    'the car has a fourwheel drive and one hundred hp' -> 'the car has a fourwheel drive and one hundred hp' -> ['the', 'car', 'has', 'a', 'fourwheel', 'drive', 'and', '100', 'hp']
    'the car has a four-wheel-drive and a hundred hp' -> 'the car has a four wheel drive and a hundred hp' -> ['the', 'car', 'has', 'a', '4', 'wheel', 'drive', 'and', 'a', '100', 'hp']
    'fiveandthirty' -> 'fiveandthirty' -> ['fiveandthirty']
    'five and thirty' -> 'five and thirty' -> ['5', 'and', '30']
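To see what the negative lookahead `(?!t?(een|y))` in the pattern does, here is a minimal sketch that applies it for one number word at a time. The helper name `split_before` is hypothetical and only serves this illustration; the pattern itself is the one from the code above.

```python
import re

def split_before(key, text):
    # Insert a space before `key`, unless `key` is directly followed by
    # "een", "teen", "y" or "ty" (so "seven" inside "seventeen" or
    # "eight" inside "eighty" is left alone).
    pattern = "({})(?!t?(een|y))".format(key)
    return re.sub(pattern, r" \1", text)

print(split_before("nine", "eightynine"))   # "nine" is a complete word here, so it is split off
print(split_before("eight", "eightynine"))  # "eight" is followed by "y", so nothing changes
print(split_before("seven", "seventeen"))   # "seven" is followed by "teen", so nothing changes
print(split_before("ten", "stentorian"))    # over-splitting: "ten" is found inside an ordinary word
```

The last call shows why a second check is needed: the pattern alone cannot tell a number word from the same letters inside an unrelated word.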

Leading and single "and"s are obviously fine, but repeated "and"s after a converted number all seem to be swallowed (but not with a single digit, as in "five and thirty"). And obviously the archaic "five and thirty" does not convert (just an observation, no CR).

As it is, it is good enough for me, but it may still have some issues (it loses "-", and maybe more, punctuation marks perhaps) :( It does handle this, though: 'he has a stentorian voice' -> 'he has a stentorian voice' -> ['he', 'has', 'a', 'stentorian', 'voice']
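The "only split if the complete word can then be parsed" check is what keeps 'stentorian' intact. The sketch below isolates that check. It makes two loud assumptions: `toy_parse` is a tiny stand-in that knows three words and is not `number_parser.parse`, and the candidate is stripped before parsing, which the posted code does not do. `keep_if_numeric` is likewise a hypothetical name.

```python
import re

# Stand-in vocabulary for illustration only; not number-parser's data.
TOY_VALUES = {"one": 1, "ten": 10, "thirty": 30}

def toy_parse(text):
    # Assumption: mimics a parser that returns digits when every word is a
    # known number word, and returns the text unchanged otherwise.
    words = text.split()
    if words and all(w in TOY_VALUES for w in words):
        return str(sum(TOY_VALUES[w] for w in words))
    return text

def keep_if_numeric(original, candidate, parse=toy_parse):
    # Accept the split candidate only if parsing it leaves nothing but
    # digits; any non-digit means the split went too far.
    parsed = parse(candidate.strip())
    return original if re.search(r"\D", parsed) else candidate.strip()

print(keep_if_numeric("thirtyone", " thirty one"))
print(keep_if_numeric("stentorian", "s tentorian"))
```

With the real library, `number_parser.parse` would be passed as the `parse` argument in place of the stand-in.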

Feel free to use/modify

Greetings Ludger

(Sent by email in reply to Dhananjay Pai's comment of 12.01.2022, 21:40.)

dhananjaypai08 commented 2 years ago

Okay, so you have to parse 'hundred' and 'five' separately in the parse function. The thing is, in number_parser the number tokens are built and added up as they are read, so even if 'and' was not included in the string, it would still parse to '105 and 31 and some other text'. The only 'and' that is consumed is the one between 'hundred' and 'five', which is part of building the number. Even if that one "and" were not present in the string, it would still be parsed as expected.

    input_string = "hundred and five thirty one and some other text"
    output_string = "105 and 31 and some other text"

Works fine, I guess. If you need to parse 'hundred' and 'five' separately, that would be a different case, where using parse_number would be nice. Or did I not understand your issue in the first place? Let me know.