spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

smarter number sequences - 'one one twenty one' #462

Open hboylan opened 6 years ago

hboylan commented 6 years ago

Found another head-scratcher... At first I thought it was merely a limitation, but the issue appears to pop up only if there is an even number of individual digits in front of a multi-digit value.

Hopefully this example explains more clearly:

const nlp = require('compromise')
const arr = str => nlp(str).values().out('array')

// digits leading
console.log(arr('one twenty one')) // ['one', 'twenty one']
console.log(arr('one one twenty one')) // ['one one twenty one'] (x)
console.log(arr('one one one twenty one')) // ['one', 'one', 'one', 'twenty one']
console.log(arr('one one one one twenty one')) // ['one one one one twenty one'] (x)

// digits trailing
console.log(arr('one twenty one one')) // ['one', 'twenty one', 'one']
console.log(arr('one twenty one one one')) // ['one', 'twenty one', 'one', 'one']

I'm merely using "one" and "twenty one" to demonstrate the point; it occurs with any even number of individual digits in front of a multi-digit value, but not after one. :thinking:

FYI, our team is trying to use this fantastic library to capture "creative pronunciations" of values.

spencermountain commented 6 years ago

ah, nice find Hugh. Yeah, this is pretty doable. I'm going on vacation next week, so I may not get to it before April. If you (or anyone) are inclined to tackle this, this lumping/splitting is found here

afaik, any numbers 0-9 should not be lumped together, same for teen-numbers, or multiples like thousand thousand (except for hundred thousand!)

open to a PR
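
The lumping rules suggested above can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not compromise's actual implementation: `splitNumberWords`, `canLump` and `kindOf` are hypothetical helper names, the word lists are deliberately tiny, and "zero"/"oh" are left out.

```javascript
// Hypothetical helpers, NOT part of compromise: group spoken number words
// so that digits, teens and repeated multiples are never lumped together.
const UNITS = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'];
const TEENS = ['ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
  'sixteen', 'seventeen', 'eighteen', 'nineteen'];
const TENS = ['twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety'];
const MULTIPLES = ['hundred', 'thousand', 'million'];

function kindOf(word) {
  if (UNITS.includes(word)) return 'unit';
  if (TEENS.includes(word)) return 'teen';
  if (TENS.includes(word)) return 'tens';
  if (MULTIPLES.includes(word)) return 'multiple';
  return 'other';
}

// Can `next` continue the number that currently ends with `prev`?
function canLump(prev, next) {
  const p = kindOf(prev);
  const n = kindOf(next);
  if (p === 'other' || n === 'other') return false;
  if (n === 'multiple') {
    if (p !== 'multiple') return true; // 'one hundred', 'twenty thousand'
    // 'hundred thousand' is fine, 'thousand thousand' is not
    return prev === 'hundred' && next !== 'hundred';
  }
  // a unit may follow a tens word ('twenty one') or a multiple ('hundred one')
  if (n === 'unit') return p === 'tens' || p === 'multiple';
  // teens and tens words may only follow a multiple ('one hundred twenty')
  return p === 'multiple';
}

function splitNumberWords(str) {
  const groups = [];
  let prev = null;
  for (const word of str.toLowerCase().split(/\s+/).filter(Boolean)) {
    if (prev !== null && canLump(prev, word)) {
      groups[groups.length - 1] += ' ' + word;
    } else {
      groups.push(word);
    }
    prev = word;
  }
  return groups;
}

console.log(splitNumberWords('one one twenty one')); // ['one', 'one', 'twenty one']
console.log(splitNumberWords('one twenty one one')); // ['one', 'twenty one', 'one']
```

Because each decision looks only at the previous word, the result does not depend on whether an odd or even number of digits comes first.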

hboylan commented 6 years ago

Great, I'll take a look at this.

Enjoy your vaca! :sunglasses:

hboylan commented 6 years ago

Tinkered with this over the weekend. Was able to patch this root issue, but it didn't work with #Multiple in the value.

At any rate, after careful consideration, I think declaring values like this is too ambiguous anyway. (ie. is "one one twenty one" = 1121 or 11201?) It actually depends on the short pauses between the digits when spoken.

Going to close this out. If anyone else runs into something like this, I'd recommend having your chatbot prompt the user to pronounce the number/value a different way.

scagood commented 6 years ago

This reminds me of the way we (in the UK) say phone numbers. Let's say you have the following UK phone number: 07770 11 22 33. There are two different ways of saying this. You could read it directly, for example:

0 seven seven seven 0 one one two two three three

Or, in the UK you could/would say:

0 triple seven 0, double one, double two, double three
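
As a rough illustration of the "double"/"triple" reading above, the expansion back into a digit string might look like the sketch below. `expandSpokenDigits` is a hypothetical helper (not a compromise API), and it assumes the input contains only single digits, digit words, and the words "double" and "triple".

```javascript
// Hypothetical helper, NOT part of compromise.
const DIGITS = new Map([
  ['zero', '0'], ['oh', '0'], ['one', '1'], ['two', '2'], ['three', '3'],
  ['four', '4'], ['five', '5'], ['six', '6'], ['seven', '7'],
  ['eight', '8'], ['nine', '9'],
]);
const REPEATS = new Map([['double', 2], ['triple', 3]]);

function expandSpokenDigits(text) {
  const words = text.toLowerCase().replace(/,/g, ' ').split(/\s+/).filter(Boolean);
  let out = '';
  let repeat = 1;
  for (const word of words) {
    if (REPEATS.has(word)) {
      // remember the multiplier and apply it to the next digit
      repeat = REPEATS.get(word);
      continue;
    }
    const digit = /^\d$/.test(word) ? word : DIGITS.get(word);
    if (digit === undefined) throw new Error('not a digit word: ' + word);
    out += digit.repeat(repeat);
    repeat = 1;
  }
  return out;
}

console.log(expandSpokenDigits('0 triple seven 0, double one, double two, double three'));
// 07770112233
```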

I suppose there are other ways of saying phone numbers too, I just can't think of any.

Is that the sort of thing you were thinking about @hboylan?

Also, I have heard one twenty one said before, using your first example

spencermountain commented 6 years ago

yeah, I'll leave this open, we should fix this.

> Also, I have heard one twenty one said before, using your first example

ooh, good point.

hboylan commented 6 years ago

@scagood Correct, but it's tough to accurately decipher certain "creative pronunciations" like this. compromise does a great job with most, but certain combinations are ambiguous.

> Let's say you have the following UK phone number: 07770 11 22 33.

In the US, this could be pronounced: oh seven seventy seven oh eleven twenty two thirty three

Or for a US phone number, 555-123-4567: five fifty five one twenty three forty five sixty seven

It gets a bit ambiguous when trying to convert these into number values: ['5', '55', '123', '45', '67'] :heavy_check_mark: vs. ['5', '55', '1', '23', '45', '67'] :heavy_check_mark: vs. ['5', '55', '120', '340', '567'] :x: etc.

hboylan commented 6 years ago

I believe the ultimate goal here is to capture nominal values in addition to cardinal and ordinal.

My thought is that the lumping/splitting would still need to be enhanced to capture more advanced number sequences. Then, there might be another step to determine whether the value is nominal. Maybe by using a separate toNominal() function or something...

[Image: cardinal-ordinal-nominal]

scagood commented 6 years ago

This is a nice idea, how do you think we could differentiate your three categories in text?

I would assume there would have to be a system in place to identify nominal numbers. Or possibly the inverse, to identify 'not' nominal numbers, using a naive match like '#Value #Noun'?

This would have to be situational, I think? By this I mean that different places all over the world use different systems, so simply assuming one expected phone number format just wouldn't do in other countries. As an example, here in the UK we use a phone system that would match a regex like this (assuming no white space):

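// Backslashes are doubled because the pattern is assembled from string literals;
// the literal "+" in "+44" must be escaped as well.
const ukPhone = new RegExp('(' +
    '(?:0|0044|\\+44)(?:' +
        '(?:1\\d{8})|' +
        '(?:[1-37-9]\\d{9})|' +
        '(?:5[56]\\d{8})|' +
        '(?:[58]00(?:\\d{6}|1{4}))|' +
        '(?:845464\\d)|' +
        '(?:\\d{6})|' +
        '(?:147[0157])|' +
        '(?:1571)|' +
        '(?:999)|' +
        '(?:10[15])|' +
        '(?:11[128])|' +
        '(?:123)' +
    ')' +
')');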

See: https://en.wikipedia.org/wiki/Telephone_numbers_in_the_United_Kingdom regarding the uk phone mess system :')

Whereas the US phone system, as an example, is completely different: 555-123-4567 does not match the previous regex.

This then generally means that nominal numbers can't be predicted unless their formats have previously been added to the environment/world. If the system is expecting a nominal number in the form \d{3}\D?\d{3}\D?\d{4}, an output like 5 55 123 45 67 (after removing whitespace) is probable. Another interpretation, 5 55 120 340 567, is less likely and won't match, so it can be interpreted as cardinal.

Would that sort of thing be along the lines of what you're thinking?

Therefore, a prototype akin to this:

nlp(...).toNominal(/\d{3}\D?\d{3}\D?\d{4}/)

Or possibly this:

doc = nlp('...', {
    nominals: {
        usPhone,
        ukPhone,
        bankCode,
        ...
    }
})

doc.toNominal();

Might work?

Am I along the right lines? Can you think of a better solution?

Should this be a plugin-esque item instead of being built into the core of compromise, as it seems very domain-specific?

hboylan commented 6 years ago

@scagood Since nominal numbers can be represented in so many different formats, I like the idea of passing a custom Regex to the function:

nlp(...).values().toNominal(/\(?\d{3}\)?\s*-?\d{3}\s*-?\d{4}/)

// example
nlp('my pin number is one two twenty').values().toNominal(/\d{4}/)
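
To make the regex-driven idea concrete, here is a minimal standalone sketch. It is an assumption-laden illustration, not an existing compromise method: `toNominal`, `readings` and `wordToDigits` are hypothetical names, the input must already contain only number words, and multiples such as "hundred" are not handled. It enumerates every way of reading the words (pairing a tens word with the following unit, or not) and keeps the digit strings that match the expected pattern in full.

```javascript
// Hypothetical standalone sketch; `toNominal` here is NOT a compromise method.
const UNITS = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'];
const TEENS = ['ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
  'sixteen', 'seventeen', 'eighteen', 'nineteen'];
const TENS = ['twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety'];

// Digit string for a single number word, or null if the word is unknown.
function wordToDigits(word) {
  if (word === 'oh') return '0';
  if (UNITS.includes(word)) return String(UNITS.indexOf(word));
  if (TEENS.includes(word)) return String(10 + TEENS.indexOf(word));
  if (TENS.includes(word)) return String(20 + 10 * TENS.indexOf(word));
  return null;
}

// Every digit-string reading of the words.
function readings(words) {
  if (words.length === 0) return [''];
  const head = wordToDigits(words[0]);
  if (head === null) return [];
  // Reading 1: the first word stands alone ('twenty' -> '20').
  const results = readings(words.slice(1)).map((rest) => head + rest);
  // Reading 2: a tens word absorbs the next unit ('twenty' 'one' -> '21').
  const unit = UNITS.indexOf(words[1]);
  if (TENS.includes(words[0]) && unit > 0) {
    const pair = String(Number(head) + unit);
    results.push(...readings(words.slice(2)).map((rest) => pair + rest));
  }
  return results;
}

// Keep only the readings that match the whole pattern.
function toNominal(text, pattern) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const anchored = new RegExp('^(?:' + pattern.source + ')$');
  return [...new Set(readings(words))].filter((r) => anchored.test(r));
}

console.log(toNominal('one two twenty', /\d{4}/)); // ['1220']
console.log(toNominal('twenty two twenty two', /\d{5}/)); // ['20222', '22202']
console.log(toNominal('five fifty five one twenty three forty five sixty seven', /\d{10}/));
// ['5551234567']
```

More than one reading can survive the filter, so a caller would still have to decide what to do with several candidates.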

Generally, cardinal numbers seem to appear individually:

"forty two memes" -> 42 memes

While nominal numbers usually appear together:

"spartan one one seven" -> spartan 117

scagood commented 6 years ago

The other thing worth noting is that processing numbers like this would take longer, as the #Value would most likely have to be split into n-grams and each one processed. With that technique there is a high chance that multiple outputs will be produced. As an example, suppose we're looking for /\d{5}/ and the string is "twenty two twenty two": possible interpretations could be 20222 or 22202.

Is spartan one one seven not also ordinal, as it's the 117th spartan?

Another problem that comes to mind is the difference between the large-number naming systems in the EU and US. Here is the obvious problem:

| Number | US Word | EU Word |
| --- | --- | --- |
| 1,000,000 | Million | Million |
| 1,000,000,000 | Billion | Milliard |
| 1,000,000,000,000 | Trillion | Billion |

etc.

This means that for larger numbers, such as 555-123-4567, if you're after really open parsing, then for example "five billion five hundred and fifty one million two hundred and thirty four thousand five hundred and sixty seven" could be 5551234567 or 5000551234567, which are clearly subtly different 😆. (Not that this is parsed correctly as is: it currently comes out as 5000000551000234000.)
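
The two totals can be checked with a little arithmetic. The sketch below assumes the phrase has already been split into count/scale-word chunks; `SCALES`, `chunks` and `totalFor` are illustrative names only, and the only difference between the two tables is the value given to "billion".

```javascript
// Assumed scale tables: "billion" is 10^9 on the short (US) scale
// and 10^12 on the long (EU) scale.
const SCALES = {
  short: { thousand: 1e3, million: 1e6, billion: 1e9 },
  long: { thousand: 1e3, million: 1e6, billion: 1e12 },
};

// "five billion, five hundred and fifty one million,
//  two hundred and thirty four thousand, five hundred and sixty seven"
const chunks = [
  [5, 'billion'],
  [551, 'million'],
  [234, 'thousand'],
  [567, null], // trailing chunk with no scale word
];

function totalFor(parts, scaleName) {
  const scale = SCALES[scaleName];
  return parts.reduce(
    (sum, [count, word]) => sum + count * (word ? scale[word] : 1),
    0
  );
}

console.log(totalFor(chunks, 'short')); // 5551234567
console.log(totalFor(chunks, 'long')); // 5000551234567
```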

My point here is how open does the parsing need to be?

The last thing that comes to mind, is what if a number set could overlap/match each other? Do both get output, or just the first one, or some kind of context based magic that picks one?