Open hboylan opened 6 years ago
ah, nice find Hugh. Yeah, this is pretty doable. I'm going on vacation next week, so may not get to it before April. If you (or anyone) are inclined to tackle this, the lumping/splitting is found here
afaik, any numbers 0-9 should not be lumped together, same for teen-numbers, or multiples like thousand thousand
(except for hundred thousand!)
open to a PR
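For anyone picking this up, the lumping rules listed above could be sketched roughly as follows. This is a standalone illustration, not compromise's actual lumping/splitting code: the word lists, `classify` and `shouldSplit` are hypothetical names, and treating "ten" as a teen-number is an assumption.

```javascript
// Word classes are assumptions for illustration, not compromise's internal lexicon.
const UNITS = new Set(['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']);
const TEENS = new Set(['ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']);
const MULTIPLES = new Set(['hundred', 'thousand', 'million', 'billion']);

// Map a number word to a coarse class.
function classify(word) {
  if (UNITS.has(word)) return 'unit';
  if (TEENS.has(word)) return 'teen';
  if (MULTIPLES.has(word)) return 'multiple';
  return 'other';
}

// Returns true when two adjacent number words should start separate values.
function shouldSplit(prev, next) {
  const a = classify(prev);
  const b = classify(next);
  if (a === 'unit' && b === 'unit') return true; // "one one"
  if (a === 'teen' && b === 'teen') return true; // "eleven twelve"
  if (a === 'multiple' && b === 'multiple') {
    // "thousand thousand" splits, but "hundred thousand" stays together.
    return !(prev === 'hundred' && next === 'thousand');
  }
  return false;
}
```

With these rules, `shouldSplit('one', 'one')` is true while `shouldSplit('hundred', 'thousand')` and `shouldSplit('twenty', 'one')` are false.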
Great, I'll take a look at this.
Enjoy your vaca! :sunglasses:
Tinkered with this over the weekend. Was able to patch this root issue, but it didn't work with #Multiple in the value.
At any rate, after careful consideration, I think declaring values like this is too ambiguous anyway. (i.e. is "one one twenty one" = 1121 or 11201?) It actually depends on the short pauses between the digits when spoken.
Going to close this out. If anyone else runs into something like this, I'd recommend having your chatbot prompt the user to pronounce the number/value a different way.
This reminds me of the way we (in the UK) say phone numbers. Let's say you have the following UK phone number: 07770 11 22 33. There are two different ways of saying this. You could read it directly, for example:
0 seven seven seven 0 one one two two three three
Or, in the UK you could/would say:
0 triple seven 0, double one, double two, double three
I suppose there are other ways of saying phone numbers too, I just can't think of any.
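Both spoken forms above reduce to the same digit string once "double" and "triple" are expanded. A minimal sketch of that expansion, independent of compromise (`expandSpokenDigits` and both lookup tables are hypothetical names, and only single digits are handled):

```javascript
// Lookup tables are assumptions for illustration; extend as needed.
const DIGIT_WORDS = new Map([
  ['zero', '0'], ['oh', '0'], ['one', '1'], ['two', '2'], ['three', '3'],
  ['four', '4'], ['five', '5'], ['six', '6'], ['seven', '7'],
  ['eight', '8'], ['nine', '9'],
]);
const REPEAT_WORDS = new Map([['double', 2], ['triple', 3]]);

// Turn "0 triple seven 0, double one" into "0777011".
function expandSpokenDigits(text) {
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  let digits = '';
  let repeat = 1;
  for (const word of words) {
    if (REPEAT_WORDS.has(word)) {
      repeat = REPEAT_WORDS.get(word); // applies to the next digit only
      continue;
    }
    const digit = /^\d$/.test(word) ? word : DIGIT_WORDS.get(word);
    if (digit === undefined) throw new Error(`Unrecognised token: ${word}`);
    digits += digit.repeat(repeat);
    repeat = 1;
  }
  return digits;
}

console.log(expandSpokenDigits('0 triple seven 0, double one, double two, double three'));
// → 07770112233
```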
Is that the sort of thing you were thinking about @hboylan?
Also, I have heard one twenty one said before, using your first example
yeah, I'll leave this open, we should fix this.
Also, I have heard one twenty one said before, using your first example
ooh, good point.
@scagood Correct, but it's tough to accurately decipher certain "creative pronunciations" like this. compromise does a great job with most, but certain combinations are ambiguous.
Let's say you have the following UK phone number: 07770 11 22 33.
In the US, this could be pronounced:
oh seven seventy seven oh eleven twenty two thirty three
Or for a US phone number, 555-123-4567:
five fifty five one twenty three forty five sixty seven
Gets a bit ambiguous when trying to convert these into number values:
['5', '55', '123', '45', '67'] :heavy_check_mark:
vs.
['5', '55', '1', '23', '45', '67'] :heavy_check_mark:
vs.
['5', '55', '120', '340', '567'] :x:
etc.
I believe the ultimate goal here is to capture nominal values in addition to cardinal and ordinal.
My thought is that the lumping/splitting would still need to be enhanced to capture more advanced number sequences. Then, there might be another step to determine whether the value is nominal, maybe by using a separate toNominal() function or something...
This is a nice idea, how do you think we could differentiate your three categories in text?
I would assume there would have to be a system in place to identify nominal numbers. Or possibly the inverse, to identify 'not' nominal numbers, using a naive match like '#Value #Noun'?
This would have to be situational I think? By this I mean different places all over the world use different systems, so simply putting an expected phone number just wouldn't do in other countries. As an example, here in the UK we use a phone system that would match a regex like this (assuming no white space):
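// Backslashes are doubled because the pattern is built from string literals,
// and the "+" in "+44" is escaped so it is read as a literal plus sign.
let ukPhone = new RegExp('(' +
'(?:0|0044|\\+44)(?:' +
'(?:1\\d{8})|' +
'(?:[1-37-9]\\d{9})|' +
'(?:5[56]\\d{8})|' +
'(?:[58]00(?:\\d{6}|1{4}))|' +
'(?:845464\\d)|' +
'(?:\\d{6})|' +
'(?:147[0157])|' +
'(?:1571)|' +
'(?:999)|' +
'(?:10[15])|' +
'(?:11[128])|' +
'(?:123)' +
')' +
')');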
See: https://en.wikipedia.org/wiki/Telephone_numbers_in_the_United_Kingdom regarding the uk phone mess system :')
Whereas the US phone system, as an example, is completely different. An example being 555-123-4567, which does not match the previous regex.
This then generally means the nominal numbers can't be predicted unless previously added to the environment/world.
If the system is expecting a nominal number in the form \d{3}\D?\d{3}\D?\d{4}, an output like 5 55 123 45 67 (after removing whitespace) is probable. However, another interpretation, 5 55 120 340 567, is less likely/won't match, and can therefore be interpreted as cardinal.
Would that sort of thing be along the lines of what you're thinking?
Therefore, a prototype akin to this:
nlp(...).toNominal(/\d{3}\D?\d{3}\D?\d{4}/)
Or possibly this:
doc = nlp('...', {
nominals: {
usPhone,
ukPhone,
bankCode,
...
}
})
doc.toNominal();
Might work?
Am I along the right lines? Can you think of a better solution?
Should this be a plugin-esque item instead of built into the core of compromise, as it seems very domain specific?
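The pattern check described above can be tried outside compromise. The following is only a sketch under assumptions: `matchNominal` and the `nominals` table are hypothetical, the patterns are simplified, and the candidate groupings are the ones listed earlier in the thread. Note the `^`/`$` anchors: without them, the 12-digit reading would still match the 10-digit phone pattern as a substring.

```javascript
// Hypothetical pattern table; the names mirror the proposal, the patterns are simplified.
const nominals = {
  usPhone: /^\d{3}\D?\d{3}\D?\d{4}$/,
  pin: /^\d{4}$/,
};

// Join one candidate grouping and report the first named pattern it fits.
function matchNominal(groups, patterns) {
  const digits = groups.join('');
  for (const [name, pattern] of Object.entries(patterns)) {
    if (pattern.test(digits)) return { name, digits };
  }
  return null; // no nominal pattern fits, so fall back to a cardinal reading
}

console.log(matchNominal(['5', '55', '123', '45', '67'], nominals));
// → { name: 'usPhone', digits: '5551234567' }
console.log(matchNominal(['5', '55', '120', '340', '567'], nominals));
// → null
```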
@scagood Since nominal numbers can be represented in so many different formats, I like the idea of passing a custom Regex to the function:
nlp(...).values().toNominal(/\(?\d{3}\)?\s*-?\d{3}\s*-?\d{4}/)
// example
nlp('my pin number is one two twenty').values().toNominal(/\d{4}/)
Generally, cardinal numbers seem to appear individually:
"forty two memes" -> 42 memes
While nominal numbers usually appear together:
"spartan one one seven" -> spartan 117
The other thing that may be worth noting is that the processing of numbers like this would take longer, as the #Value would most likely have to be split into n-grams and processed. Using that technique, there would be a high chance that multiple outputs will be produced.
As an example, let us suppose we're looking for /\d{5}/ and the string was twenty two twenty two. Possible interpretations could be 20222 or 22202.
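To make the cost and the multiple outputs concrete, here is a rough sketch that enumerates every reading of a word sequence and keeps those fitting a pattern. It is not how compromise works internally: `enumerateReadings` and `digitStrings` are hypothetical helpers, and the only choice modelled is whether a tens word merges with the unit that follows it.

```javascript
const UNITS = new Map([
  ['one', 1], ['two', 2], ['three', 3], ['four', 4], ['five', 5],
  ['six', 6], ['seven', 7], ['eight', 8], ['nine', 9],
]);
const TENS = new Map([
  ['twenty', 20], ['thirty', 30], ['forty', 40], ['fifty', 50],
  ['sixty', 60], ['seventy', 70], ['eighty', 80], ['ninety', 90],
]);

function wordValue(word) {
  if (UNITS.has(word)) return UNITS.get(word);
  if (TENS.has(word)) return TENS.get(word);
  throw new Error(`Unsupported word: ${word}`);
}

// Return every way of reading the words as a list of numbers.
function enumerateReadings(words) {
  if (words.length === 0) return [[]];
  const [head, ...tail] = words;
  // Option 1: the first word stands alone.
  const results = enumerateReadings(tail).map((rest) => [wordValue(head), ...rest]);
  // Option 2: a tens word merges with the unit after it ("twenty two" → 22).
  if (TENS.has(head) && tail.length > 0 && UNITS.has(tail[0])) {
    const merged = TENS.get(head) + UNITS.get(tail[0]);
    for (const rest of enumerateReadings(tail.slice(1))) {
      results.push([merged, ...rest]);
    }
  }
  return results;
}

// Join each reading into a digit string and keep the ones the pattern accepts.
function digitStrings(text, pattern) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const joined = enumerateReadings(words).map((reading) => reading.join(''));
  return [...new Set(joined)].filter((s) => pattern.test(s));
}

console.log(digitStrings('twenty two twenty two', /^\d{5}$/));
// → [ '20222', '22202' ]
```

The number of readings doubles with every tens-plus-unit pair, which is where the extra processing time would come from.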
Is spartan one one seven not also ordinal, as it's the 117th spartan?
Another problem that comes to mind is the difference in large numbering systems in the EU and US. Here is the obvious problem:
Number | US Word | EU Word |
---|---|---|
1,000,000 | Million | Million |
1,000,000,000 | Billion | Milliard |
1,000,000,000,000 | Trillion | Billion |
etc.
Meaning that for larger numbers, like this number: 555-123-4567, if you're after really open parsing, a phrase such as five billion five hundred and fifty one million two hundred and thirty four thousand five hundred and sixty seven (not that this is parsed correctly as is; it currently gives 5000000551000234000) could be 5551234567 or 5000551234567, which are clearly subtly different 😆.
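The two readings can be reproduced with a small scale-aware parser. This is a standalone sketch, not compromise's number parser: `parseCardinal`, `SMALL` and `SCALES` are hypothetical names, and the scale tables only cover the words in the table above.

```javascript
const SMALL = new Map(Object.entries({
  zero: 0, one: 1, two: 2, three: 3, four: 4, five: 5, six: 6, seven: 7,
  eight: 8, nine: 9, ten: 10, eleven: 11, twelve: 12, thirteen: 13,
  fourteen: 14, fifteen: 15, sixteen: 16, seventeen: 17, eighteen: 18,
  nineteen: 19, twenty: 20, thirty: 30, forty: 40, fifty: 50, sixty: 60,
  seventy: 70, eighty: 80, ninety: 90,
}));

// "short" is the US scale, "long" the traditional European one.
const SCALES = {
  short: new Map([['thousand', 1e3], ['million', 1e6], ['billion', 1e9], ['trillion', 1e12]]),
  long: new Map([['thousand', 1e3], ['million', 1e6], ['milliard', 1e9], ['billion', 1e12]]),
};

function parseCardinal(text, scaleName) {
  const multipliers = SCALES[scaleName];
  if (!multipliers) throw new Error(`Unknown scale: ${scaleName}`);
  let total = 0; // sum of the groups already closed by a scale word
  let group = 0; // value of the current group (below one thousand)
  for (const word of text.toLowerCase().split(/[\s-]+/)) {
    if (word === '' || word === 'and') continue;
    if (SMALL.has(word)) {
      group += SMALL.get(word);
    } else if (word === 'hundred') {
      group = (group || 1) * 100;
    } else if (multipliers.has(word)) {
      total += (group || 1) * multipliers.get(word);
      group = 0;
    } else {
      throw new Error(`Unsupported word: ${word}`);
    }
  }
  return total + group;
}

const phrase = 'five billion five hundred and fifty one million ' +
  'two hundred and thirty four thousand five hundred and sixty seven';
console.log(parseCardinal(phrase, 'short')); // → 5551234567
console.log(parseCardinal(phrase, 'long'));  // → 5000551234567
```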
My point here is how open does the parsing need to be?
The last thing that comes to mind is what if a number set could overlap/match each other? Do both get output, or just the first one, or some kind of context based magic that picks one?
Found another head-scratcher... At first I thought it was merely a limitation, but the issue appears to only pop up if there are an even number of individual digits in front of a multi-digit value.
Hopefully this example explains more clearly:
I'm merely using "one" and "twenty one" to demonstrate the point, but it occurs with any even number of individual digits in front of a multi-digit value, but not after. :thinking:
FYI, our team is trying to use this fantastic library to capture "creative pronunciations" of values.