Handle homophones (similar sounding words)

pixeye33 commented 11 months ago

I have a custom sentence in French like this: [mets] [le] volume [à|a] {volume} [pourcent|%] (it means "put the volume to x percent").

Often, Nabu Casa STT understands mais and the sentence does not match. To make it work, I changed it to [mets|mais] [le] volume [à|a] {volume} [pourcent|%]

[mets|mais] are two words: the first one means "put" and the second one means "but". They sound alike, and confusing them is a dyslexia symptom.

We (probably) can't fix what Nabucasa STT says it understands, but maybe we can change the displayed sentence in the prompt ?

I thought of something like this: [mets!|mais] [le] volume [à|a] {volume} [pourcent|%], with ! after the word that is the correct one.

That way, even if Nabu Casa STT understands mais volume à 10 %, it will be written as mets volume à 10 %. This could also work in expansion rules.
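A minimal sketch of that display rewrite, assuming the intent has already matched and that the proposed `!` marker exists (it is not part of hassil today). The helper names `canonical_map` and `display_text` are made up for illustration, and the rewrite ignores word position, which a real implementation would need to respect.

```python
import re


def canonical_map(template: str) -> dict[str, str]:
    """Map each alternative of a [a!|b] group to the word marked with '!'."""
    mapping = {}
    for group in re.findall(r"\[([^\]]+)\]", template):
        options = group.split("|")
        preferred = [option[:-1] for option in options if option.endswith("!")]
        if preferred:
            for option in options:
                mapping[option.rstrip("!")] = preferred[0]
    return mapping


def display_text(template: str, recognized: str) -> str:
    """Rewrite the recognized text for display only; matching is not affected."""
    mapping = canonical_map(template)
    return " ".join(mapping.get(word, word) for word in recognized.split())


TEMPLATE = "[mets!|mais] [le] volume [à|a] {volume} [pourcent|%]"
print(display_text(TEMPLATE, "mais volume à 10 %"))  # mets volume à 10 %
```

Groups without a `!` (such as [à|a]) are left untouched, so only the explicitly marked homophone is rewritten.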

To go even further, we could imagine displaying mets le volume à 10 % even if mais volume 10 was understood/said, with a sentence written like this: [mets!|mais]! [le]! volume [à|a]! {volume} [pourcent|%!]! Notice the ! after the ] as a way to allow the word not to be said while still keeping it in the displayed sentence. When using the []! syntax, if there is no ! inside the [], the first option is kept.

Side note: I do feel like this is more of a https://github.com/home-assistant/hassil issue, but so are most of the other issues in this project.

tetele commented 11 months ago

We (probably) can't fix what Nabucasa STT says it understands, but maybe we can change the displayed sentence in the prompt ?

You can't do that. Whatever the STT engine understands (whichever STT engine you use), that's what gets passed on in the pipeline to the intent recognition engine, not the other way around.

That's only fixable in one of two ways:

better STT model for your language (this is outside HA's scope)
if the STT engine supports it, multiple potential STT results with corresponding confidence scores, all of which should be parsed, in order of confidence, until one of them matches a possible sentence (this is in HA core's scope, not the intents repo)

I am not sure the second option is available in the underpinnings of the Nabu Casa STT engine, but maybe it's something @synesthesiam wants to take a look at.
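A rough sketch of the second option, assuming the STT engine returned several candidate transcripts with confidence scores (whether Nabu Casa STT can do that is not confirmed here). The regex in `PATTERNS` is a hypothetical stand-in for a real hassil template, and `first_matching` is an illustrative name, not an HA core function.

```python
import re

# Hypothetical pattern standing in for:
# [mets] [le] volume [à|a] {volume} [pourcent|%]
PATTERNS = {
    "SetVolume": re.compile(
        r"(mets )?(le )?volume (à |a )?(?P<volume>\d+) ?(pourcent|%)?"
    ),
}


def first_matching(candidates):
    """Try candidates from most to least confident; return the first match."""
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    for text, _confidence in ranked:
        for intent, pattern in PATTERNS.items():
            match = pattern.fullmatch(text.lower().strip())
            if match:
                return intent, text, match.groupdict()
    return None


# Assumed N-best output from an STT engine: (transcript, confidence).
n_best = [("mais le volume à 10 %", 0.9), ("mets le volume à 10 %", 0.7)]
print(first_matching(n_best))
```

The top-ranked transcript fails to match, so the second one is used, and no homophone has to be added to the sentence template.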

notice the ! after the ] as a way to allow the word not to be said, but still keeping it in the displayed sentence.

What is the use case here? I mean... if you don't say the words and the recognized words get passed on to the intent recognition service (e.g. volume à 60 %), which needs to match to your sentence ([mets!|mais]! [le]! volume [à|a]! {volume} [pourcent|%!]!), who cares what gets displayed? The intent recognition has already taken place at this point, right?

pixeye33 commented 11 months ago

You can't do that. Whatever the STT engine understands (whichever STT engine you use), that's what gets passed on in the pipeline to the intent recognition engine, not the other way around.

I'm aware; I'm not suggesting an intent -> STT engine flow.

Here it is put another way: I say mets, STT understands mais, and the intent recognition engine matches [mets!|mais]. What I'm suggesting is to rewrite the end result, using the intent string that matched as a "regexp": mets instead of mais, only for display purposes. I don't care if STT understood mais, as long as the action is triggered, but it makes me cry to see mais written, as if I did not know how to spell.

Confidence scores, if they exist, are probably the better way, I agree: I say mets, STT understands mais (90%) or mets (70%), and the intent matches [mets].

But implementing that solution is probably more complex, further in the future, and more STT engine dependent than a simple rewrite of the displayed text.

What is the use case here? who cares what gets displayed? The intent recognition has already taken place at this point, right?

Yes, the intent has already happened, and the action too. But reading a full sentence is more appealing than a bunch of keywords.

Note: I'm convinced that in the future we will say fewer words (laziness) and yet have a full sentence written (more satisfying). This was a way to have that without much additional work.

tetele commented 11 months ago

What I'm suggesting is to rewrite the end result, using the intent string that matched as a "regexp": mets instead of mais, only for display purposes.

Written where? In the Assist dialog box? That gets displayed before any intent matching is done.

Also, the plan is to only recognize grammatically correct sentences in order to train a recognition model that can "catch" more sentences than just those which were manually defined, so recognizing mais le volume a 90% is just a band-aid on a broken bone which will do more harm than good in the long run.

thdg commented 11 months ago

My two cents on this. Handling the common errors that the STT system makes adds robustness and should be highly recommended where needed. Trying to rewrite the sentence might be useful someday, but it seems to overcomplicate things for now.

I recommend doing something like this:

[mais|<stt_error_mais>] [le] volume [à|a] {volume} [pourcent|%] with an expansion rule: stt_error_mais: "mets" or stt_mais: "(mets|...|...)" if there are other common errors (replace the dots with the other errors)

This accomplishes two things:

The added robustness adds minimal cluttering to the sentences
If the time comes, it is easy to add a fallback word for the error: stt_error_mais -> "mais"

synesthesiam commented 11 months ago

I handled this kind of issue in Rhasspy with a ":" operator in sentence templates, so "mais:mets" would match "mais" but output "mets". There were two output sentences too, one with the literally recognized text (mais) and one with the transformed text (mets).
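A toy illustration of that idea, not Rhasspy's actual implementation: each template token is either a plain word or `heard:output`, and a successful match returns both the literal and the transformed sentence. The function name `match_template` and the word-by-word matching are simplifying assumptions.

```python
def match_template(template: str, recognized: str):
    """Return (literal_text, transformed_text) on a match, else None."""
    tokens = template.split()
    words = recognized.split()
    if len(tokens) != len(words):
        return None
    literal, transformed = [], []
    for token, word in zip(tokens, words):
        heard, separator, output = token.partition(":")
        if word != heard:
            return None
        literal.append(word)
        # Only tokens written as heard:output change the output sentence.
        transformed.append(output if separator else word)
    return " ".join(literal), " ".join(transformed)


print(match_template("mais:mets le volume à 10 %", "mais le volume à 10 %"))
```

Keeping both sentences means the STT error stays visible in the literal text while downstream consumers see the corrected wording.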

I could see adding this to hassil, but we need to make a clear case for it. I don't want to mask STT errors, but we also want to be robust to them.

tetele commented 11 months ago

Another example of the same issue, this time in German https://github.com/home-assistant/intents/pull/1373#issuecomment-1673842039

Kelesis commented 8 months ago

For information, another similar issue affects numbers (1 or one), especially for ranges of numbers. I have this issue in French, but it might be the same for English: I say "Chronomètre 1 minute" ("Time 1 minute") and the returned text is "Chronomètre une minute" ("Time one minute"), so the range doesn't work. The workaround is to create a specific sentence, but if I want a time like x hours x minutes x seconds, I have to define all possible combinations (1xx, x1x, xx1, xxx, ...), which is not very nice 😅 For all other numbers the returned text is made of digits 0123456789 and works as expected. Maybe when expecting a range, the template matcher could accept written numbers? A lot of work :'(

Edit: I finally found a better workaround using only one sentence, based on the default value defined by the slot.

[Screenshot: Screenshot_20231112-154629]

tetele commented 8 months ago

@Kelesis that specific problem regarding numbers has been addressed and will be included in the following releases

X-Ryl669 commented 8 months ago

How hard would it be to match intents not by their text but by their SOUNDEX code or an equivalent algorithm? The idea would be to:

While building the intent possibility tree: convert each intent to SOUNDEX or a sequence of phonemes
Convert the STT output to SOUNDEX (or a sequence of phonemes) too
Compute the similarity between the latter and each node of the tree, ranking the best matches first (maybe even doing a Levenshtein search on the tree, so we can stop searching all intents after a given number of insertion/substitution errors)
Drop any match that is below a given matching threshold
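The steps above can be sketched roughly as follows. This is a simplification under several assumptions: intents are a flat list instead of a tree, similarity comes from `difflib.SequenceMatcher` instead of a Levenshtein search, and the 0.8 threshold is arbitrary. Classic Soundex was designed for English names, so the example uses English homophones; mets and mais would get different codes.

```python
from difflib import SequenceMatcher

_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = _CODES.get(ch, "")
        if code and code != previous:
            key += code
        previous = code  # vowels reset the previous code
    return (key + "000")[:4]


def phonetic_key(sentence: str) -> str:
    return " ".join(soundex(word) for word in sentence.split())


def best_intent(stt_text: str, intents: list[str], threshold: float = 0.8):
    """Rank intents by similarity of phonetic keys; drop weak matches."""
    heard = phonetic_key(stt_text)
    scored = [(SequenceMatcher(None, heard, phonetic_key(i)).ratio(), i)
              for i in intents]
    score, intent = max(scored)
    return intent if score >= threshold else None


INTENTS = ["turn on the heater", "what is the weather"]
print(best_intent("what is the whether", INTENTS))
```

Here "whether" and "weather" share the code W360, so the misheard sentence still maps to the weather intent, while unrelated input falls below the threshold.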

What do you think?

X-Ryl669 commented 8 months ago

stt_error_mais: "mets" or stt_mais: "(mets|...|...)" if there are other common errors (replace the dots with the other errors)

This is not desirable since it creates exponential growth in the number of potential sentences. In French (and probably other languages) there are multiple spellings for the same sound (like "Ouvrer / Ouvrez / Ouvré / Ouvrés / Ouvrée / Ouvrait / Ouvraient / ..."), so a simple sentence like "Ouvrez les volets roulants" could be written as "Ouvrer les volets roulants" (which is perfectly correct grammatically and semantically) or even "Ouvre haie lait veau lé roue lent" (incorrect grammatically and semantically). The STT engine can't decide between the former and the latter since there's no context, so it's perfectly right to choose either one (and it's 100% correct in doing so). So it's wrong to blame STT here.

In YAML, you can't list all the possibilities, and it would be impossible to match against them even if you did.
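To put a number on that growth, here is a small count for the four-word sentence above. Only the "ouvrez" spellings come from the list given earlier; the three alternatives per remaining word are illustrative assumptions, and real lists would be longer.

```python
from itertools import product
from math import prod

# Homophone spellings per word. Entries other than "ouvrez" are assumed.
SPELLINGS = {
    "ouvrez": ["ouvrer", "ouvrez", "ouvré", "ouvrés", "ouvrée",
               "ouvrait", "ouvraient"],
    "les": ["les", "lait", "lé"],
    "volets": ["volets", "volet", "veau lé"],
    "roulants": ["roulants", "roulant", "roue lent"],
}


def count_variants(sentence: str) -> int:
    """Number of spellings: the product of the per-word alternatives."""
    return prod(len(SPELLINGS.get(w, [w])) for w in sentence.split())


def iter_variants(sentence: str):
    """Enumerate every combination of per-word spellings."""
    options = [SPELLINGS.get(w, [w]) for w in sentence.split()]
    for combination in product(*options):
        yield " ".join(combination)


print(count_variants("ouvrez les volets roulants"))  # 7 * 3 * 3 * 3 = 189
```

Each additional word multiplies the total, so even short alternative lists make exhaustive enumeration in sentence files impractical.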

tetele commented 8 months ago

How hard would it be to match intents not by their text but by their SOUNDEX code or an equivalent algorithm?

That sounds a lot like this suggestion, doesn't it?

X-Ryl669 commented 8 months ago

Exactly like this. Thanks for linking it!

synesthesiam commented 8 months ago

The discussion @tetele linked has more info, but in short the plan is to have HA attempt matches first without and then with fuzzy recognition enabled in hassil. This is happening in text, though, so it's not as ideal as using something like SOUNDEX. However, we need to support many more languages than just English.

X-Ryl669 commented 6 months ago

Hey @synesthesiam, please have a look at my tinkering here and more specifically the tests (run with hatch run test:pytest -s) for example usage and the tests for what it's able to match.

I've used Epitran to support many languages (close to a hundred) for G2P and implemented fuzzy intent matching on top, based on an IPA mapping. The intents are built using a tree where you have either a simple Basic node (simple text that must be present), an Optional node (text that can be missing), a greedy Parametric node (a value, like forty two) or an Alternative node (some text or some other text).

I think it should more or less match the HA intent types that exist. Yet, it's able to match sentences like Fermée le veau les for an intent expecting Fermez les volets (which a Levenshtein algorithm can't easily match in textual space), because it converts both to IPA strings first and then does a kind of Levenshtein match in IPA space.
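A small sketch of why comparing in IPA space helps. It is not the linked project's code: the IPA strings below are hand-written approximations rather than Epitran output, and `phonetic_distance` is an illustrative helper that simply ignores word boundaries.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def phonetic_distance(ipa_a: str, ipa_b: str) -> int:
    """Compare phoneme strings without spaces, since STT may split words differently."""
    return levenshtein(ipa_a.replace(" ", ""), ipa_b.replace(" ", ""))


TEXT_HEARD, TEXT_EXPECTED = "fermée le veau les", "fermez les volets"
# Approximate transcriptions, written by hand for this example.
IPA_HEARD, IPA_EXPECTED = "fɛʁme lə vo le", "fɛʁme le vɔlɛ"

print(levenshtein(TEXT_HEARD, TEXT_EXPECTED))
print(phonetic_distance(IPA_HEARD, IPA_EXPECTED))
```

With these transcriptions the phoneme strings differ only in a few vowels, while the spellings differ in many more characters, which is the gap the IPA-based matcher exploits.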

home-assistant / intents

Handle homophones (similar sounding words) #1493