The option we explored for triggering speech is hand up/down detection through vision.
We tried using both the 2D and the 3D skeleton. Even though the 3D approach is more robust (#246), vision alone has important limitations. First, it assumes the person is always in the FOV, which cannot be guaranteed in an HRI scenario. Moreover, in a TUG scenario, once the person starts walking, the robot focuses on the lower limbs to extract motion metrics. Given the narrow FOV of the camera, observing the whole skeleton would require a considerably large robot-person distance.
Here is a list of options that came to mind and that we might want to explore.
1. Combining vision and sound
This option might overcome the limitations of vision and sound used alone.
If we used a predefined vocal command alone to trigger the speech (for example "Hey R1"), the system would either have to be always listening, making it very sensitive to noise, or listen for a predefined amount of time, which is unrelated to the length of the sentence and might thus turn out too long or too short.
One solution could be to use the vocal command to focus the robot's attention on the upper limbs and the hand up/down detection to provide the start/stop. Another solution could be to use the vocal command to provide the start and the hand-down detection to provide the stop (see the sketch after the list below).
Limitations:
- this might be complicated both for the user (the interaction might not feel fluid) and for the software (it might not be straightforward to synchronize sound and vision)
- the user needs to be equipped with a microphone
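As a rough illustration of the second solution, here is a minimal state-machine sketch. All function names (`wake_word_detected`, `hand_is_down`, `start_recording`, `stop_recording`) are placeholders for the existing sound and vision modules, not real APIs:

```python
# Sketch: the vocal command opens the listening window,
# the hand-down detection closes it.
import time

IDLE, LISTENING = range(2)


def speech_trigger_loop(wake_word_detected, hand_is_down,
                        start_recording, stop_recording, period=0.05):
    state = IDLE
    while True:
        if state == IDLE and wake_word_detected():
            start_recording()          # vocal command gives the start
            state = LISTENING
        elif state == LISTENING and hand_is_down():
            stop_recording()           # hand-down detection gives the stop
            state = IDLE
        time.sleep(period)
```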
2. Equipping the user with a button
The user has to keep a button pressed while asking questions. A possible device could be this, which can be connected directly to the Wi-Fi and exposes a REST API. Pressing the button could also trigger additional features, such as person following (a rough sketch follows below).
Limitations: the user needs to be equipped with a microphone
Advantages: robustness
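A minimal push-to-talk sketch polling the button over its REST API; the endpoint URL and the `pressed` JSON field are assumptions, since they depend on the device we would actually pick:

```python
import time
import requests

BUTTON_URL = "http://button.local/api/state"   # hypothetical endpoint


def is_pressed():
    try:
        return bool(requests.get(BUTTON_URL, timeout=0.5).json().get("pressed"))
    except (requests.RequestException, ValueError):
        return False                   # treat errors as "not pressed"


def push_to_talk(start_recording, stop_recording, period=0.1):
    recording = False
    while True:
        pressed = is_pressed()
        if pressed and not recording:
            start_recording()          # button down: start capturing speech
            recording = True
        elif not pressed and recording:
            stop_recording()           # button released: stop capturing
            recording = False
        time.sleep(period)
```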
3. Developing an app for the phone
The user interacts with the robot through the phone to ask questions. This would allow us both to manage the start/stop and to remove the external microphone. We could use yarp.js (a possible robot-side sketch follows below).
Limitations: it could be difficult to use for patients walking with walking aids.
Advantages: we could remove the external microphone
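On the robot side, a sketch of a receiver reading start/stop commands that the phone app (e.g. built with yarp.js) would write onto a YARP port; the port name and the message format are only assumptions:

```python
import yarp

yarp.Network.init()

port = yarp.BufferedPortBottle()
port.open("/speechTrigger/phone:i")   # assumed port name

while True:
    bottle = port.read(True)           # blocking read of the next command
    if bottle is None:
        continue
    cmd = bottle.get(0).asString()
    if cmd == "start":
        print("start capturing speech")    # hook the actual recorder here
    elif cmd == "stop":
        print("stop capturing speech")     # stop and forward the audio
```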
We could develop both options 2 and 3 and use one or the other according to the specific case.
cc @pattacini @vtikha