The option we explored for triggering speech is hand up/down detection through vision.
We tried using both the 2D and the 3D skeleton. Even though the 3D approach is more robust (#246), vision alone has important limitations. First, it assumes the person is always in the FOV, which cannot be guaranteed in an HRI scenario. Moreover, in a TUG scenario, once the person starts walking, the robot focuses on the lower limbs to extract motion metrics. Given the narrow FOV of the camera, observing the whole skeleton would require a considerably large robot-person distance.
Here is a list of options that came to mind and that we might want to explore.
1. Combining vision and sound
This option might overcome the limitations of vision and sound used alone.
If we used a predefined vocal command alone to trigger the speech (for example "Hey R1"), the system would either have to be always listening, making it very sensitive to noise, or listen for a predefined amount of time, which is unrelated to the length of the sentence and might thus turn out too long or too short.
One solution could be to use the vocal command to focus the robot's attention on the upper limbs and the hand up/down detection to provide the start/stop. Another solution could be to use the vocal command to provide the start and the hand-down detection to provide the stop (see the sketch after the list below).
Limitations:
- this might be complicated both for the user (the interaction might not feel fluid) and for the software (it might not be straightforward to synchronize sound and vision)
- the user needs to be equipped with a microphone
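As a rough illustration of the second solution, here is a minimal state-machine sketch. All function names (`wake_word_detected`, `hand_is_down`, `start_recording`, `stop_recording`) are placeholders for the existing sound and vision modules, not real APIs:

```python
# Sketch: the vocal command opens the listening window,
# the hand-down detection closes it.
import time

IDLE, LISTENING = range(2)


def speech_trigger_loop(wake_word_detected, hand_is_down,
                        start_recording, stop_recording, period=0.05):
    state = IDLE
    while True:
        if state == IDLE and wake_word_detected():
            start_recording()          # vocal command gives the start
            state = LISTENING
        elif state == LISTENING and hand_is_down():
            stop_recording()           # hand-down detection gives the stop
            state = IDLE
        time.sleep(period)
```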
2. Equipping the user with a button
The user has to keep a button pressed while asking questions. A possible device could be this, which can be connected directly to the Wi-Fi and exposes a REST API. Pressing the button could also trigger additional features, such as person following (a rough sketch follows below).
Limitations: the user needs to be equipped with a microphone
Advantages: robustness
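A minimal push-to-talk sketch polling the button over its REST API; the endpoint URL and the `pressed` JSON field are assumptions, since they depend on the device we would actually pick:

```python
import time
import requests

BUTTON_URL = "http://button.local/api/state"   # hypothetical endpoint


def is_pressed():
    try:
        return bool(requests.get(BUTTON_URL, timeout=0.5).json().get("pressed"))
    except (requests.RequestException, ValueError):
        return False                   # treat errors as "not pressed"


def push_to_talk(start_recording, stop_recording, period=0.1):
    recording = False
    while True:
        pressed = is_pressed()
        if pressed and not recording:
            start_recording()          # button down: start capturing speech
            recording = True
        elif not pressed and recording:
            stop_recording()           # button released: stop capturing
            recording = False
        time.sleep(period)
```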
3. Developing an app for the phone
The user interacts with the robot through the phone to ask questions. This would allow us both to manage the start/stop and to remove the external microphone. We could use yarp.js (a possible robot-side sketch follows below).
Limitations: it could be difficult to use for patients walking with walking aids.
Advantages: we could remove the external microphone
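On the robot side, a sketch of a receiver reading start/stop commands that the phone app (e.g. built with yarp.js) would write onto a YARP port; the port name and the message format are only assumptions:

```python
import yarp

yarp.Network.init()

port = yarp.BufferedPortBottle()
port.open("/speechTrigger/phone:i")   # assumed port name

while True:
    bottle = port.read(True)           # blocking read of the next command
    if bottle is None:
        continue
    cmd = bottle.get(0).asString()
    if cmd == "start":
        print("start capturing speech")    # hook the actual recorder here
    elif cmd == "stop":
        print("stop capturing speech")     # stop and forward the audio
```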
We could develop both options 2 and 3 and use one or the other according to the specific case.
cc @pattacini @vtikha