phetsims / ratio-and-proportion

"Ratio and Proportion" is an educational simulation in HTML5, by PhET Interactive Simulations.
GNU General Public License v3.0

Implement handtracking for left and right hand vertical positions using MediaPipe #431

Closed. brettfiedler closed this issue 2 years ago.

brettfiedler commented 2 years ago

We would like to implement simple hand tracking that maps the left and right hand vertical positions in Ratio and Proportion to the vertical heights of the hands in the camera's detection window, using the MediaPipe machine learning library and a device's camera. @samreid implemented this approach as a demo for Build an Atom (https://github.com/phetsims/tangible/issues/7#issuecomment-936658746).
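As a rough sketch of the intended mapping (the names, value range, and y-axis inversion below are assumptions for illustration, not the sim's actual code), a normalized vertical position from the detection window could be converted to a sim hand position like this:

```typescript
// Hypothetical sketch: convert a hand's normalized vertical position in the camera's
// detection window (y = 0 at the top of the frame, y = 1 at the bottom) into a hand
// position in the sim's value range. Names and the range are assumed.
function detectionYToHandValue( normalizedY: number, rangeMin = 0, rangeMax = 1 ): number {

  // Invert, since moving a hand up in front of the camera should move the sim hand up.
  const inverted = 1 - normalizedY;

  // Clamp to guard against detector values slightly outside [0, 1].
  const clamped = Math.max( 0, Math.min( 1, inverted ) );

  return rangeMin + clamped * ( rangeMax - rangeMin );
}
```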

MediaPipe's page (linked below) has Hand tracking as well as some other cool tracking abilities (possibly for our future consideration... the Pose tracking looks very interesting...)

https://google.github.io/mediapipe/

I highly recommend checking out the demo to see the limitations for left/right hand detection, how quickly your hands can move, what hand shapes are supported, and how close your hands can be and still be differentiated.

We'll need to decide how the sim handles loss of detection/occlusion, but whatever is simplest for the first iteration will be best.

brettfiedler commented 2 years ago

Updated text above! Let me know if you would like to meet to chat about anything @zepumph

zepumph commented 2 years ago

I had some good luck with this script. After a bit of investigation, it will be easy to add to the sim, and it is really responsive, but I still have questions about how to bring it to production. I can't figure out how to host all of the files locally (e.g., by including them as preloads), so for research use we just have to expect them to load over the network. It's a bit less than 20 MB each load, so it isn't negligible. I'll start by adding it to a dev version behind a query parameter.
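A minimal sketch of what loading from the network behind a query parameter could look like (the parameter name mediaPipe matches the one mentioned later in this thread; the CDN URL and the script-injection approach are assumptions, not the sim's actual preload mechanism):

```typescript
// Hypothetical sketch: only fetch the MediaPipe Hands bundle from the CDN when
// the ?mediaPipe query parameter is present, so normal runs pay no network cost.
function loadMediaPipeIfRequested(): Promise<void> | null {
  const params = new URLSearchParams( window.location.search );
  if ( !params.has( 'mediaPipe' ) ) {
    return null; // feature not requested; skip the ~20 MB download
  }

  return new Promise( ( resolve, reject ) => {
    const script = document.createElement( 'script' );

    // Assumed CDN location for the classic MediaPipe Hands JS solution.
    script.src = 'https://cdn.jsdelivr.net/npm/@mediapipe/hands/hands.js';
    script.onload = () => resolve();
    script.onerror = () => reject( new Error( 'Failed to load MediaPipe Hands' ) );
    document.head.appendChild( script );
  } );
}
```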

zepumph commented 2 years ago

I have a prototype working. Here are some questions and next steps:

Here is the patch to use the MediaPipe types in Ratio and Proportion. I can't commit this because it isn't hidden behind a query parameter just yet.

```diff
Index: ratio-and-proportion_en.html
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/ratio-and-proportion_en.html b/ratio-and-proportion_en.html
--- a/ratio-and-proportion_en.html	(revision df990376fb927d060ca3b0f382d835bdfcd2df07)
+++ b/ratio-and-proportion_en.html	(date 1646438680693)
@@ -9,10 +9,15 @@
   ratio-and-proportion
+
+
+
+
+
```
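For context, a minimal sketch of how the MediaPipe Hands JS solution is typically wired up once its script is available. The prototype presumably loads the library as globals via the script tags added above; the npm-style imports, video element, and handler body here are assumptions to keep the sketch self-contained.

```typescript
import { Hands } from '@mediapipe/hands';
import { Camera } from '@mediapipe/camera_utils';

// Assumed <video> element used as the camera source (not shown in the patch above).
const videoElement = document.createElement( 'video' );

// Construct the detector, resolving its model/wasm assets from the CDN package.
const hands = new Hands( {
  locateFile: file => `https://cdn.jsdelivr.net/npm/@mediapipe/hands/${file}`
} );

hands.setOptions( {
  maxNumHands: 2,
  minDetectionConfidence: 0.5,
  minTrackingConfidence: 0.5
} );

hands.onResults( results => {
  // results.multiHandLandmarks: one array of 21 normalized {x, y, z} landmarks per detected hand.
  // results.multiHandedness: per-hand 'Left'/'Right' labels (relative to the mirrored camera image).
  // A sim-specific step (assumed, not shown here) would map each hand's landmark y
  // onto the corresponding ratio hand's vertical position.
} );

// Pump camera frames into the detector.
const camera = new Camera( videoElement, {
  onFrame: async () => { await hands.send( { image: videoElement } ); },
  width: 640,
  height: 480
} );
camera.start();
```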
zepumph commented 2 years ago

@terracoda and I had a couple of fun ideas about this today:

samreid commented 2 years ago

> What if open hands means you can move them, and then you close the hands to "let go" of the ratio? This would give an end drag Voicing response and a let-go sound.

Have you discussed the opposite? It may be more natural to grab the hands by grasping, then release them by releasing.

brettfiedler commented 2 years ago

To discuss:

brettfiedler commented 2 years ago

re: Tangible Input Tolerance widening for Quadrilateral: https://github.com/phetsims/quadrilateral/issues/116

zepumph commented 2 years ago

Meeting notes:

NEXT STEPS:

brettfiedler commented 2 years ago

@terracoda , would you have a moment before next Wednesday (4/13) to try out the hand tracking with Voicing and comment a bit on how it feels compared to its intended design? (if not, feel free to unassign)

To test, open the latest on phettest and append the mediaPipe query parameter. You may also find showVideo helpful, but it may reduce performance on your machine.

brettfiedler commented 2 years ago

Related to Quadrilateral issue https://github.com/phetsims/quadrilateral/issues/116, we've discussed smoothing out the haptic experience when tangibles or computer vision are used with simulations that have tolerances (intervals around a target value that provide modal feedback). One way to accomplish this is to add "forgiveness" once a user achieves the target state, making it more difficult to break out of the state once they have entered it. This approach balances the need for precision to achieve the target state (necessary, especially for the current sims focused on mathematical concepts) against the need to account for human perception and mobility in dynamic movements. It also aims to address feedback overload near target values, where feedback (e.g., the in-proportion sound) can be triggered multiple times in a short time window.

For Ratio and Proportion, we wish this to be tied only to the case where the input device is a tangible (driven by microcontroller or similar) or computer vision (e.g., MediaPipe or OpenCV).

Target State: In Proportion

In general, we wish to keep the simulation's default, more precise (smaller) tolerance interval in every case where the simulation is not "in proportion" (the target state). If the simulation is in the target state, the tolerance interval should increase by some factor. Precision is still important, and the factor should not be so large that there is significant play in the hands while in the target state. This may require some small iteration.
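A minimal sketch of this "forgiveness" idea (the constant names and the factor value are assumptions to be tuned, not the sim's actual values):

```typescript
// Hypothetical sketch of hysteresis around the in-proportion state: use the
// default tolerance to enter the state, and a widened tolerance to leave it.
const DEFAULT_TOLERANCE = 0.05;   // assumed default interval around the target ratio
const FORGIVENESS_FACTOR = 2;     // assumed widening factor, to be iterated on

function isInProportion( currentRatio: number, targetRatio: number, wasInProportion: boolean ): boolean {
  const tolerance = wasInProportion ? DEFAULT_TOLERANCE * FORGIVENESS_FACTOR : DEFAULT_TOLERANCE;
  return Math.abs( currentRatio - targetRatio ) <= tolerance;
}
```

Per the constraint above, only the tangible and computer vision input paths would apply the widened factor; other input types would keep the default interval.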

Dynamic Target State: Moving in Proportion

In Ratio and Proportion, the dynamic, velocity-gated feedback already accounts for imprecise human movement and is sufficiently large to maintain "moving in proportion" feedback.

@zepumph noted that there are a few cases that may require special attention when the hands enter or leave detection (i.e., when interaction begins or ends), in order to decide which interval to use for the new hand positions:

Start interaction tolerance values: (Pre-interaction -> Start of interaction; In Proportion (IP), Out of Proportion (OP))
  1. IP1 -> IP1: Standard interval (no factor)
  2. IP1 -> IP2: Standard interval (no factor)
  3. OP -> OP: Standard interval (no factor)
  4. IP1 -> OP: Standard interval (no factor)
  5. OP -> IP(any): Standard interval (no factor)
End interaction tolerance values: (Interacting -> End of interaction; In Proportion (IP), Out of Proportion (OP))
  1. IP1 -> IP1: Standard interval (no factor)
  2. IP1 -> IP2: Standard interval (no factor)
  3. OP -> OP: Standard interval (no factor)
  4. IP1 -> OP: Standard interval (no factor)
  5. OP -> IP(any): Standard interval (no factor)

For clarity, every tolerance interval biases the newest state of the sim given where the hands are. Beginning and ending an interaction with different tolerance intervals may introduce some strange behavior, but there is also likely to be a lot of movement as participants enter and leave the detection window anyway, as they orient their hands and reorient their movements to align with the simulation. Feedback at these moments likely will not need to be precise; it's more important that the experience during continuous detection feel smooth. In other words, I don't think anyone is going to move carefully enough or remember exactly where their hands were the last time they were interacting with the simulation, even if we can "release the hands". How would they put their hands back in the exact same position to resume?

terracoda commented 2 years ago

What is the query parameter for MediaPipe?

terracoda commented 2 years ago

@zepumph and I discussed on Friday some ideas to improve the Voicing. Connecting to https://github.com/phetsims/ratio-and-proportion/issues/454#issuecomment-1100222835

terracoda commented 2 years ago

Originally, we thought that the hand tracking experience would be very similar to the Voicing experience with mouse interaction (non-discrete input), but for hand tracking there is no end drag event. See https://github.com/phetsims/ratio-and-proportion/issues/454#issuecomment-1100222835 for active hand gestures that might make the Voiced experience with hand tracking more flexible, controllable, and inclusive.

brettfiedler commented 2 years ago

After chatting with EM, it's clear we need a good understanding of the publication possibilities with MediaPipe before we can progress too far.

I imagine the initial options are something like:
  1. What we have right now, with a required internet connection on load and the MediaPipe load dependent on the mediaPipe query parameter.
  2. A larger download, for offline access?

Following that, it'd be great to have a discussion (with all interested parties) to determine what kinds of general gesture support we can build for PhET sims, beginning with Ratio and Proportion as a case study. By general, we mean moving focus and interacting with common UI components that aren't necessarily free moving (e.g., hands in RaP, balloons in BASE, magnet in Faraday's Law). We'll plan a meeting to discuss this when we determine if we can publish sims with MediaPipe.

zepumph commented 2 years ago

I will get to this on Wednesday

zepumph commented 2 years ago

Sorry for the delay here. To summarize the aspects of publishing a sim with MediaPipe: we can definitely move forward with requiring an internet connection. Over in https://github.com/phetsims/tangible/issues/9#issuecomment-1107072513 I again tried to get these files delivered offline, but failed. So I feel the best path forward accepts that loading MediaPipe will download ~20 MB of data files each run, which adds ~5 seconds to the load time.

I still don't think fully offline access is impossible, but the best path forward there would be to bring in a senior dev for assistance (I think JO would be best). Within an hour of pairing with him we could determine a path forward, or confirm that we are at an impasse.

Please note that the ~20 MB has to come from somewhere, whether it is bundled into the sim or retrieved from MediaPipe's CDN each time the sim runs.

My other worry about relying on an internet connection is that we are tied to MediaPipe's API and decision-making. There is no versioning on any of the files we load, so I have no confidence that those links will still serve the same files in three years. That is the main motivation for bundling the files within the sim.

> Following that, it'd be great to have a discussion (with all interested parties) to determine what kinds of general gesture support we can build for PhET sims, beginning with Ratio and Proportion as a case study. By general, we mean moving focus and interacting with common UI components that aren't necessarily free moving (e.g., hands in RaP, balloons in BASE, magnet in Faraday's Law). We'll plan a meeting to discuss this when we determine if we can publish sims with MediaPipe.

Yes I agree. With minimal effort I think that we could build this control into scenery proper based on the current mouse/touch support, and then add in custom controls as we want to (like for the hands). In my experience so far, I have complete trust in MediaPipe's ability to detect pretty complicated positions for hands that we can map into a combination of general controls and sim-specific interactions.

Let me know if you want more thoughts about this.

terracoda commented 2 years ago

The meeting should include @emily-phet, right?

brettfiedler commented 2 years ago

Thanks @zepumph!

Yes, let's arrange a meeting with the 4 of us (@zepumph, @terracoda, @emily-phet, and me) to discuss. I'll put out the feelers for a common time for gesture discussion!

I think pairing with JO (or whoever is needed) is definitely a good use of time to try for offline access. I agree that versioning (for a library explicitly labeled as "alpha") is likely to be a concern down the line.

@emily-phet, do you agree?

emily-phet commented 2 years ago

@BLFiedler @zepumph Yes, I agree with Brett regarding @zepumph's suggestion: pairing with a senior dev should be one of the next steps.

If it turns out there is no way around a sim with the kind of input we're talking about here (computer vision tracking naturalistic hand movement) weighing in at ~20 MB, then there are some follow-up questions that need to be addressed.

The issues with larger file size include:

Some follow-up questions include:

I think the worst-case scenario is the following: if there is no way to get the file size of a sim using MediaPipe into the realm of what is typical for a larger PhET sim, AND there is no reasonable mechanism on the PhET website for someone to choose between two versions of a sim, then we may put a lot of work into setting up MediaPipe as a standard PhET feature but be unable to provide easy, intuitive access to sims with that feature, which would limit or completely stall uptake of the feature (by PhET broadly and by users).

zepumph commented 2 years ago

We haven't posted here in a while, even though we all met just after the most recent comment. I have completed https://github.com/phetsims/tangible/issues/9, and MediaPipe is now entirely offline and bundled with the sim (please note https://github.com/phetsims/ratio-and-proportion/issues/464). I think we need to come to a consensus about how to move forward with bringing Ratio and Proportion to production with MediaPipe. I don't think there is anything else left for this issue. I'll assign @emily-phet to prioritize within her timeline for the summer.

zepumph commented 2 years ago

Notes from a naming brainstorm in status meeting:

From EM/TS/BF: Brainstorming a name for the feature where computer vision tracks hand gestures. We've been using "hand tracking", but that is actually ambiguous. We need something for the website, etc., that is understandable to teachers/students and ideally can be used on the design and development side as well.

SR: I recommend against tying our name to the implementation library (such as MediaPipe).
SR: It may be preferable to be more general (like Computer Vision or Camera Input) rather than more specific (like Hand Tracking) in case we expand capabilities later. (Unless we want to have one product around "hands" and are happy naming another, separate feature later if we have other vision-based solutions.) We also already have a category called "Alternative Input"; this seems like a different kind of alternative input, so maybe create a name that parallels that?

@Sam - this makes a lot of sense to me now that you mention it. Maybe "Computer Vision" or something that indicates "camera needed" is most important, and second is what the camera is tracking. So something like "Computer Vision - Body" and later "Computer Vision - Objects" etc., which could be flexible for what is being tracked.

emily-phet commented 2 years ago

So, I've refreshed my memory and I think it should be one of the following:

Computer vision is the technical term, but I wonder if "Camera Tracking" might make more sense to a layperson, and also result in better outcomes when translated. Adding the specific tracked object may be needed so that people don't get confused and think the camera is just making use of everything; or it may introduce specificity that is not useful until we have other things being tracked...

I think I'll bring up just these options at status and get a read on what issues/preferences come up.

terracoda commented 2 years ago

I think we need a final query parameter for publication.

terracoda commented 2 years ago

Was there a favorite determined in a recent Status meeting?

emily-phet commented 2 years ago

> Was there a favorite determined in a recent Status meeting?

Status was cancelled last week, so I wasn't able to get input on it. I plan to bring it up next status meeting. If you have a preference, please feel free to weigh in here!

terracoda commented 2 years ago

I like "Camera Tracking" for the same reasons that you like it.

brettfiedler commented 2 years ago

From Status today:

From EM: Which name should we use for the computer vision feature that is tracking hand gestures in Ratio and Proportion?

- "Computer Vision: Hands" (or just "Computer Vision") **
- "Camera Tracking: Hands" (or just "Camera Tracking") *
- "Hand Tracking" **
- "Camera Input: Hands" *****

Computer vision is the technical term, but I wonder if "Camera Tracking" might make more sense to a layperson, and also result in better outcomes when translated.

SR: What is the context where this will be described? On a website simulation filter? In a research article?

JG: I noticed UltraLeap and other AR companies call it "Hand Tracking"; would that fit for us too? 👍
BF: Depends on whether we go forward with something like OpenCV and want to differentiate or lump them together.

JG: Maybe we might have "Hand Tracking" and "Marker Tracking" separately.
JG: For things driven with MediaPipe, maybe that should be in the title? ("MediaPipe Hands" is actually what Google calls it.) I think it belongs in the description for sure, and some titles could include a longer description, e.g., "Hand Tracking (MediaPipe)".

JB: For me, “Hand Tracking” sounds quite clear and much less frightening than the other choices. +1 I have seen “Hand Tracking” used in several VR and AR APIs

zepumph commented 2 years ago

@terracoda, @emily-phet, and @zepumph really like "Camera Input: Hands".

And the query parameter will look like this: ?cameraInput=hands

This way we will be able to support multiple types of camera input, like a future Quadrilateral implementation of OpenCV with ?cameraInput=objectsWithGreenTapeOnThem.

This flexibility means we won't just use ?cameraInput as a flag.
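For illustration, a minimal sketch of treating ?cameraInput as a valued parameter rather than a flag (using the plain URLSearchParams API as an assumption; the sim's actual implementation presumably goes through PhET's query parameter machinery):

```typescript
// Hypothetical sketch: read ?cameraInput=hands as a valued parameter, so that
// additional camera-input modes can be added later without introducing new flags.
type CameraInputMode = 'hands' | null;

function getCameraInputMode(): CameraInputMode {
  const value = new URLSearchParams( window.location.search ).get( 'cameraInput' );

  // Only 'hands' is supported today; future values (e.g. an OpenCV-based mode)
  // would be added to this check.
  return value === 'hands' ? 'hands' : null;
}

// Usage: enable the MediaPipe-based hand input only when requested.
if ( getCameraInputMode() === 'hands' ) {
  // initialize hand tracking here
}
```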

zepumph commented 2 years ago

Alright. The query parameter has been renamed to ?cameraInput=hands. Many code types should likely keep their MediaPipe names, since they are so heavily tied to the implementation. The main goal here was to have a consistent and general public-facing layer. All other work is divided into sub-issues. Closing.