phetsims / ratio-and-proportion

"Ratio and Proportion" is an educational simulation in HTML5, by PhET Interactive Simulations.
GNU General Public License v3.0

Implement handtracking for left and right hand vertical positions using MediaPipe #431

Closed. brettfiedler closed this issue 2 years ago.

brettfiedler commented 2 years ago

We would like to implement simple hand tracking that maps the left and right hand vertical positions in Ratio and Proportion to the vertical heights of the hands in the camera's detection window, using the MediaPipe machine learning library and a device's camera. @samreid implemented this approach as a demo for Build an Atom (https://github.com/phetsims/tangible/issues/7#issuecomment-936658746).
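As a rough sketch of the intended mapping (the names, value range, and y-axis inversion below are assumptions for illustration, not the sim's actual code), a normalized vertical position from the detection window could be converted to a sim hand position like this:

```typescript
// Hypothetical sketch: convert a hand's normalized vertical position in the camera's
// detection window (y = 0 at the top of the frame, y = 1 at the bottom) into a hand
// position in the sim's value range. Names and the range are assumed.
function detectionYToHandValue( normalizedY: number, rangeMin = 0, rangeMax = 1 ): number {

  // Invert, since moving a hand up in front of the camera should move the sim hand up.
  const inverted = 1 - normalizedY;

  // Clamp to guard against detector values slightly outside [0, 1].
  const clamped = Math.max( 0, Math.min( 1, inverted ) );

  return rangeMin + clamped * ( rangeMax - rangeMin );
}
```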

MediaPipe's page (linked below) has Hand tracking as well as some other cool tracking abilities (possibly for our future consideration... the Pose tracking looks very interesting...)

https://google.github.io/mediapipe/

I highly recommend checking out the demo to see the limitations for left/right hand detection, how quickly your hands can move, what hand shapes are supported, and how close your hands can be and still be differentiated.

We'll need to decide how the sim handles loss of detection/occlusion, but whatever is simplest for the first iteration will be best.

brettfiedler commented 2 years ago

Updated text above! Let me know if you would like to meet to chat about anything @zepumph

zepumph commented 2 years ago

I had some good luck with this script. After a bit of investigation, it will be easy to add to the sim, and it is really responsive, but I still have questions about how to bring it to production. I can't figure out how to host all of the files locally (e.g., by including them as preloads), so for research use we just have to expect them to load over the network. It's a bit less than 20 MB each load, so it isn't negligible. I'll start by adding it to a dev version behind a query parameter.
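A minimal sketch of what loading from the network behind a query parameter could look like (the parameter name mediaPipe matches the one mentioned later in this thread; the CDN URL and the script-injection approach are assumptions, not the sim's actual preload mechanism):

```typescript
// Hypothetical sketch: only fetch the MediaPipe Hands bundle from the CDN when
// the ?mediaPipe query parameter is present, so normal runs pay no network cost.
function loadMediaPipeIfRequested(): Promise<void> | null {
  const params = new URLSearchParams( window.location.search );
  if ( !params.has( 'mediaPipe' ) ) {
    return null; // feature not requested; skip the ~20 MB download
  }

  return new Promise( ( resolve, reject ) => {
    const script = document.createElement( 'script' );

    // Assumed CDN location for the classic MediaPipe Hands JS solution.
    script.src = 'https://cdn.jsdelivr.net/npm/@mediapipe/hands/hands.js';
    script.onload = () => resolve();
    script.onerror = () => reject( new Error( 'Failed to load MediaPipe Hands' ) );
    document.head.appendChild( script );
  } );
}
```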

zepumph commented 2 years ago

I have a prototype working. Here are some questions and next steps:

Here is the patch to use the MediaPipe types in Ratio and Proportion. I can't commit this because it isn't hidden behind a query parameter just yet.

```diff
Index: ratio-and-proportion_en.html
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/ratio-and-proportion_en.html b/ratio-and-proportion_en.html
--- a/ratio-and-proportion_en.html	(revision df990376fb927d060ca3b0f382d835bdfcd2df07)
+++ b/ratio-and-proportion_en.html	(date 1646438680693)
@@ -9,10 +9,15 @@
   ratio-and-proportion
+
+
+
+
+
```
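For context, a minimal sketch of how the MediaPipe Hands JS solution is typically wired up once its script is available. The prototype presumably loads the library as globals via the script tags added above; the npm-style imports, video element, and handler body here are assumptions to keep the sketch self-contained.

```typescript
import { Hands } from '@mediapipe/hands';
import { Camera } from '@mediapipe/camera_utils';

// Assumed <video> element used as the camera source (not shown in the patch above).
const videoElement = document.createElement( 'video' );

// Construct the detector, resolving its model/wasm assets from the CDN package.
const hands = new Hands( {
  locateFile: file => `https://cdn.jsdelivr.net/npm/@mediapipe/hands/${file}`
} );

hands.setOptions( {
  maxNumHands: 2,
  minDetectionConfidence: 0.5,
  minTrackingConfidence: 0.5
} );

hands.onResults( results => {
  // results.multiHandLandmarks: one array of 21 normalized {x, y, z} landmarks per detected hand.
  // results.multiHandedness: per-hand 'Left'/'Right' labels (relative to the mirrored camera image).
  // A sim-specific step (assumed, not shown here) would map each hand's landmark y
  // onto the corresponding ratio hand's vertical position.
} );

// Pump camera frames into the detector.
const camera = new Camera( videoElement, {
  onFrame: async () => { await hands.send( { image: videoElement } ); },
  width: 640,
  height: 480
} );
camera.start();
```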
zepumph commented 2 years ago

@terracoda and I had a couple of fun ideas about this today:

samreid commented 2 years ago

> What if open hands means you can move them, and then you close the hands to "let go" of the ratio? This would give an end drag Voicing response and a let-go sound.

Have you discussed the opposite? It may be more natural to grab the hands by grasping, then release them by releasing.

brettfiedler commented 2 years ago

To discuss:

brettfiedler commented 2 years ago

re: Tangible Input Tolerance widening for Quadrilateral: https://github.com/phetsims/quadrilateral/issues/116

zepumph commented 2 years ago

Meeting notes:

NEXT STEPS:

brettfiedler commented 2 years ago

@terracoda , would you have a moment before next Wednesday (4/13) to try out the hand tracking with Voicing and comment a bit on how it feels compared to its intended design? (if not, feel free to unassign)

To test, open the latest on phettest and append the mediaPipe query parameter. You may also find showVideo helpful, but it may reduce performance on your machine.

brettfiedler commented 2 years ago

Related to Quadrilateral issue https://github.com/phetsims/quadrilateral/issues/116, we've discussed smoothing out the haptic experience when tangibles or computer vision are used with simulations that have tolerances (intervals around a target value that provide modal feedback). One way to accomplish this is to add "forgiveness" once a user achieves the target state, making it more difficult to break out of the state once they have entered it. This approach balances the need for precision to achieve the target state (necessary, especially for the current sims focused on mathematical concepts) against the need to account for human perception and mobility in dynamic movements. It also aims to address feedback overload near target values, where feedback (e.g., the in-proportion sound) can be triggered multiple times in a short time window.

For Ratio and Proportion, we wish this to be tied only to the case where the input device is a tangible (driven by microcontroller or similar) or computer vision (e.g., MediaPipe or OpenCV).

Target State: In Proportion

In general, we wish to keep the simulation's default, more precise (smaller) tolerance interval in every case where the simulation is not "in proportion" (the target state). If the simulation is in the target state, the tolerance interval should increase by some factor. Precision is still important, and the factor should not be so large that there is significant play in the hands while in the target state. This may require some small iteration.
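A minimal sketch of this "forgiveness" idea (the constant names and the factor value are assumptions to be tuned, not the sim's actual values):

```typescript
// Hypothetical sketch of hysteresis around the in-proportion state: use the
// default tolerance to enter the state, and a widened tolerance to leave it.
const DEFAULT_TOLERANCE = 0.05;   // assumed default interval around the target ratio
const FORGIVENESS_FACTOR = 2;     // assumed widening factor, to be iterated on

function isInProportion( currentRatio: number, targetRatio: number, wasInProportion: boolean ): boolean {
  const tolerance = wasInProportion ? DEFAULT_TOLERANCE * FORGIVENESS_FACTOR : DEFAULT_TOLERANCE;
  return Math.abs( currentRatio - targetRatio ) <= tolerance;
}
```

Per the constraint above, only the tangible and computer vision input paths would apply the widened factor; other input types would keep the default interval.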

Dynamic Target State: Moving in Proportion

In Ratio and Proportion, the dynamic, velocity-gated feedback already accounts for imprecise human movement and is sufficiently large to maintain "moving in proportion" feedback.

@zepumph noted that there are a few cases that may require special attention when the hands enter or leave detection (i.e., when interaction begins or ends), in order to decide which interval to use for the new hand positions:

Start interaction tolerance values: (Pre-interaction -> Start of interaction; In Proportion (IP), Out of Proportion (OP))
  1. IP1 -> IP1: Standard interval (no factor)
  2. IP1 -> IP2: Standard interval (no factor)
  3. OP -> OP: Standard interval (no factor)
  4. IP1 -> OP: Standard interval (no factor)
  5. OP -> IP(any): Standard interval (no factor)
End interaction tolerance values: (Interacting -> End of interaction; In Proportion (IP), Out of Proportion (OP))
  1. IP1 -> IP1: Standard interval (no factor)
  2. IP1 -> IP2: Standard interval (no factor)
  3. OP -> OP: Standard interval (no factor)
  4. IP1 -> OP: Standard interval (no factor)
  5. OP -> IP(any): Standard interval (no factor)

For clarity, every tolerance interval biases the newest state of the sim given where the hands are. Beginning and ending an interaction with different tolerance intervals may introduce some strange behavior, but there is also likely to be a lot of movement as participants enter and leave the detection window anyway, as they orient their hands and reorient their movements to align with the simulation. Feedback at these moments likely will not need to be precise; it's more important that the experience during continuous detection feel smooth. In other words, I don't think anyone is going to move carefully enough or remember exactly where their hands were the last time they were interacting with the simulation, even if we can "release the hands". How would they put their hands back in the exact same position to resume?

terracoda commented 2 years ago

What is the query parameter for MediaPipe?

terracoda commented 2 years ago

@zepumph and I discussed on Friday some ideas to improve the Voicing. Connecting to https://github.com/phetsims/ratio-and-proportion/issues/454#issuecomment-1100222835

terracoda commented 2 years ago

Originally, we thought that the hand tracking experience would be very similar to the Voicing experience with mouse interaction (non-discrete input), but for hand tracking there is no end drag event. See https://github.com/phetsims/ratio-and-proportion/issues/454#issuecomment-1100222835 for active hand gestures that might make the Voiced experience with hand tracking more flexible, controllable, and inclusive.

brettfiedler commented 2 years ago

After chatting with EM, it's clear we need a good understanding of the publication possibilities with MediaPipe before we can progress too far.

I imagine the initial options are something like:
  1. What we have right now, with a required internet connection on load and the MediaPipe load dependent on the mediaPipe query parameter.
  2. A larger download, for offline access?

Following that, it'd be great to have a discussion (with all interested parties) to determine what kinds of general gesture support we can build for PhET sims, beginning with Ratio and Proportion as a case study. By general, we mean moving focus and interacting with common UI components that aren't necessarily free moving (e.g., hands in RaP, balloons in BASE, magnet in Faraday's Law). We'll plan a meeting to discuss this when we determine if we can publish sims with MediaPipe.

zepumph commented 2 years ago

I will get to this on Wednesday

zepumph commented 2 years ago

Sorry for the delay here. To summarize the aspects of publishing a sim with MediaPipe: we can definitely move forward with requiring an internet connection. Over in https://github.com/phetsims/tangible/issues/9#issuecomment-1107072513 I again tried to get these files delivered offline, but failed. So I feel the best path forward accepts that loading MediaPipe will download ~20 MB of data files each run, which adds ~5 seconds to the load time.

I still don't think fully offline access is impossible, but the best path forward there would be to bring in a senior dev for assistance (I think JO would be best). Within an hour of pairing with him we could determine a path forward, or confirm that we are at an impasse.

Please note that the ~20 MB has to come from somewhere, whether it is bundled into the sim or retrieved from MediaPipe's CDN each time the sim runs.

My other worry about relying on an internet connection is that we are tied to MediaPipe's API and decision-making. There is no versioning on any of the files we load, so I have no confidence that those links will still serve the same files in three years. That is the main motivation for bundling the files within the sim.

> Following that, it'd be great to have a discussion (with all interested parties) to determine what kinds of general gesture support we can build for PhET sims, beginning with Ratio and Proportion as a case study. By general, we mean moving focus and interacting with common UI components that aren't necessarily free moving (e.g., hands in RaP, balloons in BASE, magnet in Faraday's Law). We'll plan a meeting to discuss this when we determine if we can publish sims with MediaPipe.

Yes I agree. With minimal effort I think that we could build this control into scenery proper based on the current mouse/touch support, and then add in custom controls as we want to (like for the hands). In my experience so far, I have complete trust in MediaPipe's ability to detect pretty complicated positions for hands that we can map into a combination of general controls and sim-specific interactions.

Let me know if you want more thoughts about this.

terracoda commented 2 years ago

The meeting should include @emily-phet, right?

brettfiedler commented 2 years ago

Thanks @zepumph!

Yes, let's arrange a meeting with the 4 of us (@zepumph, @terracoda, @emily-phet, and me) to discuss. I'll put out the feelers for a common time for gesture discussion!

I think pairing with JO (or whoever is needed) is definitely a good use of time to try for offline access. I agree that versioning (for a library explicitly labeled as "alpha") is likely to be a concern down the line.

@emily-phet, do you agree?

emily-phet commented 2 years ago

@BLFiedler @zepumph Yes, I agree with Brett regarding @zepumph's suggestion: pairing with a senior dev should be one of the next steps.

If it turns out there is no way around a sim with the kind of input we're talking about here (computer vision tracking naturalistic hand movement) weighing in at ~20 MB, then there are some follow-up questions that need to be addressed.

The issues with larger file size include:

Some follow-up questions include:

I think the worst-case scenario is the following: if there is no way to get the file size of a sim using MediaPipe into the realm of what is typical for a larger PhET sim, AND there is no reasonable mechanism on the PhET website for someone to choose between two versions of a sim, then we may put a lot of work into setting up MediaPipe as a standard PhET feature but be unable to provide easy, intuitive access to sims with that feature, which would limit or completely stall uptake of the feature (by PhET broadly and by users).

zepumph commented 2 years ago

We haven't posted here in a while, even though we all met just after the most recent comment. I have completed https://github.com/phetsims/tangible/issues/9, and MediaPipe is now entirely offline and bundled with the sim (please note https://github.com/phetsims/ratio-and-proportion/issues/464). I think we need to come to a consensus about how to move forward with bringing Ratio and Proportion to production with MediaPipe. I don't think there is anything else left for this issue. I'll assign @emily-phet to prioritize within her timeline for the summer.

zepumph commented 2 years ago

Notes from a naming brainstorm in status meeting:

From EM/TS/BF: Brainstorming a name for the feature where computer vision tracks hand gestures. We've been using "hand tracking", but that is actually ambiguous. We need something for the website, etc., that is understandable to teachers/students and ideally can be used on the design and development side as well.

SR: I recommend against tying our name to the implementation library (such as MediaPipe).
SR: It may be preferable to be more general (like Computer Vision or Camera Input) rather than more specific (like Hand Tracking) in case we expand capabilities later. (Unless we want to have one product around "hands" and are happy naming another, separate feature later if we have other vision-based solutions.) We also already have a category called "Alternative Input"; this seems like a different kind of alternative input, so maybe create a name that parallels that?

@Sam - this makes a lot of sense to me now that you mention it. Maybe "Computer Vision" or something that indicates "camera needed" is most important, and second is what the camera is tracking. So something like "Computer Vision - Body" and later "Computer Vision - Objects" etc., which could be flexible for what is being tracked.

emily-phet commented 2 years ago

So, I've refreshed my memory and I think it should be one of the following:

Computer vision is the technical term, but I wonder if "Camera Tracking" might make more sense to a layperson, and also result in better outcomes when translated. Adding the specific tracked object may be needed so that people don't get confused and think the camera is just making use of everything; or it may introduce specificity that is not useful until we have other things being tracked...

I think I'll bring up just these options at status and get a read on what issues/preferences come up.

terracoda commented 2 years ago

I think we need a final query parameter for publication.

terracoda commented 2 years ago

Was there a favorite determined in a recent Status meeting?

emily-phet commented 2 years ago

> Was there a favorite determined in a recent Status meeting?

Status was cancelled last week, so I wasn't able to get input on it. I plan to bring it up next status meeting. If you have a preference, please feel free to weigh in here!

terracoda commented 2 years ago

I like "Camera Tracking" for the same reasons that you like it.

brettfiedler commented 2 years ago

From Status today:

From EM: Which name should we use for the computer vision feature that is tracking hand gestures in Ratio and Proportion?

- "Computer Vision: Hands" (or just "Computer Vision") **
- "Camera Tracking: Hands" (or just "Camera Tracking") *
- "Hand Tracking" **
- "Camera Input: Hands" *****

Computer vision is the technical term, but I wonder if "Camera Tracking" might make more sense to a layperson, and also result in better outcomes when translated.

SR: What is the context where this will be described? On a website simulation filter? In a research article?

JG: I noticed UltraLeap and other AR companies call it "Hand Tracking"; would that fit for us too? 👍
BF: Depends on whether we go forward with something like OpenCV and want to differentiate or lump them together.

JG: Maybe we might have "Hand Tracking" and "Marker Tracking" separately.
JG: For things driven with MediaPipe, maybe that should be in the title? ("MediaPipe Hands" is actually what Google calls it.) I think it belongs in the description for sure, and some titles could include a longer description, e.g., "Hand Tracking (MediaPipe)".

JB: For me, “Hand Tracking” sounds quite clear and much less frightening than the other choices. +1 I have seen “Hand Tracking” used in several VR and AR APIs

zepumph commented 2 years ago

@terracoda, @emily-phet, and @zepumph really like "Camera Input: Hands".

And the query parameter will look like this: ?cameraInput=hands

This way we will be able to support multiple types of camera input, like a future Quadrilateral implementation of OpenCV with ?cameraInput=objectsWithGreenTapeOnThem.

This flexibility means we won't just use ?cameraInput as a flag.
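For illustration, a minimal sketch of treating ?cameraInput as a valued parameter rather than a flag (using the plain URLSearchParams API as an assumption; the sim's actual implementation presumably goes through PhET's query parameter machinery):

```typescript
// Hypothetical sketch: read ?cameraInput=hands as a valued parameter, so that
// additional camera-input modes can be added later without introducing new flags.
type CameraInputMode = 'hands' | null;

function getCameraInputMode(): CameraInputMode {
  const value = new URLSearchParams( window.location.search ).get( 'cameraInput' );

  // Only 'hands' is supported today; future values (e.g. an OpenCV-based mode)
  // would be added to this check.
  return value === 'hands' ? 'hands' : null;
}

// Usage: enable the MediaPipe-based hand input only when requested.
if ( getCameraInputMode() === 'hands' ) {
  // initialize hand tracking here
}
```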

zepumph commented 2 years ago

Alright. The query parameter has been renamed to ?cameraInput=hands. Many code types should likely keep their MediaPipe names, since they are so heavily tied to the implementation. The main goal here was to have a consistent and general public-facing layer. All other work is divided into sub-issues. Closing.