Updated text above! Let me know if you would like to meet to chat about anything @zepumph
I had some good luck with this script. After doing a bit of investigation, it will be easy to add to the sim, and it is really responsive, but I still have questions about how to bring it to production. I can't seem to figure out how to host all of the files locally (e.g., by including them as preloads). For research use, we just have to expect them to load over a network connection. It's a bit less than 20MB each load, so it isn't negligible. I'll just start by adding it to a dev version under a query parameter.
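A minimal sketch of what that gating could look like, assuming a plain query-parameter check and a script tag injected at runtime; the parameter name and CDN URL are placeholders, not final decisions:

```ts
// Sketch only: load the MediaPipe hands bundle lazily, and only when the query parameter
// is present, so a normal sim load pays no extra network or size cost.
const urlParams = new URLSearchParams( window.location.search );

if ( urlParams.has( 'mediaPipe' ) ) {
  const script = document.createElement( 'script' );

  // Placeholder URL; the real bundle (~20MB of assets overall) would come from a CDN or,
  // ideally, from files bundled with the sim.
  script.src = 'https://cdn.jsdelivr.net/npm/@mediapipe/hands/hands.js';
  script.onload = () => {
    // Initialize hand tracking here once the library is available.
  };
  document.head.appendChild( script );
}
```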
I have a prototype working. Here are some questions and next steps:
Here is the patch to use the MediaPipe types in ratio and proportion. I can't commit this because it isn't hidden behind a query parameter just yet.
@terracoda and I had a couple of fun ideas about this today:
What if open hands mean you can move them, and then you close the hands to "let go" of the ratio? This would give an end-drag Voicing response and a let-go sound.
Have you discussed the opposite? It may be more natural to grab the hands by grasping, then release them by opening the grasp.
To discuss:
re: Tangible Input Tolerance widening for Quadrilateral: https://github.com/phetsims/quadrilateral/issues/116
Meeting notes:
NEXT STEPS:
@terracoda , would you have a moment before next Wednesday (4/13) to try out the hand tracking with Voicing and comment a bit on how it feels compared to its intended design? (if not, feel free to unassign)
To test, open the latest on phettest and append the mediaPipe query parameter. You may also find showVideo helpful, but it may reduce performance on your machine.
Related to Quadrilateral issue https://github.com/phetsims/quadrilateral/issues/116, we've discussed smoothing out the haptic experience when using tangibles or computer vision with simulations that use tolerances (intervals around a target value that provide modal feedback). One way to accomplish this is to add "forgiveness" once a user achieves the target state, making it more difficult to break out of that state once they have entered it. This approach balances the need for precision to achieve the target state (necessary, especially for the current sims focused on mathematical concepts) against the need to account for human perception/mobility in dynamic movements. It also aims to reduce feedback overload near target values, where feedback (e.g., the in-proportion sound) can trigger multiple times in a short time window.
For Ratio and Proportion, we wish this to be tied only to the case where the input device is a tangible (driven by microcontroller or similar) or computer vision (e.g., MediaPipe or OpenCV).
In general, we wish in every case to keep the more precise (smaller) tolerance interval as the simulation's default whenever the simulation is not "in proportion" (the target state). If the simulation is in the target state, the tolerance interval should increase by some factor. Precision is still important, and the factor should not increase so much that there is significant play in the hands while in the target state. This may require some small iteration.
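A hedged sketch of this forgiveness/hysteresis idea, with illustrative names and an illustrative widening factor (not the sim's actual API), applied only when the input source is a tangible or camera, per the note above:

```ts
// Sketch only: the tight tolerance is used to ENTER the in-proportion state, and a widened
// tolerance is used to LEAVE it, so small hand jitter doesn't immediately break the state.
const DEFAULT_TOLERANCE = 0.02;  // illustrative default interval around the target ratio
const WIDENING_FACTOR = 2;       // illustrative; applied only while already in proportion

let inProportion = false;

function updateInProportion( currentRatio: number, targetRatio: number, usingCameraOrTangible: boolean ): boolean {
  const error = Math.abs( currentRatio - targetRatio );

  // Forgiveness applies only to tangible/camera input, and only after the target state is reached.
  const widen = usingCameraOrTangible && inProportion;
  const tolerance = widen ? DEFAULT_TOLERANCE * WIDENING_FACTOR : DEFAULT_TOLERANCE;

  inProportion = error <= tolerance;
  return inProportion;
}
```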
In Ratio and Proportion, the dynamic, velocity-gated feedback already accounts for imprecise human movement and is sufficiently large to maintain "moving in proportion" feedback.
@zepumph noted a few cases that may require special attention when the hands enter or leave detection and interaction begins or ends, in order to decide what interval to use for the new hand positions:
For clarity, every tolerance interval biases the newest state of the sim given where the hands are. Beginning and ending with different tolerance intervals may introduce some strange behavior, but there is also likely to be a lot of movement as participants enter and leave the detection window anyway, as they orient their hands and reorient their movements to align with the simulation. Feedback at these moments likely will not need to be precise; it's more important that the experience during continuous detection feel smooth. In other words, I don't think anyone is going to move carefully enough or remember exactly where their hands were the last time they were interacting with the simulation, even if we can "release the hands". How would they put their hands back in the exact same position to resume?
What is the query parameter for MediaPipe?
@zepumph and I discussed on Friday some ideas to improve the Voicing. Connecting to https://github.com/phetsims/ratio-and-proportion/issues/454#issuecomment-1100222835
Originally, we thought that the hand tracking would be very similar to the Voicing experience with mouse interaction (non-discrete input), but for hand tracking there is no end-drag event. See https://github.com/phetsims/ratio-and-proportion/issues/454#issuecomment-1100222835 for active hand gestures that might make the Voiced experience with hand tracking more flexible, controllable, and inclusive.
After chatting with EM, it's clear we need a good understanding of the publication possibilities with MediaPipe before we can progress too far.
I imagine the initial options are something like: 1) what we have right now, with a required internet connection on load and the MediaPipe load dependent on the mediaPipe query parameter; 2) a large download for offline access?
Following that, it'd be great to have a discussion (with all interested parties) to determine what kinds of general gesture support we can build for PhET sims, beginning with Ratio and Proportion as a case study. By general, we mean moving focus and interacting with common UI components that aren't necessarily free moving (e.g., hands in RaP, balloons in BASE, magnet in Faraday's Law). We'll plan a meeting to discuss this when we determine if we can publish sims with MediaPipe.
I will get to this on Wednesday
Sorry for the delay here. To summarize the aspects of publishing a sim with MediaPipe: we can definitely move forward with this requiring an internet connection. Over in https://github.com/phetsims/tangible/issues/9#issuecomment-1107072513 I again tried to get these files delivered offline, but failed. So I feel the best path is accepting that loading MediaPipe will download ~20MB of data files each run. This adds ~5 seconds to the load time.
I still do not think that fully offline access is impossible, but I think the best path forward there would be to bring in a senior dev for assistance (I think JO would be best). Within one hour of pairing with him we could determine a path forward, or whether there is an impasse.
Please note that the 20MB size has to come from somewhere, whether it is bundled in the sim or retrieved from MediaPipe's CDN each time the sim runs.
Another worry I have about requiring an internet connection is that we are tied to their API and decision-making. There is no versioning on any of the files we use, so I have no confidence that in 3 years the links will still point to the same files. That is the main motivation for getting the files bundled within the sim.
> Following that, it'd be great to have a discussion (with all interested parties) to determine what kinds of general gesture support we can build for PhET sims, beginning with Ratio and Proportion as a case study. By general, we mean moving focus and interacting with common UI components that aren't necessarily free moving (e.g., hands in RaP, balloons in BASE, magnet in Faraday's Law). We'll plan a meeting to discuss this when we determine if we can publish sims with MediaPipe.
Yes I agree. With minimal effort I think that we could build this control into scenery proper based on the current mouse/touch support, and then add in custom controls as we want to (like for the hands). In my experience so far, I have complete trust in MediaPipe's ability to detect pretty complicated positions for hands that we can map into a combination of general controls and sim-specific interactions.
Let me know if you want more thoughts about this.
The meeting should include @emily-phet, right?
Thanks @zepumph!
Yes, let's arrange a meeting with the 4 of us (@zepumph, @terracoda, @emily-phet, and me) to discuss. I'll put out the feelers for a common time for gesture discussion!
I think pairing with JO or whoever is needed is definitely a good use of time to try for offline access. I agree versioning (for a library explicitly indicated as an "alpha") is likely to be of concern down the line.
@emily-phet, do you agree?
@BLFiedler @zepumph Yes, I agree with Brett on @zepumph's suggestion: pairing with a senior dev should be one of the next steps.
If it's the case that there is no way around a sim with the kind of input we're talking about here (computer vision tracking naturalistic hand movement) weighing in at ~20MB, then there are some follow-up questions that need to be addressed.
The issues with larger file size include:
Some follow-up questions include:
I think the worst-case scenario is the following: if there is no way to get the file size for a sim using MediaPipe into the realm of what's typical for a larger PhET sim, AND there is no reasonable mechanism on the PhET website for someone to choose between two versions of a sim, then we may put a lot of work into setting up MediaPipe as a standard PhET feature but be unable to provide easy, intuitive access to sims with that feature, which will limit or completely stall uptake of the feature (by PhET broadly and by users).
We haven't posted here in a while, even though we all had a meeting just after the most recent comment. I have completed https://github.com/phetsims/tangible/issues/9, and MediaPipe is now entirely offline and bundled with the sim (please note https://github.com/phetsims/ratio-and-proportion/issues/464). I think we need to come to a consensus about how to move forward with bringing Ratio and Proportion to production with MediaPipe. I don't think there is anything else left for this issue. I'll assign @emily-phet to prioritize within her timeline for the summer.
Notes from a naming brainstorm in status meeting:
From EM/TS/BF: Brainstorming name for feature where computer vision tracks hand gestures. We’ve been using handtracking, but that is actually ambiguous. Need something for website, etc., that is understandable to teachers/students and ideally can be used on the design and development side as well.
SR: I recommend against tying our name to the implementation library (such as media pipe).
SR: May be preferable to be more general (like Computer Vision or Camera Input) rather than more specific (like Hand Tracking) in case we expand capabilities later. (Unless we want to have one product around “hands” and happy naming another separate feature later if we have other vision-based solutions). We also already have a category called “Alternative Input”--this seems like a different kind of alternative input, maybe create a name that parallels that?
@Sam - this makes a lot of sense to me now that you mention it. Maybe “Computer Vision” or something that indicates “camera needed” is most important, and second is what the camera is tracking. So something like “Computer Vision - Body” and later “Computer Vision - Objects” etc., which could be flexible for what is being tracked.
So, I've refreshed my memory and I think it should be one of the following:
Computer vision is the technical term, but I wonder if "Camera Tracking" might make more sense to a layperson, and also result in better outcomes when translated. The addition of a specific tracked object may be needed so that people don't get confused and think the camera is just making use of everything; or it may introduce specificity that is not useful until we have other things being tracked...
I think I'll bring up just these options at status and get a read on what issues/preferences come up.
I think we need a final query parameter for publication.
Was there a favorite determined in a recent Status meeting?
> Was there a favorite determined in a recent Status meeting?
Status was cancelled last week, so I wasn't able to get input on it. I plan to bring it up next status meeting. If you have a preference, please feel free to weigh in here!
I like "Camera Tracking" for the same reasons that you like it.
From Status today:
From EM: Which name for computer vision feature that is tracking hand gestures in Ratio and Proportion:
- “Computer Vision: Hands” (or just "Computer Vision") **
- “Camera Tracking: Hands” (or just "Camera Tracking") *
- Hand Tracking **
- Camera Input: Hands *****
Computer vision is the technical term, but I wonder if "Camera Tracking" might make more sense to a layperson, and also result in better outcomes when translated.
SR: What is the context where this will be described? On a website simulation filter? In a research article?
JG: I noticed UltraLeap and other AR companies call it “Hand Tracking”, would that fit for us too? 👍
BF: Depends if we go forward with something like OpenCV and want to differentiate/lump together
JG: Maybe we might have “Hand Tracking” and “Marker Tracking” separately.
JG: For things driven with MediaPipe, maybe that should be in the title? (“MediaPipe Hands” - that's actually what Google calls it.) I think probably in the description for sure. Some titles could include a longer description (Hand Tracking (MediaPipe)).
JB: For me, “Hand Tracking” sounds quite clear and much less frightening than the other choices. +1, I have seen “Hand Tracking” used in several VR and AR APIs.
@terracoda, @emily-phet, and @zepumph really like "Camera Input: Hands".
And the query parameter will look like this: ?cameraInput=hands. This way we will be able to support multiple types of input, like a future quad implementation of OpenCV with ?cameraInput=objectsWithGreenTapeOnThem. This flexibility means we won't just use ?cameraInput as a flag.
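For illustration only, reading the parameter as a value rather than a flag might look like the following; the URLSearchParams approach and the non-hands mode name are just stand-ins for whatever the sim's actual query-parameter machinery does:

```ts
// Sketch: treat ?cameraInput=... as a mode selector instead of a boolean flag.
const cameraInput = new URLSearchParams( window.location.search ).get( 'cameraInput' );

if ( cameraInput === 'hands' ) {
  // Start MediaPipe hand tracking.
}
else if ( cameraInput === 'objectsWithGreenTapeOnThem' ) {
  // A possible future OpenCV-based marker-tracking mode, as mentioned above.
}
```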
Alright. The query parameter has been renamed to ?cameraInput=hands. Many code types should likely keep their MediaPipe name, since it is so heavily tied to the implementation. The main goal here was to have a consistent and general public-facing layer. All other work is divided into sub-issues. Closing.
We would like to implement simple hand tracking that maps the vertical positions of the left and right hands in the camera's detection window to the vertical positions of the hands in Ratio and Proportion, using the machine learning library produced by MediaPipe and a device's camera, which @samreid implemented as a demo for Build an Atom (https://github.com/phetsims/tangible/issues/7#issuecomment-936658746).
MediaPipe's page (https://google.github.io/mediapipe/) has the Hand tracking as well as some other cool tracking abilities (possibly for our future consideration... the Pose tracking looks very interesting).
I highly recommend checking out the demo to see the limitations for left/right hand detection, how quickly your hands can move, what hand shapes are supported, and how close your hands can be and still be differentiated.
We'll need to decide how the sim handles loss of detection/occlusion, but whatever is simplest for the first iteration will be best.
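As a rough sketch of that mapping and the simplest loss-of-detection behavior: the result field names follow the MediaPipe Hands JS solution as I understand it and should be verified against the bundled version, and the sim-side setters are hypothetical placeholders.

```ts
// Assumed shape of a MediaPipe Hands result: normalized landmark coordinates in [0,1],
// with y increasing downward. Verify against the actual library version.
type HandResults = {
  multiHandLandmarks: Array<Array<{ x: number; y: number; z: number }>>;
  multiHandedness: Array<{ label: 'Left' | 'Right'; score: number }>;
};

// Hypothetical sim-side setters, standing in for however RaP consumes the positions.
function setLeftHandValue( normalizedHeight: number ): void { /* sim-specific */ }
function setRightHandValue( normalizedHeight: number ): void { /* sim-specific */ }

function onHandResults( results: HandResults ): void {

  // Simplest first iteration for loss of detection/occlusion: hold the last known positions.
  if ( results.multiHandLandmarks.length === 0 ) {
    return;
  }

  results.multiHandLandmarks.forEach( ( landmarks, i ) => {

    // Use the wrist landmark (index 0) and flip y so "up" in camera space is "up" in the sim.
    const verticalPosition = 1 - landmarks[ 0 ].y;

    if ( results.multiHandedness[ i ].label === 'Left' ) {
      setLeftHandValue( verticalPosition );
    }
    else {
      setRightHandValue( verticalPosition );
    }
  } );
}
```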