google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://mediapipe.dev
Apache License 2.0

Questions about 3D Hand Model and Dataset #851

Closed pablovela5620 closed 4 years ago

pablovela5620 commented 4 years ago

Hi all, I had a few questions regarding both the dataset used for training and the model training itself, since a newly released model removes the 2D tflite and leaves only the 3D tflite.

Dataset: the documentation states that around 30k real images were used to train the hand model, as well as synthetic data. In issue #284, @mgyong clarified that only the synthetic dataset was involved in the z-depth prediction.

Questions:

  1. Is the only supervision used to train the z depth based on synthetic data for the latest model? How many synthetic images were used?
  2. Do the real images used for training not contain depth information?
  3. If they do, what was the procedure for collecting this data? depth camera? multiview camera setup?
  4. Is the main goal of this model/dataset to accurately track 2d points, with the small benefit of having z depth that isn't as accurate?
  5. Issue #284 also states that for privacy reasons you would be unable to release the dataset. On that point, would it be possible to release the SYNTHETIC dataset since no privacy issues would arise from this?
  6. The current output of the model involves 21x3 keypoints, handedness, and a confidence for whether a hand is present or not. How was the confidence value supervised? This may be an ignorant question, but were non-hand images fed to the network so that it would learn when a hand is present or not?
  7. Is it possible to output per joint confidences?
  8. What motivated the choice of directly regressing the keypoints vs. regressing a heatmap?
  9. Could you elaborate more on the mixed training schema for using both synthetic and real images to train simultaneously?
  10. Lastly, are there any plans to improve the z depth values?

I greatly appreciate any answers on this topic; I know this is a ton of questions, so even just a few responses would be appreciated. This repo has been incredibly helpful and I am grateful that it has been open-sourced! Thank you again.

mgyong commented 4 years ago

1) Is the only supervision used to train the z depth based on synthetic data for the latest model? How many synthetic images were used?

Yes, 19,122 synthetic images were used.

2) Do the real images used for training not contain depth information?

No. The real images we are using now don't contain any depth information.

3) Is the main goal of this model/dataset to accurately track 2d points, with the small benefit of having z depth that isn't as accurate? Issue #284 also states that for privacy reasons you would be unable to release the dataset. On that point, would it be possible to release the SYNTHETIC dataset since no privacy issues would arise from this?

This is what we have, but I am not sure that it's our goal. We're unable to release the synthetic dataset because of the license for the 3D model we used to build it.

4) The current output of the model involves 21x3 keypoints, handedness, and a confidence for whether a hand is present or not. How was the confidence value supervised? This may be an ignorant question, but were non-hand images fed to the network so that it would learn when a hand is present or not?

Yes, non-hand images were fed to the network so that it learns when a hand is present.

5) Is it possible to output per joint confidences?

It's possible, similar to the approach in the paper "Multi-Task Learning as Multi-Objective Optimization". For more details on hands, see "MediaPipe Hands: On-device Real-time Hand Tracking".
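
To illustrate what such an extra task could look like (this is only a sketch under assumed shapes and labels, not the MediaPipe training code), a per-joint confidence head could be supervised with a binary cross-entropy loss and combined with the landmark loss as a multi-task objective:

```python
import numpy as np

def per_joint_confidence_loss(logits, visible):
    """Binary cross-entropy over 21 per-joint confidence scores.

    logits:  (batch, 21) raw scores from a hypothetical extra output head
    visible: (batch, 21) 0/1 labels (e.g. joint inside the crop / not occluded)
    """
    probs = 1.0 / (1.0 + np.exp(-logits))    # sigmoid
    eps = 1e-7                               # numerical stability
    bce = -(visible * np.log(probs + eps) + (1 - visible) * np.log(1 - probs + eps))
    # averaged and added to the landmark loss, e.g. with a per-task weight
    return np.mean(bce)
```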

6) What motivated the choice of directly regressing the keypoints vs. regressing a heatmap?

As stated in the blog post: "A common alternative approach is to predict a 2D heatmap for each landmark, but it is not amenable to depth prediction and has high computational costs for so many points."
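
A rough size comparison makes the cost argument concrete (illustrative numbers only; the 64x64 heatmap resolution is an assumption, not a MediaPipe parameter):

```python
# Direct regression: 21 joints x (x, y, z) = 63 scalars per hand,
# and the z coordinate is simply a third regression target.
num_regression_outputs = 21 * 3          # 63

# Heatmap alternative: one 2D map per joint, e.g. 64x64, which is
# orders of magnitude more outputs and still carries no z coordinate.
num_heatmap_outputs = 21 * 64 * 64       # 86016

print(num_regression_outputs, num_heatmap_outputs)
```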

7) Could you elaborate more on the mixed training schema for using both synthetic and real images to train simultaneously?

The images from both datasets are combined in one batch, and for landmarks where the z value is not present the loss is not calculated, hence the weights that predict the z coordinate are not updated.
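
A minimal sketch of such a masked loss (in NumPy, purely illustrative; the function name and shapes are assumptions, not the actual training code):

```python
import numpy as np

def masked_landmark_loss(pred, target, has_z):
    """L2 loss over 21x3 landmarks with the z term masked for real images.

    pred, target: (batch, 21, 3) predicted / ground-truth landmarks
    has_z:        (batch,) bool, True for synthetic samples that carry z
    """
    diff = pred - target
    # x/y terms are supervised for every sample
    xy_loss = np.mean(diff[..., :2] ** 2)
    # z terms are zeroed out for real images, so no gradient reaches
    # the z-predicting weights from those samples
    z_mask = has_z.astype(np.float32)[:, None]                   # (batch, 1)
    z_loss = np.sum((diff[..., 2] ** 2) * z_mask) / max(z_mask.sum() * 21, 1.0)
    return xy_loss + z_loss
```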

Thanks @ablavatski and @wakoan

pablovela5620 commented 4 years ago

Perfect, this is exactly what I was looking for! I had one last question: I know that currently the palm estimation module is BlazePalm, and I recently saw that the Google Research team released a paper on BlazePose. Are there plans to release this in MediaPipe?

Thank you again for taking the time to answer all of my questions! This is extremely helpful

mgyong commented 4 years ago

We plan to release some version of BlazePose, but the timeline is not confirmed.

pablovela5620 commented 4 years ago

Great, thank you

pablovela5620 commented 4 years ago

Also, it would be extremely helpful to have "MediaPipe Hands: On-device Real-time Hand Tracking" somewhere in the documentation (maybe under resources?). It has a lot of useful info, and I had no clue the team had released it on arXiv.

RamanHacks commented 3 years ago

@mgyong If you can share, which 3D software are you using to build the synthetic datasets?