pkhungurn / talking-head-anime-2-demo

Demo programs for the Talking Head Anime from a Single Image 2: More Expressive project.
http://pkhungurn.github.io/talking-head-anime-2/
MIT License

Improvements #2

Closed · graphemecluster closed this issue 3 years ago

graphemecluster commented 3 years ago

Improvements

dragonmeteor commented 3 years ago

Thank you for your suggestions. However, to keep the functionality minimal, I will not accept such a big pull request if you send one. It's probably best to direct people to your fork instead.

graphemecluster commented 3 years ago

That's OK. I will mainly work on the first 2 points then. As for the fourth point, I am actually seeking someone's help with the algorithm; this is also one of the reasons why I opened this issue.

graphemecluster commented 3 years ago

I edited my fork to minimize the alterations to this repository, and I would like to open a pull request because I found some mistakes (I am not sure if they are intentional, though) and I think it is worth automating the process of converting all pixels of (n, n, n, 0) to (0, 0, 0, 0).
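
For illustration, something along these lines would do the conversion (just a sketch with Pillow and NumPy, not the actual code in my fork; the file names are placeholders):

```python
# Sketch: zero out the RGB channels of every fully transparent pixel,
# so that (n, n, n, 0) becomes (0, 0, 0, 0).
import numpy
from PIL import Image

def clean_transparent_pixels(input_path: str, output_path: str) -> None:
    image = numpy.array(Image.open(input_path).convert("RGBA"))
    mask = image[:, :, 3] == 0      # pixels whose alpha channel is zero
    image[mask] = [0, 0, 0, 0]      # force their RGB channels to zero as well
    Image.fromarray(image).save(output_path)

# clean_transparent_pixels("character.png", "character_clean.png")
```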

Additionally, I have read part of your article, and you mentioned keypoint-based tracking. Although a direct mapping may not be possible, I wonder if you have studied converting a set of facial landmarks returned by dlib into your poser's parameter set. I think it is a simple task, but I am not competent enough for this job.

dragonmeteor commented 3 years ago

First, if you submit a pull request with the first two improvements, I will work with you to incorporate them into the main repository. I think they are simple enhancements that make the software easier to use.

As you might have already observed, the dlib landmarks are quite enough to determine how open or closed the eyes and the mouth are, so I used them in the much simpler Version 1 of the software.
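
To give an idea of the kind of measurement I mean (this is a common formulation, not necessarily the exact formula in Version 1): the eye aspect ratio computed from the six landmarks of one eye in the 68-point dlib model falls toward zero as the eye closes.

```python
# Sketch: eye aspect ratio from the six landmarks of one eye
# (e.g. indices 36..41 or 42..47 in the 68-point dlib convention).
import numpy

def eye_aspect_ratio(eye_points: numpy.ndarray) -> float:
    # eye_points: shape (6, 2), one (x, y) row per landmark.
    p1, p2, p3, p4, p5, p6 = eye_points
    vertical = numpy.linalg.norm(p2 - p6) + numpy.linalg.norm(p3 - p5)
    horizontal = numpy.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)
```

A similar vertical-to-horizontal ratio over the mouth landmarks can serve as a mouth-openness measurement.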

Can we determine the rest of the parameters from them? I don't think so. The landmarks simply do not have enough information. For example, they do not tell you where the irises are, so you cannot determine the iris parameters from them.

Even if you abandon the iris parameters, I think determining the other parameters would still be hard. The iPhone needs an RGB capture and probably a depth image to infer the 52 blendshape parameters. Works that determine Action Units, such as EmotioNet, use both shape (landmark) and shading (image) cues. The landmarks, on the other hand, give you only the shape, not the shading.

dragonmeteor commented 3 years ago

In the end, I think the problem you want to solve is how to control characters without having to own an iPhone. To do this, you need to be able to determine pose parameters from a video feed rather than from dlib landmarks.

Guess what, I tried to solve this problem, but I gave up. I was thinking that I could leverage some free tools to do it. The most promising seemed to be OpenFace 2, which outputs a number of Action Units. Nonetheless, OpenFace 2 does not give me any information about the irises, so I decided to supplement it with outputs from the MediaPipe Iris and MediaPipe Face Mesh models.

At that point, I thought I could cook up some simple formulas to reliably determine the parameters from all the outputs. I was wrong. OpenFace 2 was not reliable at all. The face mesh deformed in weird ways when the face was not looking straight ahead. I also had a hard time determining how open or closed the eyes and the mouth were from OpenFace 2's landmarks and the face mesh, to the point that I had to revert to dlib landmarks for those jobs. However, the dlib landmarks were also unstable, and the simple algorithm I used in Version 1 broke down when the face moved away from the camera. To get even passable results, I had to specify extra parameters for every video I tried to process. (Here is one such passable result: https://twitter.com/dragonmeteor/status/1329157949156061186/video/1.)

At one point, I realized that trying to solve this problem was not worth my effort. The problem I really want to solve is how to generate animations, not how to determine pose parameters from a video feed. So, I bought an iPhone and a copy of iFacialMocap, and I was able to quickly cook up simple formulas for the poser's parameters from the blendshape parameters it outputs. I think this decision was a good one.
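
To show what I mean by "simple formulas," here is a hypothetical sketch; the blendshape keys and pose parameter names below are placeholders, not the actual identifiers used by iFacialMocap or the poser.

```python
# Sketch: clamp and rescale a few blendshape values into the poser's [0, 1] ranges.
def blendshapes_to_pose(blendshapes: dict) -> dict:
    def clamp01(x: float) -> float:
        return max(0.0, min(1.0, x))

    return {
        "eye_left_closed": clamp01(blendshapes.get("eye_blink_left", 0.0)),
        "eye_right_closed": clamp01(blendshapes.get("eye_blink_right", 0.0)),
        # A small gain, picked by hand, as an example of per-parameter tuning.
        "mouth_open": clamp01(blendshapes.get("jaw_open", 0.0) * 1.5),
    }
```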

So, what to do with the rest of the parameters? I don't know either. If I knew, the project would have contained many more demonstration videos. You have to do your own research. Please let me know if you are successful.

graphemecluster commented 3 years ago

Sorry for bringing up such a digressive problem and asking so much of you. I should have known that you had already put great effort into it. I am glad that you shared your experience with this problem. Thank you for reading and for writing such a long reply. I apologize if I caused any inconvenience; please forgive my selfishness. I will try my best to work on it when I have enough ability to do so.