sr229 closed this issue 3 years ago
Do we need low latency?
I think we are building VTuber streaming tooling or something similar. If so, it would be good to use Kalman filters and buffered inputs so face tracking can be parallelized and stabilized.
Frame interpolation, however, seems definitely needed, because it may make it possible to stream a VTuber at 4K60.
We definitely need these, as landmark input by itself is very raw and rough. @LeNitrous, give Kalman filters a look?
I gave kalman filtering a look, and judging from how it was explained:
Kalman filtering, also known as linear quadratic estimation (LQE), is an algorithm that uses a series of measurements observed over time, containing statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone, by estimating a joint probability distribution over the variables for each timeframe.
We definitely need this, though I'm not sure if this is input level, frame level, or at the neural network level.
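To make the input-level option concrete, here is a minimal sketch of a 1D constant-velocity Kalman filter smoothing a single landmark coordinate. The class name and the noise constants are made up for illustration and are not from any real tracker; in practice you would run one such filter per landmark axis.

```python
class Kalman1D:
    """Minimal constant-velocity Kalman filter for one landmark axis.
    q and r are illustrative process/measurement noise values."""

    def __init__(self, q=1e-3, r=1e-1):
        self.x = 0.0   # estimated position
        self.v = 0.0   # estimated velocity
        # 2x2 covariance matrix, stored as nested lists
        self.p = [[1.0, 0.0], [0.0, 1.0]]
        self.q = q     # process noise
        self.r = r     # measurement noise

    def update(self, z, dt=1.0):
        # Predict: advance the state with the constant-velocity model.
        self.x += self.v * dt
        p = self.p
        p00 = p[0][0] + dt * (p[1][0] + p[0][1]) + dt * dt * p[1][1] + self.q
        p01 = p[0][1] + dt * p[1][1]
        p10 = p[1][0] + dt * p[1][1]
        p11 = p[1][1] + self.q
        # Update: blend in the noisy measurement z via the Kalman gain.
        k0 = p00 / (p00 + self.r)   # gain applied to position
        k1 = p10 / (p00 + self.r)   # gain applied to velocity
        y = z - self.x              # innovation (measurement residual)
        self.x += k0 * y
        self.v += k1 * y
        self.p = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
        return self.x

# Feeding in jittery samples around a true value of 1.0 pulls the
# estimate toward 1.0 while damping the frame-to-frame jumps.
kf = Kalman1D()
smoothed = [kf.update(z) for z in [1.0, 1.2, 0.8, 1.1, 0.9, 1.0]]
```

Because each landmark axis filters independently, this maps naturally onto the buffered, parallel face-tracking pipeline mentioned above.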
Re-assigned to AB2
Closing, as the direction of this project has changed toward an abstract and modular approach to puppet renderers.
As part of #28, we discussed how raw landmark data would be jittery and rough, even if the neural network used were theoretically as precise as a human eye predicting the facial movements of the subject. To compensate for jittery input, we will implement a form of lag-compensation algorithm.
Background
John Carmack's work on Latency Mitigation for Virtual Reality Devices (source) explains that the latency from the user's head movement to the update reaching the eyes is critical to the experience. While the document is aimed mainly at virtual reality, one can argue that the methodologies used to provide a seamless VR experience can be applied to a face tracking application, as face tracking, like HMDs, is also a very demanding "human-in-the-loop" interface.
Byeong-Doo Choi et al.'s work on frame interpolation uses a novel motion-prediction algorithm, adaptive OBMC, to enhance a target video's temporal resolution. According to the paper, this interpolation technique has been shown to give better results than the frame interpolation algorithms currently on the market.
Strategy
As stated in the background, there are many strategies for compensating for the raw, jittery prediction data coming from the neural network; we are limiting ourselves to these two:
Frame Interpolation by Motion Prediction
Byeong-Doo Choi et al. achieve frame interpolation as follows:
According to their experiments, this method produces better image quality in the interpolated frames, which would help prediction in our neural network. However, it comes at the cost of processing the video at runtime, since their experiments were performed only on pre-rendered video frames.
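For contrast, the naive baseline that motion-compensated methods improve on is a plain per-pixel linear blend between two frames. The sketch below (frames as nested lists, purely illustrative) shows that baseline; Choi et al.'s adaptive OBMC instead estimates per-block motion vectors and warps overlapping blocks, which is where the runtime cost comes from.

```python
def blend_frames(prev_frame, next_frame, t=0.5):
    """Naive temporal interpolation: per-pixel linear blend at time t.
    Real OBMC warps overlapping motion-compensated blocks instead of
    blending pixels in place, avoiding ghosting on moving objects."""
    return [[(1 - t) * p + t * n for p, n in zip(row_p, row_n)]
            for row_p, row_n in zip(prev_frame, next_frame)]

# Interpolating halfway between two tiny 1x2 "frames":
mid = blend_frames([[0, 2]], [[2, 4]])
```

Even this trivial blend touches every pixel of every interpolated frame, which gives a feel for why doing it from a live camera feed at 4K60 is expensive.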
View Bypass/Time Warping
John Carmack's work on reducing input latency for VR HMDs suggests a multitude of methods, one of them being View Bypass, a method achieved by taking a newer sample of the input.
To achieve this, the input should be sampled once but used by both the simulation and the rendering task, reducing latency for both. However, the input thread and the game thread must run in parallel, and the programmer must be careful not to reference mutable game state, as doing so would cause a race condition.
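A minimal sketch of that single-sample idea, assuming a threaded Python model with illustrative names: the input poller publishes an immutable snapshot, and both the simulation and render tasks read that same snapshot rather than touching mutable game state, which sidesteps the race condition described above.

```python
import threading

class InputSnapshot:
    """Latest-sample-wins input store. The poller thread writes whole
    immutable samples; simulation and rendering read the same sample
    without ever referencing mutable game state. Names are
    illustrative, not from our codebase."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sample = None

    def publish(self, sample):
        # Replace the snapshot atomically; never mutate it in place.
        with self._lock:
            self._sample = sample

    def latest(self):
        with self._lock:
            return self._sample  # immutable tuple, safe to share

snap = InputSnapshot()
snap.publish((0.12, -0.05))       # e.g. head yaw/pitch from the tracker
sim_input = snap.latest()         # simulation task reads the sample
render_input = snap.latest()      # render task reads the same sample
```

Because readers only ever see complete, immutable tuples, the input loop can run as fast as the tracker allows without synchronizing against the game loop.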
Another method mentioned by Carmack is Time Warping, of which he states:
There are different methods of warping, namely forward warping and reverse warping, and these warping methods can be used alongside View Bypass. The added complexity of handling input concurrently with the main loop is manageable, as the input loop is entirely independent of the game state.
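As a rough illustration of forward warping, the sketch below shifts an already-rendered scanline by the pose delta that accumulated after rendering; a real HMD time warp reprojects the full 2D image with the newest head orientation. The pose values and pixels-per-radian scale are made up for illustration.

```python
def warp_row(row, shift, fill=0):
    """Shift a rendered scanline sideways by `shift` pixels, padding
    the newly exposed edge with `fill`. A real time warp reprojects
    the whole frame, not a single row."""
    if shift > 0:
        return [fill] * shift + row[:-shift]
    if shift < 0:
        return row[-shift:] + [fill] * (-shift)
    return list(row)

# Pose delta since the frame was rendered (illustrative values).
rendered_yaw, latest_yaw = 0.0, 0.005   # radians
px_per_radian = 200                     # hypothetical display scale
shift = round((latest_yaw - rendered_yaw) * px_per_radian)
warped = warp_row([10, 20, 30, 40], shift)
```

The warp runs after rendering using only the newest input sample, which is why it composes cleanly with View Bypass.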
Conclusion
The strategies above would give us a smoother experience; however, based on my analysis, Carmack's solutions are more feasible for a project of our scale. We simply don't have the team or the technical resources to implement from-camera video interpolation, as it is too computationally expensive to run with minimal overhead.