qianqianwang68 / omnimotion

Understanding Eq. 1 and 2 #7

Open tengyu-liu opened 1 year ago

tengyu-liu commented 1 year ago

Congratulations on achieving this great work! The demo and results are very impressive, and it has been a big hit! I really like the idea of using a quasi-3D representation and ignoring the ambiguities because they are not important to the problem.

I'm trying to understand Eq. 1 and 2 from the paper, and I can't understand why we use the same points in the source frame, $x_i^k$, and the target frame, $x_j^k=\mathcal{T}_j^{-1}\circ\mathcal{T}_i(x_i^k)$. I hope I can get some clarification.

In my understanding, if the points $x_j^k$ are the same points as $x_i^k$ in the canonical frame, then the occlusion relationship would not change across frames as the camera ray still passes through the same set of points in the same order. Since $\sigma_k$ is stored in $G$ and does not change across frames, I don't understand why OmniMotion can handle occlusions.

So my question is, why are we computing $x_j^k$ as $\mathcal{T}_j^{-1}\circ\mathcal{T}_i(x_i^k)$ instead of sampling from a new ray in the $j$-th frame and mapping that to the same canonical space? And why does the model work so well even though $M_\theta$ cannot change the occlusion relationship?

qianqianwang68 commented 1 year ago

Hi Tengyu,

I'm not sure if I fully understand your confusion. I'll try to answer your questions, and let me know if you have further ones.

the occlusion relationship would not change across frames as the camera ray still passes through the same set of points in the same order

the occlusion relationship can change, as two different samples $x_j^m$ and $x_j^n$ may swap orders in frame $j$ (i.e., $x_i^m$ is closer than $x_i^n$ in frame $i$, but with deformation $x_j^m$ can become farther away than $x_j^n$ in frame $j$). And the reason we always sample points from near to far at $p_i$ in frame $i$ is that we want to compute the flow at the location of $p_i$.
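To make the order swap concrete, here is a toy 1D sketch of Eq. 1's composition $\mathcal{T}_j^{-1}\circ\mathcal{T}_i$. The affine maps below are hypothetical stand-ins for the invertible networks $M_\theta(\cdot;\psi_i)$, not the paper's actual deformations:

```python
# Hypothetical 1D affine stand-ins for the invertible per-frame maps
# (the paper uses Real-NVP-style networks); these act on "depth" only.
def T_i(x):        # local frame i -> canonical
    return x

def T_j_inv(u):    # canonical -> local frame j (inverse of u = -0.5*x + 2)
    return (u - 2.0) / -0.5

# Two samples on the ray in frame i, ordered near -> far (m before n).
x_i_m, x_i_n = 1.0, 2.0

# Eq. 1: x_j^k = T_j^{-1}(T_i(x_i^k)).
x_j_m = T_j_inv(T_i(x_i_m))   # -> 2.0
x_j_n = T_j_inv(T_i(x_i_n))   # -> 0.0

print(x_i_m < x_i_n)  # True:  m is nearer than n in frame i
print(x_j_m < x_j_n)  # False: the depth order has swapped in frame j
```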

I can try to give an example to explain why OmniMotion can handle occlusions. Let's say $p_i$ is occluded by some other surface in frame $j$ at $p_j$, which means $p_i$ should go to $p_j$ but is occluded there. Let's assume that the corresponding surface for $p_i$ is $x_i^{n}$ ($\sigma_i^{n}$ is 1 and all other $\sigma$ on the ray are zero), and that the corresponding surface for $p_j$ in frame $j$ is $x_j^{l}$. What happens in this case is that $x_i^{n}$ is mapped to $x_j^{n}$ (which projects to $p_j$) but is farther away than $x_j^{l}$, and that's how it gets occluded. So occlusion happens when some other point exists in front of the point you are tracking, and that other point does not need to be among the points you sampled at $p_i$.

Why not sample from a new ray in the $j$-th frame and map that to the same canonical space?

This can also work, but only if the two points are cycle consistent (co-visible), whereas the loss in Eq. 2 can be applied to occluded points as well. We tried the idea of enforcing cycle-consistent points to be mapped to the same canonical location, but it didn't work very well. In fact, what you need is not only to pull matching points closer together but also to push non-matching points farther apart; otherwise a trivial solution is to shrink the canonical space until it is infinitely small. But we didn't find a version of this loss that worked robustly either.
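A hedged sketch of what such a pull/push loss in canonical space could look like (hypothetical names, not from the released code or the paper's Eq. 2):

```python
import torch
import torch.nn.functional as F

def canonical_pull_push_loss(u_pos_a, u_pos_b, u_neg_a, u_neg_b, margin=1.0):
    """Sketch of the alternative loss discussed above.

    u_pos_*: (N, 3) canonical coordinates of cycle-consistent (matching) pairs.
    u_neg_*: (M, 3) canonical coordinates of non-matching pairs.
    """
    # Pull matching points toward the same canonical location.
    pull = (u_pos_a - u_pos_b).norm(dim=-1).mean()
    # Push non-matching points at least `margin` apart; without this term a
    # trivial solution is to collapse the whole canonical space to a point.
    push = F.relu(margin - (u_neg_a - u_neg_b).norm(dim=-1)).mean()
    return pull + push
```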

Best, Qianqian

tengyu-liu commented 1 year ago

Please correct me if my understanding is wrong:

The occlusion relationship can change, as two different samples $x_j^m$ and $x_j^n$ may swap orders in frame $j$

If $x_i^m$ is closer than $x_i^n$, that means $m < n$, and $T_m\cdot\alpha_m=1$ and $T_n\cdot\alpha_n=0$ for both frames $i$ and $j$. This will not change the occlusion relationship even if the depth order changes between the two frames, unless you re-order $x_j$ by depth.

it ($x_j^n$) is farther away than $x_j^l$, and that's how it gets occluded

Because both $x_j^l$ and $x_i^l$ map to the same point in the canonical volume, they would get exactly the same color and density, right? Consider a rigid scene where something that was not occluded in frame $i$ ($x_i^n$ is closer than $x_i^l$) becomes occluded in frame $j$ ($x_j^l$ is closer than $x_j^n$) due only to camera motion. The only way OmniMotion would work is if $x_i^l$ is already on the camera ray, hiding behind $x_i^n$, even though in reality the occluding object should not be on the camera ray in frame $i$. Is my understanding correct? I believe this is why it is called a quasi-3D representation, in the sense that it is geometrically incorrect but suits the dense tracking task well.

boxraw-tech commented 1 year ago

@tengyu-liu thanks for asking these questions, I'm also trying to get my head around this.

Because both $x_j^l$ and $x_i^l$ map to the same point in the canonical volume, they would get exactly the same colour and density right?

My understanding is they don't necessarily get the same colour and density as $F_\theta$ is parameterised differently for each frame by $\psi_i$.

tengyu-liu commented 1 year ago

According to sections 4.1 and 4.2, I believe that $F_\theta$ is independent of the frame. $M_\theta$ is parameterised by $\psi_i$, which gives different $\mathcal{T}_i$ functions for different frames. Since both $x_i^l$ and $x_j^l$ map to the same point in the canonical volume, they are guaranteed to get the same color and density.

boxraw-tech commented 1 year ago

In section 4.3 it says

density and colour can be written as $(\sigma_k, c_k) = F_\theta(M_\theta(x_i^k; \psi_i))$

so it seems perfectly possible to get different colour and density for the same point in different frames.

qianqianwang68 commented 1 year ago

Hi Tengyu,

The first part is correct. However, I think there is some misunderstanding here:

The only way OmniMotion would work is that $x_i^l$ is already in the camera ray, hiding behind $x_i^n$ even though in reality the occluding object should not be in the camera ray in frame i.

I don't understand why that's the only way OmniMotion would work. $x_i^n$ and $x_i^l$ do not need to be on the same ray in frame $i$. They can be at different pixel locations and both of them can be visible. Let me give you an example using the online demo:

[image: screenshot from the online demo (swing example), showing a blue point and a red point tracked across two frames]

In this example, the blue point is occluded by the red point in the second image (let's assume they are at the same pixel location), but their corresponding pixel locations in the first image are different and both of them are visible.

qianqianwang68 commented 1 year ago

@boxraw-tech Tengyu is correct, if two local points map to the same point in the canonical volume, then they are guaranteed to get the same color and density.
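Schematically (with hypothetical callables, not the repo's module names), the structure being agreed on is that the per-frame latent $\psi_i$ conditions only the mapping, never the canonical field:

```python
def density_and_color(x_i, psi_i, M_theta, F_theta):
    # The per-frame latent psi_i conditions only the bijection M_theta;
    # the canonical field F_theta sees nothing but the canonical point u.
    u = M_theta(x_i, psi_i)   # local coordinates in frame i -> canonical
    sigma, c = F_theta(u)     # frame-independent density and color
    return sigma, c

# So if x_i^l (frame i) and x_j^l (frame j) map to the same canonical u,
# F_theta necessarily returns the same (sigma, c) for both.
```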

unlockpowerofpixels commented 11 months ago

Hi @qianqianwang68 @tengyu-liu

I am still trying to wrap my head around the discussion. Could you help clarify this for me? In particular:

If $\mathbf{x}_i^{m}$ is closer than $\mathbf{x}_i^{n}$, that means $m < n$, and $T_m \cdot \alpha_m = 1$ and $T_n \cdot \alpha_n = 0$ for both frames $i$ and $j$. This will not change the occlusion relationship even if the depth order changes between the two frames. Unless you re-order $\mathbf{x}_j$ by depth.

Following your discussion, assume $m < n$ and that the corresponding surface for $\mathbf{p}_i$ is $\mathbf{x}_i^n$, so $\sigma_i^n = 1$ and the rest are 0 (in particular $\sigma_i^m = 0$). Isn't $T_m \cdot \alpha_m = 0$, since $\alpha_m = 1 - \exp(0) = 0$? And since $T_k$ takes the product over the first $k-1$ points, shouldn't $T_n \cdot \alpha_n$ be 0 as well?
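For concreteness, the compositing quantities for these assumed densities can be checked numerically; a minimal sketch, assuming $\alpha_k = 1 - \exp(-\sigma_k)$ with unit sample spacing and $T_k = \prod_{l<k}(1-\alpha_l)$:

```python
import numpy as np

sigma = np.zeros(8)       # densities along the ray at p_i, near -> far
n = 5
sigma[n] = 1.0            # single surface at sample n, as assumed above

alpha = 1.0 - np.exp(-sigma)                                # alpha_k = 1 - exp(-sigma_k)
T = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))   # T_k = prod_{l<k}(1 - alpha_l)
w = T * alpha                                               # compositing weights T_k * alpha_k

print(w.round(3))  # zero everywhere except w[n] = 1 - exp(-1) ≈ 0.632
# w[n] only approaches 1 as sigma[n] -> infinity (a fully opaque surface).
```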

serycjon commented 1 week ago

I think I have finally figured out the occlusions :). For a given point $p_i$ in the first image, you always get the same positions in the canonical volume, and the same colors and densities. Then for the second image you do the "alpha compositing" to get a single point $x_j$ (that is, the 2D point $p_j$ and its "depth"). You don't get the occlusion state yet.

They don't mention how to get the occlusion state in the paper, but I think I have found it in the code. To get the occlusion, they project $p_j$ back into the canonical space (constructing samples, projecting) to get the densities and thus the "depth" (i.e., something like picking the sample on the $p_j$ ray with the biggest density gives you the "depth"). Finally, they compare this "depth" in the second image with the "depth" of the point $p_i$ projected into the second image.

So in the swing example, you first project the blue point on the lady into the second image to get the position in the swing frame plus a "depth" prediction. Then you go backward from this position (into the canonical frame to get densities and thus "depth") to check if the predicted "depth" is as expected (in this case it is not). The important thing is that the red point does not matter at all. The occluder may not even be visible in the first image.
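A hedged pseudocode rendering of that procedure (the helper names below are hypothetical, not the repo's):

```python
def check_occlusion(p_i, frame_i, frame_j, render_point, render_depth, eps=0.02):
    """Sketch of the occlusion test described above.

    render_point(p, src, dst): sample the ray at p in frame `src`, map the
        samples through the canonical volume into frame `dst`, and
        alpha-composite them into a 2D location and a composited "depth".
    render_depth(p, frame): sample the ray at p in `frame`, query densities
        via the canonical volume, and composite the visible "depth" at p.
    """
    # 1) Track p_i into frame j: predicted location p_j plus its "depth".
    p_j, depth_from_i = render_point(p_i, frame_i, frame_j)
    # 2) Go backward from p_j to get the visible "depth" at that pixel.
    depth_at_p_j = render_depth(p_j, frame_j)
    # 3) If the tracked point lands behind the visible surface, it is occluded.
    return p_j, depth_from_i > depth_at_p_j + eps
```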