qianqianwang68 / omnimotion

Apache License 2.0

question about formula (2) #23

Closed. 2019EPWL closed this 9 months ago.

2019EPWL commented 10 months ago

Hi, thank you for releasing the code for this work. I have a question about formula (2). For the points sampled on a ray in the camera space of frame i, when they are mapped to frame j they may no longer lie on a straight line, so why can the pixel coordinates and corresponding color be obtained using NeRF-style accumulation?

2019EPWL commented 10 months ago

In other words, can I think of OmniMotion as actually using the NeRF density to replace the attention scores in a transformer?

qianqianwang68 commented 10 months ago

Hi, you are right that the points on a ray in frame i will no longer form a straight line in frame j. However, we are not doing accumulation in frame j; we are only doing accumulation in frame i, where the ray is straight, so I believe the operation still makes sense. It might be a little confusing that we are accumulating flows, because flows are computed as the difference between locations in frame i and frame j, but remember that the accumulation is still done in frame i.
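To make the "accumulate flows, but in frame i" point concrete, here is a minimal sketch (not the OmniMotion code; `composite_flow`, `sigma`, and `deltas` are illustrative names) of NeRF-style alpha compositing where the weights come from the straight ray in frame i, while the quantity being composited is the per-sample flow:

```python
import numpy as np

def composite_flow(x_i_samples, x_j_samples, sigma, deltas):
    """Alpha-composite per-sample flows along one ray.

    x_i_samples: (K, 3) samples along a straight ray in frame i.
    x_j_samples: (K, 3) the same points mapped into frame j (may be curved).
    sigma:       (K,)   densities at the samples (evaluated in frame i).
    deltas:      (K,)   spacing between consecutive samples.
    """
    alpha = 1.0 - np.exp(-sigma * deltas)                          # (K,)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]  # T_k
    weights = trans * alpha                                        # w_k = T_k * alpha_k
    # The flow is a difference between frames i and j, but the weights
    # w_k are computed along the straight ray in frame i.
    flow = x_j_samples - x_i_samples                               # (K, 3)
    return (weights[:, None] * flow).sum(axis=0)                   # expected flow
```

With one effectively opaque sample on the ray, the composited flow reduces to that sample's flow, which matches the intuition discussed below in the thread.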

I'm not sure how strongly this is connected to attention in transformers, though.

2019EPWL commented 10 months ago

@qianqianwang68 Hi, you are right; what puzzles me is the flow prediction. Can I understand formula (2) this way: if I replace all the j subscripts in formula (2) with i, then the alpha compositing in frame i gives me the coordinate x_i of a 3D surface point in frame i, and mapping that point into frame j gives the 3D surface coordinate x_j in frame j?

serycjon commented 10 months ago

> the points on a ray in frame i will no longer form a straight line in frame j. However, we are not doing accumulation in frame j, we are only doing accumulation in frame i where the ray is straight, so I believe the operation still makes sense. It might be a little confusing the fact that we are accumulating flows because flows are computed as the difference between locations in frame i and frame j, but remember that the accumulation is still done in frame i.

Maybe the key is that during the alpha compositing, the (T_k * alpha_k) coefficient is pretty much zero everywhere and has a non-zero value only at a single index (the first visible surface along the ray in the left image)? So in fact the alpha-compositing step could be seen as: take the first sample with non-zero sigma_i and project it to get x_j?

Could you elaborate, or give an example if my reasoning is not correct? The other option seems to be "averaging multiple, possibly quite different, flows / x_j^k", and that does not make sense to me.

qianqianwang68 commented 10 months ago

Hi @2019EPWL, that's a great question! As you pointed out, there are actually two options one can consider for getting the mapping from a pixel in frame i to a 3D location in frame j. One is as in the paper: first compute the mapping from i to j for all samples on the ray, and then alpha-composite them. The other is to first composite to get the 3D location in frame i, and then map this single 3D location to frame j using the mapping network. We experimented with both and found that only with the former could we learn a meaningful mapping network. My intuition for this is simple: with the former, the mapping network receives gradients for all samples on the ray, which is a denser signal and better for optimization, whereas with the latter the mapping network only receives gradients for a single point on the ray, which is much sparser and harder to train. In addition, at the beginning of training, since the density field hasn't converged, the alpha compositing doesn't yet produce a meaningful 3D location, so the mapping network can receive noisy signals that hurt training.
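The two options can be sketched side by side; this is an illustrative toy (`map_i_to_j`, `weights` are assumed names, not the repo's API). The two only coincide in general when the weights are one-hot or the mapping is affine, which is one way to see that they are genuinely different parameterizations:

```python
import numpy as np

def expected_xj_composite_first(x_i_samples, weights, map_i_to_j):
    """Option (a), as in the paper: map every sample i -> j, then composite.
    Gradients reach the mapping network at all K sample locations."""
    x_j_samples = np.stack([map_i_to_j(x) for x in x_i_samples])  # (K, 3)
    return (weights[:, None] * x_j_samples).sum(axis=0)

def expected_xj_map_single(x_i_samples, weights, map_i_to_j):
    """Option (b): composite first in frame i, then map the single point.
    The mapping network only sees one point per ray."""
    x_i = (weights[:, None] * x_i_samples).sum(axis=0)            # (3,)
    return map_i_to_j(x_i)
```

For a one-hot weight vector the two agree even for a non-linear map, since compositing just selects one sample; for soft weights and a non-linear map they generally differ.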

qianqianwang68 commented 10 months ago

@serycjon Your understanding is correct. If we simplify the problem and regard every alpha_k as either 0 or 1, then (T_k * alpha_k) will be 1 for the first sample on the ray with alpha = 1 and 0 for all other samples. In reality, alpha_k is continuous between 0 and 1, but we observe that during optimization alpha_k tends to move toward either 0 or 1 (i.e., either transparent or opaque), especially in regions that overlap with others in the video. You can observe this by looking at the weight_stats visualization in TensorBoard. In addition, we add a distortion loss at a late optimization stage to further suppress floaters and encourage the field to be either opaque or transparent.
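The "weights become one-hot under hard alphas" claim is easy to check numerically with a small sketch (illustrative, not the repo's code):

```python
import numpy as np

def ray_weights(alpha):
    """Compute w_k = T_k * alpha_k from per-sample alphas along one ray,
    where T_k is the transmittance accumulated before sample k."""
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]
    return trans * alpha

# With hard 0/1 alphas, only the FIRST opaque sample gets weight 1;
# everything behind it is fully occluded:
w = ray_weights(np.array([0., 0., 1., 1., 0.]))
# w == [0, 0, 1, 0, 0]
```

With semi-transparent alphas (e.g. `[0.5, 0.5]` giving weights `[0.5, 0.25]`), the weight mass is spread across samples, which is exactly the soft-blending regime discussed below.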

My intuition for this is similar to what you mentioned. If the field remains semi-transparent, the network will be "averaging multiple, possibly quite different, flows / x_j^k", which not only doesn't make sense intuitively but also makes it harder for the network to minimize the loss. Although the overall tendency of the density field is to become either fully opaque or fully transparent, we also observed that the method can sometimes exploit the softness of the blending process to fit the noise and errors in the input flow data and further reduce the loss, which is undesirable.