huang-yh / SelfOcc

[CVPR 2024] SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction
Apache License 2.0

How to compute L_dep by only the current frame? #1

Closed Fxmangd closed 8 months ago

Fxmangd commented 9 months ago

Dear authors,

First, thank you for your great work. Your paper states that only the current frame is used as supervision for depth estimation and that L_dep is activated for it, as quoted below:

We activate temporal supervision for 3D occupancy prediction and novel view synthesis which both involve scene reconstruction from multiple viewpoints, while we use only the current frame as supervision for depth estimation since it predicts the geometry only for the input view. As for loss formulation, we use L_dep , L_E and an edge loss L_edg [22] for depth estimation since RGB supervision from the input view is meaningless, while L_H and L_s focus on improving the overall geometry, thus helpless for depth estimation.

As stated in the paper, L_dep is formulated as the minimum of two L_mvs terms, using either the previous image I_t−1 or the next image I_t+1 as the source image. Does "previous image" here refer to the previous frame in the temporal sequence, or to the adjacent camera view in the surround-view setup? As far as I know, the overlap between adjacent surround-view cameras in nuScenes is relatively small, so I have some doubts about this.

Thank you very much in advance.

huang-yh commented 9 months ago

Thank you for your interest!

The previous image here refers to the previous frame in the temporal sequence. The reason for this choice is just as you pointed out: there is little FOV overlap between spatially adjacent images. Other depth estimation work such as SurroundDepth does take this spatial correlation into consideration explicitly, but we treat the problem similarly to monocular depth estimation.
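For readers following along, here is a minimal sketch of the "minimum of two losses" idea discussed above. It is not the SelfOcc implementation: a plain per-pixel L1 photometric error stands in for L_mvs, the warped images are assumed to be already computed, and the function names `l1_photometric_error` and `min_over_sources_loss` are hypothetical.

```python
import numpy as np

def l1_photometric_error(target, warped):
    """Per-pixel L1 error between the target image and a source image
    warped into the target view (assumed stand-in for L_mvs)."""
    return np.abs(target - warped).mean(axis=-1)

def min_over_sources_loss(err_prev, err_next):
    """Take the per-pixel minimum of the two error maps, then average."""
    per_pixel = np.minimum(err_prev, err_next)
    return float(per_pixel.mean())

# Toy 2x2 RGB example: the current frame I_t is all zeros.
target = np.zeros((2, 2, 3))
# Hypothetical warps of I_{t-1} and I_{t+1} into the current view.
warped_prev = np.full((2, 2, 3), 0.2)
warped_next = np.full((2, 2, 3), 0.6)
warped_next[0, 0] = 0.0  # the next frame matches perfectly at one pixel

err_prev = l1_photometric_error(target, warped_prev)
err_next = l1_photometric_error(target, warped_next)
loss = min_over_sources_loss(err_prev, err_next)
print(loss)
```

Taking the minimum per pixel means a pixel that is occluded or out of view in one temporal neighbour can still be supervised by the other.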

Fxmangd commented 9 months ago

Thank you very much for your reply.

I have a small question about another part. Your paper presents ablation results under different temporal supervision settings. [Screenshot of the ablation table, 2023-11-28] Is the result in the first row, where both Prev. and Next are set to 0, obtained without temporal supervision, or in other words, without L_dep?

huang-yh commented 9 months ago

Not exactly. In the first row, we still use L_dep to supervise the depth of the current frame.

In the other rows, nonzero values under Prev. and Next mean that we also supervise the depth of the temporally adjacent frames.

Fxmangd commented 9 months ago

Thank you very much for your patient answer. My question has been completely resolved.

Could you share more about the evaluation criteria for depth estimation? Occupancy can only cover a range of 80m in BEV, but the depth generated by projecting LiDAR points may exceed this range. Did you use a mask when evaluating depth?

huang-yh commented 9 months ago

Absolutely. We use TPV to represent an area of [-51.2, -51.2, -4.0, 51.2, 51.2, 5.0] around the ego car for nuScenes. However, we still follow the conventional evaluation protocol for depth estimation which evaluates depth in the range of [-80, 80].

Btw, we are going to release the evaluation code soon.
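Until that code is available, a range-masked depth metric can be sketched roughly as follows. This is only an illustration of the conventional protocol, not the authors' evaluation code: the 80 m cap comes from the reply above, while the 0.1 m lower bound, the clipping of predictions, and the function name `abs_rel_in_range` are assumptions.

```python
import numpy as np

def abs_rel_in_range(pred, gt, min_depth=0.1, max_depth=80.0):
    """Absolute relative error over pixels whose ground-truth depth
    lies inside (min_depth, max_depth). Bounds are assumed defaults."""
    mask = (gt > min_depth) & (gt < max_depth)
    # Clip predictions to the evaluated range (a common convention, assumed here).
    pred_valid = np.clip(pred[mask], min_depth, max_depth)
    gt_valid = gt[mask]
    return float(np.mean(np.abs(pred_valid - gt_valid) / gt_valid))

# Toy example: 0 marks a pixel with no LiDAR return, 120 m is beyond the cap.
gt = np.array([10.0, 40.0, 0.0, 120.0])
pred = np.array([11.0, 36.0, 5.0, 100.0])
score = abs_rel_in_range(pred, gt)
print(score)
```

Only the first two pixels pass the mask here, so the missing return and the 120 m point do not affect the score.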

Fxmangd commented 9 months ago

Thank you very much for your answer once again. I look forward to your upcoming code release.