@yzhao520 In util/data_kitti.py and util/evaluation.py, you can find how we read the depth from the RGB image. You can also store the KITTI depth as np.uint16 to get higher accuracy. The preprocessing step follows monodepth.
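A minimal sketch of what reading a quantized depth PNG back into meters usually looks like (my own helper and default `max_depth`, not the repo's exact code in util/data_kitti.py):

```python
# Sketch only: decode a depth map saved as an 8-bit or 16-bit PNG into meters.
import numpy as np
from PIL import Image

def load_depth_png(path, max_depth=80.0):
    """Hypothetical helper: undo the quantization applied when the depth was saved."""
    depth = np.array(Image.open(path))
    if depth.dtype == np.uint8:
        depth = depth.astype(np.float32) / 255.0 * max_depth     # 8-bit quantization
    else:
        depth = depth.astype(np.float32) / 65535.0 * max_depth   # assume 16-bit quantization
    return depth
```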
For the depth prediction, which network did you train? T2model.py with the GAN loss is not stable; you need to monitor the losses at different scales. TaskModel.py is suitable for depth prediction. If you only train TaskModel.py, the predicted depth should not look like that.
The depth processing for NYU v2 is the same as ours.
I'm using TaskModel.py, which is also called the 'supervised' mode in your code. As far as I can tell, this TaskModel is just a depth predictor, and I have no idea why the output looks like this.
Just curious: you said you store depth as np.uint16. Does this imply your depth values are all integers for both KITTI and NYU v2?
Here are some details of what I did for TaskModel training; I would very much appreciate it if you could help me spot any mistakes.
- image_source is NYU_training_rgb, image_target is NYU_testing_rgb; lab_source is NYU_training_depth, lab_target is NYU_testing_depth
- Preprocess the RGB and depth images as I stated above
- Select ResNetGenerator in define_G from network.py (the default is UNet), and train for 30 epochs since the default is only 10.
Are 10 or 30 epochs enough for training? Also, are you aware of any side effects caused by using ResNetGenerator?
In our ECCV work, the depth is stored as np.uint8. But when we entered the KITTI depth prediction competition, the depth was stored as np.uint16.
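To make the uint8 vs uint16 difference concrete, here is a rough back-of-the-envelope calculation (my numbers for an assumed 80 m range, not figures from the paper):

```python
# Quantization step per integer level when an 80 m depth range is stored
# as np.uint8 vs np.uint16.
max_depth = 80.0
print(max_depth / 255)    # ~0.314 m per level with np.uint8
print(max_depth / 65535)  # ~0.0012 m per level with np.uint16
```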
How long does it take you to train the model for 30 epochs? Because our synthetic data is very large, we only train for 10 or 20 epochs.
Your data processing is the same as ours, so I guess the problem comes from the model. The UNet copies features from the previous layers and the encoder layers. In this project, the UNet performs better than the ResNet.
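For illustration, this is a generic UNet-style decoder block showing the feature copying (skip connection) being described; it is a toy sketch, not the implementation in network.py:

```python
# Toy UNet decoder step: the upsampled features are concatenated with the
# matching encoder feature map, so fine spatial detail is preserved.
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.conv = nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, skip):
        x = self.up(x)                   # upsample decoder features
        x = torch.cat([x, skip], dim=1)  # copy ("skip") features from the encoder
        return torch.relu(self.conv(x))
```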
My depth is stored as np.float32 values, which I guess could make a difference. So your depth values for the indoor dataset are integers like 0, 1, 2, 3, ..., 10 (before clipping)?
Just curious, how many real and synthetic samples do you use in training? I'm guessing this problem might also be related to insufficient training samples. I'm using about 9000 real RGB-depth pairs for training at the moment (NYU v2).
When I'm training TaskModel, each epoch takes less than an hour (around 50 minutes) to finish on a 1080 Ti. I'm mostly playing with the TaskModel, so I'm not using the synthetic data for now. I'll switch to the UNet and check its performance later.
The depth value range is [0, 255], and it is normalized to [-1, 1].
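A minimal sketch of that [0, 255] to [-1, 1] mapping and its inverse (matching the tanh output range); the exact transforms in the repo may differ:

```python
import numpy as np

def normalize(depth_u8):
    return depth_u8.astype(np.float32) / 127.5 - 1.0  # [0, 255] -> [-1, 1]

def denormalize(depth_norm):
    return (depth_norm + 1.0) * 127.5                 # [-1, 1] -> [0, 255]
```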
We have 122K image-depth pairs for the SUNCG synthetic dataset and 21K image-depth pairs for the vKITTI dataset. That's why we only trained for 10 and 20 epochs.
I think you should first try the UNet structure, because copying features from the previous layers and encoder layers is important for depth prediction.
Thanks for the clarification and suggestions. I'll close this issue for the moment.
Hi,
I have tried the UNet structure and the prediction seems much better now; however, the predicted depth images are somewhat blurry. I'm wondering whether this might be because of the depth preprocessing.
Just curious: in evaluation.py, how do you choose the max_depth in the load_depth function? It seems the ground-truth max depth value is always 10, while the maximum of the predicted depth depends on the dataset. Where do these values come from? Are they related to the depth preprocessing?
Also, when you say you process the depth images in the same way, do you mean exactly the same, or just that you also normalize depth into [-1, 1]? If your depth is in [0, 255], do you clip the depth values before feeding them into the network?
We do the depth clipping during depth preprocessing. Following previous work, you can define the max depth in KITTI as 80 m and the max depth in NYU v2 as 8 m. Then any values larger than these are clipped, giving the ranges [1, 80] m and [0, 8] m.
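A hedged sketch of the clipping described above (my own helper and dataset keys, not code from the repo):

```python
# Clip metric depth to the per-dataset range before any rescaling/normalization.
import numpy as np

def clip_depth(depth_m, dataset="kitti"):
    if dataset == "kitti":
        return np.clip(depth_m, 1.0, 80.0)   # [1, 80] m
    elif dataset == "nyuv2":
        return np.clip(depth_m, 0.0, 8.0)    # [0, 8] m
    raise ValueError("unknown dataset: %s" % dataset)
```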
In the original KITTI and NYU depth maps, there is some noise from the depth sensor. Do you use some algorithm to preprocess the depth?
I'm using a processed NYU dataset in which the depth images have already been inpainted. The depth images I'm currently using are np.float32 values in [0, 10.0]. If your depth values are in [0, 255], how do you clip them to [0, 8.0]? It seems you just divide by 255 and normalize to [-1, 1].
Also, it seems you have a min_depth, and I'm wondering why you don't clip depth to [1.0, 8.0] instead of [0.0, 8.0] for the NYU v2 dataset.
I can understand the idea of clipping the depth images to [0, 8.0] m for NYU v2, but why is the maximum depth 10 in evaluate.py at line 24? A mask is also created to ignore some values. Can you explain the intuition behind these choices in the evaluation?
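For context, this is roughly how a validity mask is built in monodepth-style evaluation code (a sketch under my own assumptions; the function name, thresholds, and metric here are not taken from evaluation.py):

```python
# Only ground-truth pixels inside (min_depth, max_depth) contribute to the error;
# zero/invalid sensor readings are excluded, and predictions are clamped to the same range.
import numpy as np

def compute_rmse(gt, pred, min_depth=1e-3, max_depth=10.0):
    mask = (gt > min_depth) & (gt < max_depth)   # valid ground-truth pixels only
    pred = np.clip(pred, min_depth, max_depth)
    return np.sqrt(np.mean((gt[mask] - pred[mask]) ** 2))
```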
This is a sample of the depth prediction I got by running TaskModel. It's kind of blurry and I was expecting something sharper. So far, I'm not quite sure what could be the problem.
Hi,
I'm fairly new to the depth prediction area, so please forgive me if my questions are silly.
My first question is how you preprocess the depth images before feeding them into the network (for both NYU and KITTI). I have noticed that the final activation layer is tanh, so I'm guessing the depth samples are rescaled to between -1 and 1? (I found the tensor2im function in util.py, but I'm still not quite sure about the preprocessing step.)
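For reference, this is my reading of what such a tanh-to-image conversion usually does (a sketch only, not the repo's actual tensor2im; the function name and input layout are my assumptions):

```python
# Undo the [-1, 1] normalization of a tanh output and convert it to an 8-bit image.
import numpy as np

def tanh_output_to_image(output_chw):
    """output_chw: np.float32 array in [-1, 1], shape (C, H, W)."""
    img = (np.transpose(output_chw, (1, 2, 0)) + 1.0) / 2.0 * 255.0  # [-1, 1] -> [0, 255]
    return img.astype(np.uint8)
```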
My second question is about replicating the depth prediction results. I got some depth predictions with the task network on the NYU v2 dataset. The final error is not very close to what's reported in the paper, and the output looks weird (shown below; the left one is the ground truth and the right one is the prediction). I'm wondering whether you have any suggestions as to why the output does not look as good as reported.
For this task network depth prediction (NYU v2), I did the following in training: