Hi @Amrsaeed. Sorry for the late response. I think an easy way is to generate a file of "ground truth poses" for the objects and run the system as if ground truth were available. The error metrics would not be meaningful in that case, but at least the system can run. Another option is to change the code where it requires ground truth. Hope this helps.
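If you go for the second option, the change would be roughly along the lines of the sketch below (untested, based on the check you quoted from Tracking.cc): keep the object active with identity placeholders instead of skipping it, so its motion is still estimated; the ground-truth-based error metrics will simply be meaningless.

```cpp
// Untested sketch: relax the ground-truth check in Tracking.cc so that objects
// without ground-truth motion are still estimated instead of being skipped.
if (!bCheckGT1 || !bCheckGT2)
{
    cout << "Found a detected object with no ground truth motion! ! !" << endl;
    mCurrentFrame.bObjStat[i] = true;                         // keep the object active
    mCurrentFrame.vObjMod_gt[i] = cv::Mat::eye(4,4, CV_32F);  // identity placeholder
    mCurrentFrame.vObjSpeed_gt[i] = 0.0;                      // no ground-truth speed
    // no 'continue' here: fall through so the object motion is still estimated
}
```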
No worries. I did try creating a ground-truth pose file. However, to use it effectively I had to add a tracking module as an extra pre-processing step and match its results with those of Mask R-CNN to generate a usable pose file.
I will attempt the code-changing approach once I get valid results. I am still facing some issues with the estimated speeds not being very accurate. Thanks!
Hi @Amrsaeed,
Tracking is taken care of by the algorithm using optical flow, so there is no need for an extra tracking module as a pre-processing step. Can you please share more about why this was a necessary pre-processing step in your case?
Could you also share more details about your custom dataset and the kind of speeds you are getting? When you mention that the speed estimates are not very accurate, do you have an idea of what speed the objects in your dataset should be travelling at? And how do you measure the error of the speed estimates if you have no ground-truth data?
Hi @MinaHenein
The reason I added the tracking step is that, to my understanding, there is a matching step that happens according to the object ids. Without the tracking, my object ids would just keep increasing incrementally across frames. While it seems that the labels should be re-adjusted in lines 1573 to 1626 of Tracking.cc to match those of the last frame, when I tested the built-in tracking I was getting a new object for each new label and the matching was not working correctly.
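For reference, the extra step I added is nothing sophisticated; it is roughly the sketch below (simplified and illustrative only: a greedy bounding-box IoU match that carries ids from one frame to the next; the `Detection` struct and the 0.5 threshold are my own placeholders, not anything from the repository).

```cpp
#include <opencv2/core.hpp>
#include <vector>

// Illustrative detection struct: a Mask R-CNN box plus the id we assign.
struct Detection { cv::Rect2f box; int id = -1; };

static float IoU(const cv::Rect2f& a, const cv::Rect2f& b)
{
    float inter = (a & b).area();
    float uni   = a.area() + b.area() - inter;
    return uni > 0.f ? inter / uni : 0.f;
}

// Carry ids from the previous frame to the current one by greedy IoU matching;
// detections with no good match get a fresh id.
void PropagateIds(const std::vector<Detection>& prev, std::vector<Detection>& curr, int& nextId)
{
    for (auto& d : curr)
    {
        float best = 0.5f;  // illustrative IoU threshold
        for (const auto& p : prev)
        {
            float iou = IoU(d.box, p.box);
            if (iou > best) { best = iou; d.id = p.id; }
        }
        if (d.id < 0) d.id = nextId++;
    }
}
```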
Could you also share more details about your custom dataset and the kind of speeds you are getting? When you mention that the speed estimates are not very accurate, do you have an idea of what speed the objects in your dataset should be travelling at? And how do you measure the error of the speed estimates if you have no ground-truth data?
I tried running on a couple of different datasets: TuSimple and Cityscapes. For TuSimple I have speeds relative to my ego vehicle, so I can estimate what the speeds should be. However, I think the main issue lies elsewhere, as I am getting high speeds for parked cars on Cityscapes. I currently suspect the depth estimation module, as I used monodepth2 while I believe you used https://github.com/siposcsaba89/sps-stereo for the KITTI dataset.
Update: I tested again on KITTI without applying the tracking and using the depth estimation from sps-stereo. There is still some difference in the results, but they are now much closer to yours. However, when running with an object_pose.txt file generated on the fly from the Mask R-CNN detections (instead of one generated from the KITTI labels), with no correlation between frames, I get noisy results: high speeds for parked cars and fluctuating estimates. It seems to me that the ground-truth labels play a role in the matching. Correct me if I am wrong.
Hi @Amrsaeed,
(1) About the tracking step: from your description, it looks like the system failed to track the objects temporally on the data you tested, so you added an extra tracking step to handle the tracking. Does it fail in all the cases you tested or only in a particular one? I may need more details to help you fix the issue.

(2) Yes, we used sps-stereo to generate the depth for the KITTI dataset and treated it as RGB-D input; but we also applied monodepth2 to generate depth and reported those results as well, as shown in the paper (Table II). The results using monodepth2 are generally worse and may contain false estimates in some cases, which fits your guess.

(3) May I ask how large the differences between the results you mentioned are? If they are quite small, I think they are due to the non-deterministic parts of the system, such as the RANSAC processing.

(4) The ground-truth label is not used to track an object; it only indicates which object it is so that the estimated label can be aligned with it and the error metrics (6-DoF and speed errors) can be computed. I think I see what is confusing here, as I did not explain it clearly in the readme.txt file. For the KITTI dataset, to make it easier to compare with the ground-truth object motion, I relabel the estimated semantic mask results to match the ground-truth semantic labels in KITTI. For example, if an estimated semantic mask labels three objects as 1, 2 and 3, they are relabelled to align with the ground-truth semantic labels, say 87, 88 and 89. In this way we can easily align the estimated objects with the ground-truth labels and compute the error metrics. But again, this does not help to track objects; it is only for computing errors. That is, if the system tracks an object whose temporal tracking ID ('nModLabel' in the code) is 1, it stays 1 as long as tracking succeeds. We also save the semantic label of this object ('nSemPosition' in the code), but it is not used for tracking. See lines 783-843 in Tracking.cc for more details.

(5) As for the noisy results, i.e., the high speeds for parked cars and the fluctuating estimates, I guess they have something to do with noisy optical flow, especially when an object is close to the edge of the image, where the flow values carry high uncertainty.
Hi @halajun
(1) About the tracking step: from your description, it looks like the system failed to track the objects temporally on the data you tested, so you added an extra tracking step to handle the tracking. Does it fail in all the cases you tested or only in a particular one? I may need more details to help you fix the issue.
In those datasets, I was restricted to using monodepth2 since I didn't have stereo input. This seems to degrade the results further when combined with the underlying issue that I have yet to figure out. The tracking does add an extra layer of smoothness to the data, but it doesn't fix the actual issue.
(2) Yes, we used sps-stereo to generate the depth for the KITTI dataset and treated it as RGB-D input; but we also applied monodepth2 to generate depth and reported those results as well, as shown in the paper (Table II). The results using monodepth2 are generally worse and may contain false estimates in some cases, which fits your guess.
That seems to fit, yeah. One question though: after going through the monodepth2 repository, it seems that what they output is either the metric depth or the scaled inverse depth (which they misleadingly call disp in the code), in contrast with the disparity value output by sps-stereo, which is multiplied by a factor. I see that your code handles this by using the baseline as the numerator in the stereo case and omitting it in the other case. Would it then be right, in the mono case, to use a depth factor of 1 when directly using the output of monodepth2?
(3) May I ask how large the differences between the results you mentioned are? If they are quite small, I think they are due to the non-deterministic parts of the system, such as the RANSAC processing.
Here are two versions of the outputs: one using the object_pose.txt generated on the fly, and the other using one generated from the KITTI labels. Barring the frames where the detections are dropped, the major difference is in the speed estimation for parked cars.
(4) The ground-truth label is not used to track an object; it only indicates which object it is so that the estimated label can be aligned with it and the error metrics (6-DoF and speed errors) can be computed. I think I see what is confusing here, as I did not explain it clearly in the readme.txt file. For the KITTI dataset, to make it easier to compare with the ground-truth object motion, I relabel the estimated semantic mask results to match the ground-truth semantic labels in KITTI. For example, if an estimated semantic mask labels three objects as 1, 2 and 3, they are relabelled to align with the ground-truth semantic labels, say 87, 88 and 89. In this way we can easily align the estimated objects with the ground-truth labels and compute the error metrics. But again, this does not help to track objects; it is only for computing errors. That is, if the system tracks an object whose temporal tracking ID ('nModLabel' in the code) is 1, it stays 1 as long as tracking succeeds. We also save the semantic label of this object ('nSemPosition' in the code), but it is not used for tracking. See lines 783-843 in Tracking.cc for more details.
Yeah, I saw that part; it initially blocked the speed estimation when I didn't have a proper ground-truth file.
(5) As for the noisy results, i.e., the high speeds for parked cars and the fluctuating estimates, I guess they have something to do with noisy optical flow, especially when an object is close to the edge of the image, where the flow values carry high uncertainty.
I did suspect the optical flow results, as they don't seem to be very accurate. So I compared my optical flow output to the one you provide for the KITTI demo sequence. Using the TensorFlow version of PWC-Net, I am getting a slightly different result from yours. Would you mind sharing which version you used and whether there are any specific parameters you tweaked?
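In case it is useful, this is roughly how I compared the two flows (a small standalone sketch that reads the standard Middlebury .flo layout and prints the mean end-point error; the file paths are placeholders for one of your demo flows and my own PWC-Net output):

```cpp
#include <cstdio>
#include <cmath>
#include <vector>

// Minimal Middlebury .flo reader: magic float 202021.25, then int32 width,
// int32 height, then width*height*2 little-endian floats (u,v interleaved).
static bool ReadFlo(const char* path, int& w, int& h, std::vector<float>& data)
{
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    float magic = 0.f;
    std::fread(&magic, sizeof(float), 1, f);
    std::fread(&w, sizeof(int), 1, f);
    std::fread(&h, sizeof(int), 1, f);
    data.resize(static_cast<size_t>(w) * h * 2);
    std::fread(data.data(), sizeof(float), data.size(), f);
    std::fclose(f);
    return std::fabs(magic - 202021.25f) < 1e-3f;
}

int main()
{
    int w1, h1, w2, h2;
    std::vector<float> a, b;
    // Placeholder paths: one flow from the demo sequence, one from my own PWC-Net run.
    if (!ReadFlo("demo_flow/000000.flo", w1, h1, a) ||
        !ReadFlo("my_flow/000000.flo",   w2, h2, b) || w1 != w2 || h1 != h2)
        return 1;
    double epe = 0.0;
    for (size_t i = 0; i < a.size(); i += 2)
    {
        double du = a[i] - b[i], dv = a[i + 1] - b[i + 1];
        epe += std::sqrt(du * du + dv * dv);
    }
    std::printf("mean end-point error: %f px\n", epe / (a.size() / 2));
    return 0;
}
```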
Hi @Amrsaeed,
In those datasets, I was restricted to using monodepth2 since I didn't have stereo input. This seems to degrade the results further when combined with the underlying issue that I have yet to figure out. The tracking does add an extra layer of smoothness to the data, but it doesn't fix the actual issue.
I think this is an issue of applying the pretrained monodepth2 model to a new dataset to get depth estimates, i.e., generalizability. One of our students tried something similar to what you are doing, applying the monodepth2 model directly to a self-collected dataset, and found the estimated depth to be quite poor. He then switched to an RGB-D setup to collect data instead. Anyway, to get better depth estimates on a new dataset, I think fine-tuning or retraining the model would probably be necessary.
I see that your code handles this by using the baseline as the numerator in the stereo case and omitting it in the other case. Would it then be right, in the mono case, to use a depth factor of 1 when directly using the output of monodepth2?
For the mono case, we obtain depth directly from the output of monodepth2, not disparity, but it needs to be divided by a depth factor, which is 500 in our case for KITTI. Overall, it depends on what your monodepth2 output looks like.
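In other words, the conversion boils down to something like the sketch below (this is only the idea, not the exact lines in Tracking.cc; `imD`, `bStereoInput`, `bf` (focal length times baseline) and `factor` are placeholder names):

```cpp
// Sketch of the idea only: turning the stored value at pixel (i,j) into metric depth.
float v = imD.at<float>(i, j);
if (bStereoInput)
    imD.at<float>(i, j) = bf / (v / factor);   // sps-stereo stores a scaled disparity
else
    imD.at<float>(i, j) = v / factor;          // monodepth2 already outputs depth
```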
Here are two versions of the outputs: one using the object_pose.txt generated on the fly, and the other using one generated from the KITTI labels. Barring the frames where the detections are dropped, the major difference is in the speed estimation for parked cars.
I saw the videos you uploaded, and I know what is going on here. The object_pose.txt file also stores the ground-truth bounding box of each object in the image, which is used only for displaying the estimated speed (this is mentioned in the README file). So if the object poses are randomly generated, the bounding boxes will also jump randomly onto those cars; that is why the parked cars occasionally appear to be detected, while the system is actually tracking the moving cars. Please see lines 487-512 in Tracking.cc.
Yeah, I saw that part; it initially blocked the speed estimation when I didn't have a proper ground-truth file.
The displayed speed would be affected without a proper ground-truth file, but the estimation itself still runs; you just need to check the right output, e.g., 'sp_est_norm' or 'vSpeed[i].x' in the code.
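For reference, the estimated speed essentially comes from the displacement of the object centroid under the estimated frame-to-frame motion, scaled by the frame rate. A sketch of the relation (with `H` the estimated 4x4 object motion and `c` the 3x1 object centroid, both CV_32F, and assuming a 10 Hz sequence as in KITTI):

```cpp
// Sketch: the centroid c moves to R*c + t under the motion H = [R t; 0 1],
// so its displacement over one frame is t - (I - R)*c.
cv::Mat R = H.rowRange(0, 3).colRange(0, 3);
cv::Mat t = H.rowRange(0, 3).col(3);
cv::Mat v = t - (cv::Mat::eye(3, 3, CV_32F) - R) * c;             // metres per frame
float speedKmh = static_cast<float>(cv::norm(v)) * 10.0f * 3.6f;  // 10 fps, m/s -> km/h
```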
I did suspect the optical flow results, as they don't seem to be very accurate. So I compared my optical flow output to the one you provide for the KITTI demo sequence. Using the TensorFlow version of PWC-Net, I am getting a slightly different result from yours. Would you mind sharing which version you used and whether there are any specific parameters you tweaked?
We used the PyTorch version of PWC-Net with the default pre-trained model and no fine-tuning. I believe there should not be much difference between the two versions, and I don't think your issue comes from here.
Hope this helps.
Hi,
I wanted to use your code to run predictions on a custom dataset. I generated the required input data: semantic masks, optical flow, and depth maps. The issue is with the object ground-truth poses. In your Tracking.cc file, if there is no ground truth for an object, the iteration is skipped and no speed estimation is done:
```cpp
if (!bCheckGT1 || !bCheckGT2)
{
    cout << "Found a detected object with no ground truth motion! ! !" << endl;
    mCurrentFrame.bObjStat[i] = false;
    mCurrentFrame.vObjMod_gt[i] = cv::Mat::eye(4,4, CV_32F);
    mCurrentFrame.vObjMod[i] = cv::Mat::eye(4,4, CV_32F);
    mCurrentFrame.vObjCentre3D[i] = (cv::Mat_<float>(3,1) << 0.f, 0.f, 0.f);
    mCurrentFrame.vObjSpeed_gt[i] = 0.0;
    mCurrentFrame.vnObjInlierID[i] = ObjIdNew[i];
    continue;
}
```
I don't have ground truth for my custom dataset, so do you have any advice on how to run with this input missing?
Thanks for your great work!