vislearn / dsacstar

DSAC* for Visual Camera Re-Localization (RGB or RGB-D)

Train Cambridge #5

Closed · Song-Jingyu closed this issue 3 years ago

Song-Jingyu commented 3 years ago

Hi!

I want to confirm the parameters used to train on the Cambridge Landmarks dataset. Did you use the default parameters to achieve the reported accuracies (RGB only)? I am using dsacstar to train on a custom outdoor dataset with RGB only, and it did not work well with the default parameters. Could you suggest a direction for tuning the parameters? Thanks so much!

Song-Jingyu commented 3 years ago

I should add that the dataset I use, NCLT, was collected over a large area, and I sample a portion of it. The best eval result I got from dsacstar is ~18 m translation error on the training set. I think that is far from the expected result, since Cambridge Landmarks achieves less than 1 m translation error. Could you suggest possible directions for improving the performance?

Besides, the images I use are not rotated upright; should I rotate them 90 degrees so that they look more like what a human would see? And should I undistort the images? Does undistortion help?

Thanks so much, and I look forward to your response!

ebrach commented 3 years ago

Hi Jingyu,

DSAC* uses the same parameters for all datasets, including Cambridge Landmarks.

Regarding your use case on a new dataset, I would look out for the following things:

Best, Eric

Song-Jingyu commented 3 years ago

Hi Eric,

Thanks for your response! That makes a lot of sense!

For the camera pose, I think I handled it correctly. The extent of the scene is about 300 m x 130 m; does that exceed the capability of dsacstar?

I was also curious about when I should end the first-stage training. I found that the loss fluctuates around 40, and the mean loss over a whole epoch (after a couple of epochs) decreases only very slightly. Is that OK, or should I let stage 1 run as long as the loss is decreasing?

I also tried different target depths (10 and 20) and trained the first stage for about 120 epochs. The network trained with target depth 20 has a lower loss in the second stage (I only ran one epoch; the loss for depth 10 is 3.2e+06, for depth 20 it is 2.97e+06). Does this mean I have found one of the right directions for tuning the parameters? Could you suggest which parameters are worth trying (target depth, min depth, max depth, inittolerance)? I have attached a sample picture for your reference.

For the second stage, we observed that the loss is still around 700 after training, so the result is not good at all. Should I change parameters such as -ia, -t, -hyps, -sc? Besides, we observed that at test time the rotation error is satisfactory but the translation error is very large (around 100 m). Should we increase the weight of the translation part of the pose loss?

Sorry I have so many questions, but I am really struggling to produce justifiable results for my course project, so any help from you would be highly appreciated! Thanks so much!

Best, Jingyu

[Attached sample image: 2012-01-08_1326031230131494 (color)]

ebrach commented 3 years ago

Hi Jingyu,

The loss for the first stage sounds quite large; you would want something below 10. The second stage, with a loss > 1e6, is not starting from a sensible point, which suggests that the first stage failed.
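On the related question of when to stop stage 1: one simple, ad-hoc criterion (not something implemented in dsacstar; the thresholds below are placeholders) is to stop once the epoch-mean loss has not improved by a meaningful relative margin for a few consecutive epochs. A minimal sketch:

```python
# Sketch only (not part of dsacstar): stop stage-1 training once the
# epoch-mean initialization loss stops improving by a meaningful margin.
# rel_improvement and patience are arbitrary placeholder values.

def should_stop(epoch_losses, rel_improvement=0.01, patience=5):
    """Return True if the mean loss improved by less than `rel_improvement`
    (relative) in every one of the last `patience` epoch-to-epoch steps."""
    if len(epoch_losses) <= patience:
        return False
    recent = epoch_losses[-(patience + 1):]
    for prev, cur in zip(recent[:-1], recent[1:]):
        if (prev - cur) / max(prev, 1e-9) > rel_improvement:
            return False  # still improving somewhere in the window
    return True

# Usage: append the mean loss after each epoch and check.
# epoch_losses.append(mean_loss_this_epoch)
# if should_stop(epoch_losses): stop stage 1 and move on to stage 2
```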

The scene is quite large, and this could very well be the bottleneck here. I would suggest splitting the dataset into smaller parts, as described in the ESAC paper (https://github.com/vislearn/esac). If DSAC* works on the smaller parts, you know that the scene size was indeed the problem. You do not need to implement/port the whole ESAC scene-classification part if you do not care too much about efficiency. Given a test image, you can just iterate over all (part-)networks and return the pose with the largest inlier count across all networks.
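As a rough illustration of that test-time loop (not code from this repo; `localize` stands in for a hypothetical wrapper around a single DSAC* part-network that returns an estimated pose and its inlier count):

```python
# Sketch only: a poor man's ESAC at test time, without the scene classifier.
# `localize(net, image)` is a hypothetical wrapper around one DSAC*
# (part-)network returning (pose, inlier_count); wire it to your pipeline.

def localize_over_parts(part_networks, image, localize):
    best_pose, best_inliers = None, -1
    for net in part_networks:
        pose, inliers = localize(net, image)  # run DSAC* for this scene part
        if inliers > best_inliers:            # keep the most supported pose
            best_pose, best_inliers = pose, inliers
    return best_pose, best_inliers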

As said before, the image distortion could be a factor here too. Ideally you would undistort all images as a pre-processing step.
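As a concrete (hypothetical) example of such a pre-processing step, here is a sketch using OpenCV, assuming a standard pinhole camera with radial/tangential distortion; for NCLT's cameras the undistortion maps shipped with the dataset may be more appropriate, and all intrinsics and distortion coefficients below are placeholders:

```python
# Sketch only: undistort images once, before building the dsacstar dataset.
import cv2
import numpy as np

fx, fy, cx, cy = 400.0, 400.0, 320.0, 240.0      # placeholder intrinsics
k1, k2, p1, p2, k3 = -0.3, 0.1, 0.0, 0.0, 0.0    # placeholder distortion

K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]], dtype=np.float64)
dist = np.array([k1, k2, p1, p2, k3], dtype=np.float64)

img = cv2.imread("frame.png")
h, w = img.shape[:2]

# Compute a camera matrix for the undistorted image, then undistort.
new_K, roi = cv2.getOptimalNewCameraMatrix(K, dist, (w, h), 0)
undist = cv2.undistort(img, K, dist, None, new_K)
cv2.imwrite("frame_undistorted.png", undist)

# Note: after undistortion, the focal length you pass to DSAC* should come
# from new_K, not from the original K.
```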

Best, Eric