vislearn / dsacstar

DSAC* for Visual Camera Re-Localization (RGB or RGB-D)

Train Cambridge #5

Closed · Song-Jingyu closed this issue 3 years ago

Song-Jingyu commented 3 years ago

Hi!

I want to confirm the parameters used to train on the Cambridge Landmarks dataset. Did you use the default parameters to achieve the reported accuracies (RGB only)? I am using dsacstar to train on a custom outdoor dataset with RGB only, and it did not work well with the default parameters. Could you suggest a direction for tuning the parameters? Thanks so much!

Song-Jingyu commented 3 years ago

I should add that the dataset I use, NCLT, was collected over a large area, and I sample a portion of it. The best eval result I got from dsacstar is ~18 m translation error on the training set. I think that is far from the expected result, since Cambridge Landmarks achieves less than 1 m translation error. Could you suggest possible directions for improving the performance?

Besides, the images I use are not rotated upright; should I rotate them 90 degrees so that they look more like what a human would see? And should I undistort the images? Does undistortion help?

Thanks so much, and I look forward to your response!

ebrach commented 3 years ago

Hi Jingyu,

DSAC* uses the same parameters for all datasets, including Cambridge Landmarks.

Regarding your use case on a new dataset, I would look out for the following things:

Best, Eric

Song-Jingyu commented 3 years ago

Hi Eric,

Thanks for your response! That makes a lot of sense!

For the camera pose, I think I handled it correctly. The extent of the scene is about 300 m x 130 m; does that exceed the capability of dsacstar?

I was also curious about when I should end the first-stage training. I found that the loss fluctuates around 40, and the mean loss over a whole epoch (after a couple of epochs) decreases only very slightly. Is that OK, or should I let stage 1 run as long as the loss is decreasing?

I also tried different target depths (10 and 20) and trained the first stage for about 120 epochs. The network trained with target depth 20 has a lower loss in the second stage (I only ran one epoch; the loss for depth 10 is 3.2e+06, for depth 20 it is 2.97e+06). Does this mean I have found one of the right directions for tuning the parameters? Could you suggest which parameters are worth trying (target depth, min depth, max depth, inittolerance)? I have attached a sample picture for your reference.

For the second stage, we observed that the loss is still around 700 after training, so the result is not good at all. Should I change parameters such as -ia, -t, -hyps, -sc? Besides, we observed that at test time the rotation error is satisfactory but the translation error is very large (around 100 m). Should we increase the weight of the translation part of the pose loss?

Sorry I have so many questions, but I am really struggling to produce justifiable results for my course project, so any help from you would be highly appreciated! Thanks so much!

Best, Jingyu

[Attached sample image: 2012-01-08_1326031230131494 (color)]

ebrach commented 3 years ago

Hi Jingyu,

The loss for the first stage sounds quite large; you would want something below 10. The second stage, with a loss > 1e6, is not starting from a sensible point, which suggests that the first stage failed.
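On the related question of when to stop stage 1: one simple, ad-hoc criterion (not something implemented in dsacstar; the thresholds below are placeholders) is to stop once the epoch-mean loss has not improved by a meaningful relative margin for a few consecutive epochs. A minimal sketch:

```python
# Sketch only (not part of dsacstar): stop stage-1 training once the
# epoch-mean initialization loss stops improving by a meaningful margin.
# rel_improvement and patience are arbitrary placeholder values.

def should_stop(epoch_losses, rel_improvement=0.01, patience=5):
    """Return True if the mean loss improved by less than `rel_improvement`
    (relative) in every one of the last `patience` epoch-to-epoch steps."""
    if len(epoch_losses) <= patience:
        return False
    recent = epoch_losses[-(patience + 1):]
    for prev, cur in zip(recent[:-1], recent[1:]):
        if (prev - cur) / max(prev, 1e-9) > rel_improvement:
            return False  # still improving somewhere in the window
    return True

# Usage: append the mean loss after each epoch and check.
# epoch_losses.append(mean_loss_this_epoch)
# if should_stop(epoch_losses): stop stage 1 and move on to stage 2
```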

The scene is quite large, and this could very well be the bottleneck here. I would suggest splitting the dataset into smaller parts, as described in the ESAC paper (https://github.com/vislearn/esac). If DSAC* works on the smaller parts, you know that the scene size was indeed the problem. You do not need to implement/port the whole ESAC scene-classification part if you do not care too much about efficiency. Given a test image, you can just iterate over all (part-)networks and return the pose with the largest inlier count across all networks.
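As a rough illustration of that test-time loop (not code from this repo; `localize` stands in for a hypothetical wrapper around a single DSAC* part-network that returns an estimated pose and its inlier count):

```python
# Sketch only: a poor man's ESAC at test time, without the scene classifier.
# `localize(net, image)` is a hypothetical wrapper around one DSAC*
# (part-)network returning (pose, inlier_count); wire it to your pipeline.

def localize_over_parts(part_networks, image, localize):
    best_pose, best_inliers = None, -1
    for net in part_networks:
        pose, inliers = localize(net, image)  # run DSAC* for this scene part
        if inliers > best_inliers:            # keep the most supported pose
            best_pose, best_inliers = pose, inliers
    return best_pose, best_inliers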

As said before, the image distortion could be a factor here too. Ideally you would undistort all images as a pre-processing step.
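As a concrete (hypothetical) example of such a pre-processing step, here is a sketch using OpenCV, assuming a standard pinhole camera with radial/tangential distortion; for NCLT's cameras the undistortion maps shipped with the dataset may be more appropriate, and all intrinsics and distortion coefficients below are placeholders:

```python
# Sketch only: undistort images once, before building the dsacstar dataset.
import cv2
import numpy as np

fx, fy, cx, cy = 400.0, 400.0, 320.0, 240.0      # placeholder intrinsics
k1, k2, p1, p2, k3 = -0.3, 0.1, 0.0, 0.0, 0.0    # placeholder distortion

K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]], dtype=np.float64)
dist = np.array([k1, k2, p1, p2, k3], dtype=np.float64)

img = cv2.imread("frame.png")
h, w = img.shape[:2]

# Compute a camera matrix for the undistorted image, then undistort.
new_K, roi = cv2.getOptimalNewCameraMatrix(K, dist, (w, h), 0)
undist = cv2.undistort(img, K, dist, None, new_K)
cv2.imwrite("frame_undistorted.png", undist)

# Note: after undistortion, the focal length you pass to DSAC* should come
# from new_K, not from the original K.
```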

Best, Eric