tsattler / visuallocalizationbenchmark

Pose overlap in the Aachen Day-Night dataset #22

Closed Enderych closed 4 years ago

Enderych commented 4 years ago

Hello, I ran some tests to calculate the intersection between two cameras' frustums in order to define their 'spatial similarity'. I found that in the db dataset (4328 images from _aachen_cvpr2018db.nvm), some images have very similar poses but are at different locations (verified by an appearance check), for example db/1045.jpg & db/2506.jpg, or db/1135.jpg & db/3355.jpg. I am wondering whether this dataset consists of several subsets (sub-models) of images, with some db images from different subsets having overlapping absolute poses. Is that right? If so, when we predict the absolute pose of a query image, could it be referenced to many different subsets? Thank you in advance.
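For reference, here is a rough sketch of the kind of overlap check I mean; the helper names and thresholds are purely illustrative, assuming NVM-style world-to-camera quaternions (w, x, y, z) and camera centers:

```python
import numpy as np

def quat_to_rot(q):
    """Unit quaternion (w, x, y, z) -> 3x3 world-to-camera rotation matrix."""
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def frustums_roughly_overlap(q1, c1, q2, c2, max_dist=20.0, max_angle_deg=60.0):
    """Cheap proxy for frustum intersection: camera centers closer than
    max_dist and viewing directions within max_angle_deg of each other.
    The thresholds are arbitrary examples."""
    d1 = quat_to_rot(q1)[2]  # viewing direction in world coords = 3rd row of R
    d2 = quat_to_rot(q2)[2]
    dist = np.linalg.norm(np.asarray(c1) - np.asarray(c2))
    angle = np.degrees(np.arccos(np.clip(d1 @ d2, -1.0, 1.0)))
    return dist < max_dist and angle < max_angle_deg
```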

Enderych commented 4 years ago

Or maybe I should ask the question another way: is it OK to extract the pose of each db image directly from _aachen_cvpr2018db.nvm? The file looks like this:

    NVM_V3

    4328
    db/1045.jpg 1084.59000000 -0.030550400000 -0.212047000000 0.120580000000 0.969311000000 738.038000000000 -6.174860000000 -41.837100000000 0.081098900000 0
    ...

So the pose of a query image will be predicted with reference to several db images at the same location, but those db images could have pose values very similar to images from a different location, is that right? If so, could a translation be added to every subset or location so that they no longer overlap?
tsattler commented 4 years ago

I am not sure I understand your question. What are you trying to achieve?

Enderych commented 4 years ago

I want to extract the pose of every db image so that I can evaluate the geometric relationships between them.

Here is what I have done: for every db image (4328 in total), I extracted the file name (db/1045.jpg in the example above), the quaternion WXYZ (-0.030550400000 -0.212047000000 0.120580000000 0.969311000000), and the position XYZ (-6.174860000000 -41.837100000000 0.081098900000) directly from _aachen_cvpr2018db.nvm. I then found that some images, even though they are geometrically close (short WXYZ & XYZ distances), look very different, which is why I think they were not captured at the same location (for example the pairs db/1045.jpg & db/2506.jpg and db/1135.jpg & db/3355.jpg).
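Concretely, the 'short WXYZ & XYZ distance' check is something like this sketch (assuming unit quaternions in w, x, y, z order; the helper name is hypothetical):

```python
import math

def pose_distance(q1, c1, q2, c2):
    """Angle (degrees) between two unit quaternions (w, x, y, z) and the
    Euclidean distance between two camera positions."""
    dot = min(1.0, abs(sum(a * b for a, b in zip(q1, q2))))  # sign-invariant
    angle_deg = math.degrees(2.0 * math.acos(dot))
    dist = math.dist(c1, c2)
    return angle_deg, dist
```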

Here is my assumption: several small models were built at different locations, and the image poses were normalized (referenced) to each submodel's reference point so that their XYZ values do not become too large. Images 1 and 2 might then come from submodels A and B, respectively: they could be very close according to their pose values while in fact being referenced to different reference points. That would explain why images 1 and 2 can look so different.

My suggestion: add a random translation to every submodel to avoid overlap between images from different submodels.

tsattler commented 4 years ago

There seems to be a problem in the way that you are reading out the positions. For db/1045.jpg, the position of the camera center in model coordinates is (738.038, -6.17486, -41.8371), as stored in the .nvm file. Maybe you are confusing position and translation?

The image pairs that you list are taken roughly 50 meters apart.

As described in the paper, we constructed a single reference model. No submodels were built and then aligned.

tsattler commented 4 years ago

Actually, looking at your XYZ coordinates, it seems that you are reading in the position incorrectly. The last entry (0.081098900000) is the radial distortion term, not the Z coordinate. You are missing the X coordinate and just storing Y, Z, radial distortion.
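For illustration, here is a minimal sketch of reading such a camera line under the NVM_V3 convention (name, focal length, w-x-y-z quaternion, camera center, radial distortion, trailing zero); the function name is hypothetical:

```python
# Minimal sketch: parse one camera line of an NVM_V3 file.
# Each db line follows: <name> <focal> <qw qx qy qz> <cx cy cz> <radial> 0

def parse_nvm_camera_line(line):
    tokens = line.split()
    name = tokens[0]                                # e.g. 'db/1045.jpg'
    focal = float(tokens[1])                        # 1084.59
    qwxyz = tuple(float(t) for t in tokens[2:6])    # rotation quaternion (w, x, y, z)
    center = tuple(float(t) for t in tokens[6:9])   # camera center in model coords
    radial = float(tokens[9])                       # radial distortion, NOT a coordinate
    return name, focal, qwxyz, center, radial

line = ("db/1045.jpg 1084.59 -0.0305504 -0.212047 0.12058 0.969311 "
        "738.038 -6.17486 -41.8371 0.0810989 0")
name, focal, q, c, radial = parse_nvm_camera_line(line)
print(c)  # (738.038, -6.17486, -41.8371), matching the position quoted above
```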

Enderych commented 4 years ago

OK, I see. I mistook the first value for the horizontal pixel center of the image and read the next three values as XYZ. Now I see I was wrong, thank you.

two more questions:

  1. Could you please also offer daytime query-db image pairs, just like the ones you offered for nighttime, i.e., 20 db candidates for every query image?

  2. When I upload the results to your benchmark, should they be two .txt files (day and night separately) with one line per image, in a format like this?

         filename_without_path W A B C X Y Z
         IMG_20161227_173116.jpg 0.0702079 0.835457 -0.0640913 -0.541271 -329.332 127.496 654.254
         ...

     Is that right?

tsattler commented 4 years ago

Regarding your questions:

  1. We will not release such information as we try to keep the reference poses as hidden as possible.
  2. You should upload a single file containing both the day and night images. The description of the file format from the readme file of the dataset is quoted below (see the sketch after this list for an example writer):

     Please submit your results as a text file using the following file format. For each query image for which your method has estimated a pose, use a single line. This line should store the result as name.jpg qw qx qy qz tx ty tz. Here, name corresponds to the filename of the image, without any directory names. qw qx qy qz represents the rotation from world to camera coordinates as a unit quaternion. tx ty tz is the camera translation (not the camera position).
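As a minimal sketch of writing that single submission file (the function name and the poses layout are illustrative assumptions, not from the benchmark's tooling; q rotates world to camera, t is the translation, not the position):

```python
def write_submission(path, poses):
    """poses: {filename without directories: ((qw, qx, qy, qz), (tx, ty, tz))}."""
    with open(path, "w") as f:
        for name, (q, t) in poses.items():
            # one line per query image: name.jpg qw qx qy qz tx ty tz
            f.write(" ".join([name] + [str(v) for v in (*q, *t)]) + "\n")
```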

Does this answer your question?

Enderych commented 4 years ago

I think I understand most of your answer, thank you. But this part confuses me: 'tx ty tz is the camera translation (not the camera position).' I thought tx ty tz is the camera translation from world to camera coordinates and, at the same time, the camera position in world coordinates. Is that right? Or do you mean that T is the translation from world to camera coordinates, t is the camera position in world coordinates, and R is the rotation from world to camera coordinates, so that T = Rt?

Enderych commented 4 years ago

But I still think T = t, since both of them are in world coordinates.

tsattler commented 4 years ago

This is from the readme file that comes with the Aachen dataset:

Conventions

The different types of models store poses in different formats.

We strongly recommend that you familiarize yourself with the file format of the models that you plan to use.

Does this answer your question?

Enderych commented 4 years ago

Yes, very clear. Thank you.

So I will use the R and c from the NVM file to train my model, predict R and c for every query image, compute the translation as t = -(R * c), and then upload the final result in a .txt file.
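A minimal sketch of that last conversion, assuming SciPy is available (note that SciPy expects quaternions in x, y, z, w order):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def center_to_translation(qwxyz, center):
    """World-to-camera quaternion (w, x, y, z) and camera center c -> t = -R @ c."""
    w, x, y, z = qwxyz
    R = Rotation.from_quat([x, y, z, w]).as_matrix()  # SciPy uses (x, y, z, w)
    return -R @ np.asarray(center)
```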

Next time I will look for answers in the dataset's readme file first. Thanks again.