nex-mpi / nex-code

Code release for NeX: Real-time View Synthesis with Neural Basis Expansion
MIT License
595 stars 73 forks source link

How to determine the parameters in the plane.txt? #26

Open tding1 opened 3 years ago

tding1 commented 3 years ago

Hi,

I am trying to reproduce the colmap results in your shiny datasets. I tried the exact same command for the scene "food". It generates 'hwf_cxcy.npy' and 'poses_bounds.npy', which are similar to your results.

However, how do I determine the 4 parameters in 'plane.txt'? For example, in your 'plane.txt', the four numbers for this scene are:

2.6 100 1 300

Where do these numbers come from? My colmap result shows:

Post-colmap
Images # 49
Points (50725, 3) Visibility (50725, 49)
Depth stats -0.06422890217140988 273.68360285004354 25.033925606886992

How do I determine the first two numbers from this information? And what about the last two numbers?

Moreover, I observe that for some other scenes 'plane.txt' contains only 3 numbers. For example, for the scene 'lab', the numbers in the file are

46 206 1

How do we deal with the missing 4th value? What does that mean?

Thank you very much!

pureexe commented 3 years ago

Here is the code that reads planes.txt as an input

https://github.com/nex-mpi/nex-code/blob/eeff38c712ac9a665f09d7c2a3fdf48ae83f4693/utils/sfm_utils.py#L93-L107

If the last number is missing, a default value of 200 is used: https://github.com/nex-mpi/nex-code/blob/eeff38c712ac9a665f09d7c2a3fdf48ae83f4693/utils/sfm_utils.py#L38

Please take a look at the explanation of the first 3 numbers. https://github.com/nex-mpi/nex-code/wiki/planes.txt---near---far---inverse


Here is more explanation that might help you.

First and Second number - near / far

To find the near and far planes (the first and second numbers), the easiest way is to take percentiles of the depths of the point cloud.

COLMAP also generates a point cloud in addition to the camera parameters. You can take a look at LLFF's code for how to compute the percentiles.

close_depth in the LLFF code is near / dmin in the NeX code; inf_depth in the LLFF code is far / dmax in the NeX code.

https://github.com/Fyusion/LLFF/blob/c6e27b1ee59cb18f054ccb0f87a90214dbe70482/llff/poses/pose_utils.py#L56-L88
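Concretely, the percentile trick can be sketched like this (the percentile values and scale factors below are illustrative choices, not LLFF's exact constants):

```python
import numpy as np

def estimate_near_far(z_vals, near_pct=0.1, far_pct=99.9):
    """Estimate near/far (dmin/dmax) from the z-depths of the COLMAP
    points seen by the reference camera. Low/high percentiles ignore
    outlier points; the margins widen the range slightly. Percentile
    values and margins here are illustrative, not LLFF's exact ones."""
    z_vals = np.asarray(z_vals)
    z_vals = z_vals[z_vals > 0]                          # drop points behind the camera
    close_depth = np.percentile(z_vals, near_pct) * 0.9  # near / dmin
    inf_depth = np.percentile(z_vals, far_pct) * 1.5     # far / dmax
    return close_depth, inf_depth
```

The percentiles matter because a single stray COLMAP point (like the negative minimum depth in the stats above) would otherwise pull dmin or dmax far outside the real scene range.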

Third number - inverse depth

The inverse-depth flag is either 0 or 1. If the scene contains things that are placed far away, set it to 1.

"Far away" means that when the camera moves, the object shows little (or even no) parallax at all.
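The practical effect of this flag is on how the plane depths are spaced between near and far; a sketch (illustrative, not the exact NeX code):

```python
import numpy as np

def plane_depths(dmin, dmax, n_planes, invz):
    """Depths of the MPI planes between dmin and dmax. With invz=1,
    planes are spaced uniformly in inverse depth (disparity), which
    packs them densely near the camera and sparsely in the distance;
    with invz=0 they are spaced uniformly in depth. Illustrative sketch."""
    if invz:
        return 1.0 / np.linspace(1.0 / dmin, 1.0 / dmax, n_planes)
    return np.linspace(dmin, dmax, n_planes)
```

With invz=1, a distant low-parallax background is still covered, but by only a few widely spaced planes, leaving most planes for the nearby content.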

Fourth number - offset

When you look at the images/2_mpic section in TensorBoard, you can see a gray border spacing between the black edge and the object.

If the offset is set to 0, the object will fall off the edge of the MPI and the reconstruction will look bad around the border.

But if the offset is set too high, training will take a lot of memory, and total variation regularization might wipe out high-frequency detail.

If you wish to use a large offset, you have to reduce the -tvc parameter (default is 0.03).

[image]

Feel free to ask if things are still not clear 😄😄😄

tding1 commented 3 years ago

Hi Pakkapon,

Thank you so much for the extremely helpful answer! After carefully checking your wiki and code, I still have some questions regarding those parameters:

  1. First of all, a very naïve question: what is the purpose of putting these 4 parameters in a separate 'plane.txt' file rather than together with the other parameters in a configuration of args? I know this might be a design question, but I am wondering if there is any intuition behind this practice, e.g., to highlight the importance of these 4 parameters? Or is it just a practice adopted from LLFF?

  2. Next, for the third parameter, does that mean we are free to choose it to be 0 or 1 depending on the scene? For example, the wiki page says that if the scene contains objects lying far away from the camera, we can choose inverse depth (set the third parameter to 1). However, I found that all the shiny datasets use inverse depth. For scenes like food or pasta, is it the case that the scene contains objects lying far away from the camera? I am a little confused about this point.

  3. For the first two parameters, it is said in the wiki that:

    NeX-MPI needs to select 1 camera to be a "Reference camera". The set of planes will forward-facing to this camera. NeX-MPI will automatically select the most centered camera. However, you can pick your own reference camera by providing the argument -ref_img

    My question is: must the "reference camera" be one of the cameras that took the pictures? My understanding of MPI is that we only need to select a virtual camera located at the centroid of all the cameras. In your example on the wiki page, what if we do not have the middle camera? Does that mean NeX-MPI will pick either the left or the right camera as the reference camera? My impression is that we need to construct a virtual middle camera facing the scene, whose field of view is chosen so that it covers the region of space visible to the union of these cameras. Correct me if I am wrong :)

  4. Suppose the "Reference camera" is one of the existing cameras, and we find it automatically with the code in load_llff. The README file for the shiny dataset says that

    The first two numbers are the distances from a reference camera to the first and last planes (near, far).

    Since we already have the reference camera, we can also read its close_depth and inf_depth, can't we? In fact, I see the LLFF dmin/dmax is defined this way (the reference_depth is obtained from the bds extracted from the last 2 columns of poses_bounds.npy): https://github.com/nex-mpi/nex-code/blob/eeff38c712ac9a665f09d7c2a3fdf48ae83f4693/utils/load_llff.py#L354-L361

    Then what is the purpose of specifying the first two numbers manually, instead of directly using the bounds dmin/dmax estimated by LLFF for that reference camera? I think your answer above also points to the estimated bounds stored in poses_bounds.npy:

    COLMAP also generates a point cloud besides of camera parameter. You can take a look at LLFF's code on how to do a percentile.

    close_depth in the LLFF code is near / dmin in the NeX code; inf_depth in the LLFF code is far / dmax in the NeX code.

    https://github.com/Fyusion/LLFF/blob/c6e27b1ee59cb18f054ccb0f87a90214dbe70482/llff/poses/pose_utils.py#L56-L88

    Then my impression is that we don't need to set dmin/dmax but can use the values from the LLFF bounds. Why do we need to manually specify the two numbers in the plane.txt file?

  5. For a specific example, the first 2 parameters in the plane.txt file of the dataset 'crest' are 4.7 200. Where do they come from? It seems they are not the ones extracted from the estimated bounds in poses_bounds.npy. Does this mean we are also free to choose the first 2 parameters as long as they cover the real range of scene depth? For example, I could take the minimum dmin over all cameras as the 1st parameter and the maximum dmax over all cameras as the 2nd, even though they may not come from the same camera. Is this doable?

Thank you in advance for your effort in answering these questions!

pureexe commented 3 years ago

1. what is the purpose of separating these 4 parameters in a 'plane.txt' file rather than putting them altogether with other parameters in a configuration of args?

Each scene has different near/far parameters, while the parameters listed in args are shared across every scene.

Actually, you can provide the arguments -dmin, -dmax, -invz and -offset manually (useful for debugging): https://github.com/nex-mpi/nex-code/blob/eeff38c712ac9a665f09d7c2a3fdf48ae83f4693/train.py#L70-L72

2.1 Does that mean that we are free to choose the inverse depth parameter to be 0 or 1 depends on the scene property?

Yes, it depends on the scene.

Choose 1 if the scene contains things placed far away that are not the important object. Inverse depth places planes more densely near the camera, so more planes are spent on the object close to the camera.

See 2.2 for an example: for pasta and food we want denser planes on the nearby object. The background is not so important, so we use only a few planes to represent it. That is why we set it to 1.

2.2 For scenes like food or pasta, is that the case that the scene contains objects lie far away from the camera

The green highlights show the far-away parts that we still want to represent in the scene, but with only a few planes.

[image: pasta_far]

[image: food]

3. must the "reference camera" be one of those cameras that takes the pictures

You can use a virtual camera as the reference camera. You can set dataset.sfm.ref_rT (transposed reference-camera rotation matrix) and dataset.sfm.ref_t (reference-camera translation vector) in this dataset variable:

https://github.com/nex-mpi/nex-code/blob/eeff38c712ac9a665f09d7c2a3fdf48ae83f4693/train.py#L539

However, be careful when picking a reference camera. Some rays (pixels in the MPI frustum) might never be seen by any image in the training set. We avoid this problem by selecting one of the training images as the reference camera, which guarantees that every ray is seen in at least one image.

4. Can we automatically find dmin/dmax by the code of load_llff?

Yes, we can. However, this requires the COLMAP point cloud to be good. Sometimes points are missing in some parts of an object (which usually affects dmin). So we need to re-adjust the near plane (dmin) to be closer to the camera.

5.1 scene 'crest' dmin/dmax are 4.7 200, where do they come from?

Red: requires the near plane to be very close to the camera. Green: requires the far plane to be very far from the camera.

[image: crest_near_far]

5.2 Does this mean we are also free to choose the first 2 parameters as long as they cover the real range of scene depth

We want to select a range that covers the entire scene. The near and far planes are measured from the reference camera. You can set them freely; a homography will warp each plane from the reference view into the training/testing views.
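This warping is the standard plane-induced homography. A sketch under the usual pinhole conventions (this is the textbook formula, not the exact NeX implementation):

```python
import numpy as np

def plane_homography(K_ref, K_tgt, R, t, depth):
    """Homography mapping reference-view pixels lying on the fronto-parallel
    plane n.X = depth (n = (0, 0, 1) in the reference frame) into a target
    view. R and t map reference-camera coordinates into target-camera
    coordinates. Textbook plane-induced homography, not the NeX code."""
    n = np.array([0.0, 0.0, 1.0])
    H = K_tgt @ (R + np.outer(t, n) / depth) @ np.linalg.inv(K_ref)
    return H / H[2, 2]
```

Because each plane gets its own homography, any near/far range you pick is rendered consistently in every view; the only question is how well the planes cover the actual scene content.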

However, you still have to be careful when freely picking dmin/dmax.

If dmin/dmax don't cover the entire scene, parts of the scene will fall outside the MPI volume and cannot be reconstructed.

If dmin/dmax are set too large in order to cover everything, some planes might have nothing on them; those planes in images/2_mpic become entirely gray. This makes NeX use fewer planes to represent the scene, leading to this kind of artifact:

[image]

JiuTongBro commented 3 years ago

Hi,

I read the depth-generation code in LLFF and checked the point cloud file. So the source point cloud itself misses some parts of the scene, right? Does that mean we can't strictly derive the correct d_min and d_max from the provided data, and can only re-adjust the depth boundary manually?

Thanks a lot!

pureexe commented 3 years ago

I think so. We adjust manually to maximize MPI utilization. However, the depth generation from LLFF produces reasonable dmin/dmax in most cases.

shamafiras commented 3 years ago

Hey @pureexe, can you please expand on your earlier answer about the impact of total variation with large offsets: "But if set the offset too much, it will take a lot of memory while training. and total variation regularization might wipe out high-frequency detail."

How can a large offset cause total variation regularization to wipe out high-frequency details? Since TV is a per-pixel loss, how can adding more pixels (a larger offset) affect the quality of the original pixels?

Thanks for your incredible-quality paper and great support.
Firas

pureexe commented 3 years ago

Total variation is used for smoothing mpi_c.

With a larger offset there are more pixels to smooth, which leads to a higher total variation loss.

The network will then try to reduce the loss by smoothing mpi_c instead of reducing the reconstruction error. This can lead to poor results.
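To make the mechanism concrete, here is a minimal sketch of an anisotropic TV penalty over an MPI slab (illustrative, not the exact NeX loss):

```python
import numpy as np

def total_variation(mpi_c):
    """Anisotropic total variation of an MPI tensor of shape (planes, H, W):
    the sum of absolute differences between horizontally and vertically
    neighboring pixels. Enlarging the canvas (a bigger offset) adds more
    neighbor pairs, so the TV term grows and the smoothing pressure on
    mpi_c increases. Illustrative sketch, not the exact NeX loss."""
    dh = np.abs(mpi_c[:, 1:, :] - mpi_c[:, :-1, :]).sum()
    dw = np.abs(mpi_c[:, :, 1:] - mpi_c[:, :, :-1]).sum()
    return dh + dw
```

Since the penalty sums over all neighbor pairs, extra border pixels contribute extra terms; with a fixed TV weight (-tvc), the optimizer then trades some reconstruction fidelity for extra smoothness, which is why the weight should be lowered for large offsets.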