Open tding1 opened 3 years ago
Here is the code that reads planes.txt as an input. If the last number is missing, it will use a default value, which is 200: https://github.com/nex-mpi/nex-code/blob/eeff38c712ac9a665f09d7c2a3fdf48ae83f4693/utils/sfm_utils.py#L38
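For reference, a minimal sketch of the parsing logic described there (the linked `sfm_utils.py` is the authoritative version; this is just to show the fallback):

```python
def read_planes(path, default_offset=200):
    """Parse planes.txt: near (dmin), far (dmax), inverse-depth flag, optional offset."""
    with open(path) as f:
        values = f.read().split()
    dmin, dmax = float(values[0]), float(values[1])
    invz = int(values[2])  # 0 = linear depth spacing, 1 = inverse depth spacing
    # If the fourth number is missing, fall back to the default offset of 200.
    offset = int(values[3]) if len(values) > 3 else default_offset
    return dmin, dmax, invz, offset
```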
Please take a look at the explanation of the first 3 numbers. https://github.com/nex-mpi/nex-code/wiki/planes.txt---near---far---inverse
Here is more explanation that might help you.
About the method to find the near and far planes (the first and the second number): the easiest way is to use percentiles of the locations of the point cloud. COLMAP also generates a point cloud in addition to the camera parameters; you can take a look at LLFF's code on how to do the percentile. `close_depth` in the LLFF code is near/`dmin` in the NeX code, and `inf_depth` in the LLFF code is far/`dmax` in the NeX code.
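As a rough sketch of that percentile idea (the percentile values and the camera convention here are illustrative assumptions, not the exact LLFF/NeX routine):

```python
import numpy as np

def estimate_near_far(points_world, c2w, lo_pct=0.1, hi_pct=99.9):
    """Estimate near/far (dmin/dmax) from a sparse COLMAP point cloud.

    points_world: (N, 3) points in world coordinates
    c2w:          (3, 4) camera-to-world matrix of the reference camera
    Assumes the camera looks along its +z axis; flip the sign for OpenGL-style poses.
    """
    cam_center = c2w[:, 3]
    view_dir = c2w[:, 2]                        # camera z-axis expressed in world coordinates
    depths = (points_world - cam_center) @ view_dir
    depths = depths[depths > 0]                 # drop points behind the camera
    dmin = np.percentile(depths, lo_pct)        # robust "closest" depth -> near
    dmax = np.percentile(depths, hi_pct)        # robust "farthest" depth -> far
    return dmin, dmax
```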
Inverse depth is either 0 or 1. If the scene contains things that are placed far away, set it to 1. "Far away" means that when the camera moves, the object shows little (or even no) parallax at all.
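To see what the flag changes, here is a small sketch of the two usual plane-spacing schemes (linear in depth vs. linear in disparity); the exact formula in nex-code may differ:

```python
import numpy as np

def plane_depths(dmin, dmax, n_planes, invz):
    """Depths of the MPI planes between dmin and dmax."""
    if invz:
        # Uniform in 1/depth: planes bunch up near the camera,
        # and only a few planes cover the far-away background.
        return 1.0 / np.linspace(1.0 / dmin, 1.0 / dmax, n_planes)
    # Uniform in depth: planes are spread evenly between near and far.
    return np.linspace(dmin, dmax, n_planes)
```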
When you look at the `images/2_mpic` section in TensorBoard, you can see the gray border that provides spacing between the black edge and the object. If the offset is set to 0, the object will fall off the MPI_C and the reconstruction will look bad around the edges.
But if the offset is set too high, it will take a lot of memory while training, and the total variation regularization might wipe out high-frequency detail. If you wish to use a large offset, you have to reduce the `-tvc` parameter (default is 0.03).
Feel free to ask if things are still not clear enough 😄😄😄
Hi Pakkapon,
Thank you so much for the extremely helpful answer! After carefully checking your wiki and code, I still have some questions regarding those parameters:
First of all, a very naïve question: what is the purpose of separating these 4 parameters into a 'planes.txt' file rather than putting them together with the other parameters in the args configuration? I know this might be a design question, but I am just wondering if there is any intuition behind this practice, e.g., to highlight the importance of these 4 parameters? Or maybe it is just a practice adopted from LLFF?
Next, for the third parameter, does that mean we are free to choose it to be 0 or 1 depending on the scene properties? For example, the wiki page says that if the scene contains objects that lie far away from the camera, we can choose inverse depth (the third parameter set to 1). However, I found that all the shiny datasets use the inverse-depth setting. For scenes like food or pasta, is it the case that the scene contains objects that lie far away from the camera? I am a little bit confused about this point.
For the first two parameters, it is said in the wiki that:
NeX-MPI needs to select 1 camera to be a "Reference camera". The set of planes will be forward-facing to this camera. NeX-MPI will automatically select the most centered camera. However, you can pick your own reference camera by providing the argument -ref_img
My question is: must the "reference camera" be one of the cameras that took the pictures? My understanding of MPI is that we only need to select a virtual camera at the centroid of all the cameras. In your example on the wiki page, what if we do not have the middle camera? Does that mean NeX-MPI will pick either the left or the right camera to be the reference camera? My impression is that we need to construct a virtual middle camera facing towards the scene, whose field of view is chosen such that it covers the region of space visible to the union of these cameras. Correct me if I am wrong :)
Suppose the "Reference camera" is one of those existing cameras and we automatically find it using the code of `load_llff`. The README file for the shiny dataset says that:

The first two numbers are the distances from a reference camera to the first and last planes (near, far).

Since we already have the reference camera, we can also read its `close_depth` and `inf_depth`, can't we? In fact, I see that the LLFF `dmin`/`dmax` is defined this way (the `reference_depth` is obtained from the `bds` extracted from the last 2 columns of `poses_bounds.npy`):
https://github.com/nex-mpi/nex-code/blob/eeff38c712ac9a665f09d7c2a3fdf48ae83f4693/utils/load_llff.py#L354-L361
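For concreteness, this is roughly how I read the per-image bounds out of `poses_bounds.npy` (a sketch following the LLFF layout, not code from the repo):

```python
import numpy as np

arr = np.load("poses_bounds.npy")          # shape (num_images, 17) in the LLFF layout
poses = arr[:, :-2].reshape(-1, 3, 5)      # 3x5 blocks: camera pose plus [height, width, focal]
bds = arr[:, -2:]                          # last 2 columns: per-image [near, far] bounds

ref_idx = 0                                # index of the chosen reference camera (example)
close_depth, inf_depth = bds[ref_idx, 0], bds[ref_idx, 1]
print(close_depth, inf_depth)
```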
Then what is the purpose of specifying the first two numbers manually, instead of directly using these estimated bounds `dmin`/`dmax` from LLFF for that reference camera? I think your answer above also points to the estimated bounds stored in `poses_bounds.npy`:

COLMAP also generates a point cloud in addition to the camera parameters. You can take a look at LLFF's code on how to do the percentile. `close_depth` in the LLFF code is near/`dmin` in the NeX code, and `inf_depth` in the LLFF code is far/`dmax` in the NeX code.

So my impression is that we don't need to set `dmin`/`dmax` manually but can use the values from the LLFF bounds. Why do we need to manually specify the two numbers in the planes.txt file?
For a specific example, the first 2 parameters in the planes.txt file of the dataset 'crest' are 4.7 200; where do they come from? It seems they are not the ones extracted from the estimated bounds in poses_bounds.npy. Does this mean we are also free to choose the first 2 parameters as long as they cover the real range of scene depths? For example, I could choose the minimum `dmin` over all the cameras as the 1st parameter and the maximum `dmax` over all the cameras as the 2nd, though they may not be associated with the same camera. Is this doable?
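In other words, something like this (hypothetical, just to make the question concrete):

```python
import numpy as np

bds = np.load("poses_bounds.npy")[:, -2:]  # per-image [near, far] bounds, as above
dmin_global = bds[:, 0].min()              # closest bound over all cameras -> 1st number?
dmax_global = bds[:, 1].max()              # farthest bound over all cameras -> 2nd number?
```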
Thank you in advance for your effort in answering these questions!
Each scene has different near/far parameters, while the parameters listed in args are shared across every scene.
Actually, you can provide the arguments `-dmin`, `-dmax`, `-invz`, and `-offset` manually (useful for debugging):
https://github.com/nex-mpi/nex-code/blob/eeff38c712ac9a665f09d7c2a3fdf48ae83f4693/train.py#L70-L72
It depends on the scene properties. Choose 1 if the scene contains things that are placed far away but are not important objects. That way, we use denser planes for the objects near the camera. See 2.2 for an example: we want to use denser planes on the pasta and food, while the background is not so important, so we use only a few planes to represent it. That is why we set it to 1. The green highlight shows the far-away part that we want to represent in the scene, but with only a few planes.
You can use a virtual camera as the reference camera. You can set `dataset.sfm.ref_rT` (reference camera rotation matrix, transposed) and `dataset.sfm.ref_t` (reference camera translation vector) on this dataset variable:
https://github.com/nex-mpi/nex-code/blob/eeff38c712ac9a665f09d7c2a3fdf48ae83f4693/train.py#L539
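A minimal sketch of setting those two attributes, assuming a 3x3 world-to-camera rotation and a 3-vector translation (the shape conventions here are assumptions; verify against `train.py`):

```python
import numpy as np

def set_virtual_reference(dataset, rotation, translation):
    """Attach a virtual reference camera to the nex-code dataset object.

    The attribute names come from the reply above (dataset.sfm.ref_rT / ref_t);
    the expected shapes are assumptions and should be checked against the repo.
    """
    dataset.sfm.ref_rT = np.asarray(rotation).T                 # rotation stored transposed
    dataset.sfm.ref_t = np.asarray(translation).reshape(3, 1)   # translation vector
    return dataset
```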
However, be careful when picking a reference camera. Some rays (pixels in the MPI frustum) might never be seen by any image in the training set. We avoid this problem by selecting one of the training images as the reference camera, which guarantees that each ray was seen in at least 1 image.
Yes, we can. However, this requires the point cloud from COLMAP to be good. Sometimes points are missing in some parts of an object (which usually affects dmin), so we need to re-adjust the near plane (dmin) to be closer to the camera.
Red: requires the near plane to be very close to the camera. Green: requires the far plane to be very far from the camera.
We want to select a range that covers the entire scene. The near and far planes are defined with respect to the reference camera. You can set them freely; the homography will warp the planes from the reference view into the training/testing views.
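For reference, the standard plane-induced homography looks like this (textbook formula, not necessarily the exact routine nex-code uses):

```python
import numpy as np

def plane_homography(K_ref, K_tgt, R, t, depth, normal=np.array([0.0, 0.0, 1.0])):
    """Homography that warps a plane at `depth` in the reference view into a target view.

    K_ref, K_tgt: 3x3 intrinsics of the reference and target cameras
    R, t:         rotation and translation taking reference coordinates to target coordinates
    normal:       plane normal in the reference frame (fronto-parallel plane by default)
    """
    H = K_tgt @ (R - np.outer(t, normal) / depth) @ np.linalg.inv(K_ref)
    return H
```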
However, you still have to be careful when picking `dmin`/`dmax` freely. If `dmin`/`dmax` doesn't cover the entire scene, part of the scene cannot be placed on any plane. If `dmin`/`dmax` is set too large in order to cover everything, some planes might have nothing on them, and those planes in `images/2_mpic` become entirely gray. That makes NeX use fewer planes to represent the scene, leading to this kind of artifact.
Hi,
I read the code of the depth generation in LLFF and checked the point cloud file. So the source point cloud itself misses some parts of the scene, right? Does that mean we can't strictly generate the correct d_min and d_max from the provided data, and can only manually re-adjust the depth boundaries?
Thanks a lot!
I think so. We manually adjust to maximize MPI utilization. However, the depth generation from LLFF produces reasonable dmin/dmax in almost all cases.
Hey @pureexe, can you please explain your earlier answer regarding the impact of total variation with large offsets: "But if the offset is set too high, it will take a lot of memory while training, and the total variation regularization might wipe out high-frequency detail." How can a large offset cause total variation regularization to wipe out high-frequency details? As TV is a per-pixel loss, how can adding more pixels (a larger offset) impact the quality of the original pixels? Thanks for your incredible-quality paper and great support. Firas
Total variation is used for smoothing mpi_c. With a larger offset, there are more pixels to smooth, which introduces more noise and leads to a higher total variation loss. The network will then try to reduce the loss by smoothing mpi_c instead of reducing the reconstruction error, which can lead to poor results.
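As a rough illustration of the term in question (a generic anisotropic TV, not the exact `-tvc` implementation in nex-code):

```python
import numpy as np

def total_variation(img):
    """Sum of absolute differences between neighbouring pixels of a 2D image.

    A larger offset enlarges the MPI_C canvas, so more pixel differences enter
    this sum and the TV term carries more relative weight against the
    reconstruction loss.
    """
    dx = np.abs(np.diff(img, axis=1)).sum()   # horizontal neighbour differences
    dy = np.abs(np.diff(img, axis=0)).sum()   # vertical neighbour differences
    return dx + dy
```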
Hi,
I am trying to reproduce the COLMAP results on your shiny datasets. I ran the exact same command for the scene "food". It generates 'hwf_cxcy.npy' and 'poses_bounds.npy', which are similar to your results.
However, how do I determine the 4 parameters in 'planes.txt'? For example, in your 'planes.txt', the four numbers for this scene are:
2.6 100 1 300
Where do these numbers come from? My COLMAP result shows:
Post-colmap
Images # 49
Points (50725, 3) Visibility (50725, 49)
Depth stats -0.06422890217140988 273.68360285004354 25.033925606886992
How do you determine the first two numbers in your 'planes.txt' based on this information? And what about the last two numbers?
Moreover, I observe that for some other scenes 'planes.txt' contains only 3 numbers. For example, for the scene 'lab', the numbers in the file are
46 206 1
How is the missing 4th value handled? What does that mean?
Thank you very much!