google / lasr

Code for "LASR: Learning Articulated Shape Reconstruction from a Monocular Video". CVPR 2021.
https://lasr-google.github.io/
Apache License 2.0

LASR with known camera intrinsics/extrinsics #10

Open ecmjohnson opened 3 years ago

ecmjohnson commented 3 years ago

Hello, I would like to run LASR with known camera intrinsics & extrinsics. I believe this is already implemented, but I'm having some trouble understanding how to accomplish it myself. The mechanism seems to be two-fold: the use_gtpose option and providing per-frame camera files (parsing code here). Could you clarify the functionality of these mechanisms? I was unable to find an example that made use of either, but if I missed one or you have one, that would also be helpful.

Another thing that confuses me is the scaling of the scale (lol) when use_gtpose is set. The focal length is assigned the same way whether or not the camera files are provided, which makes me think these two mechanisms might have different purposes and that I am incorrectly conflating them.

Any clarification you can provide would be much appreciated! Thanks!

gengshan-y commented 3 years ago

Hi Erik, the "with gt camera" option is not tested in the latest codebase. I believe some non-trivial modifications are needed. It would be easiest to start from the synthetic spot data, following the suggestions below.

First, you will need to pass --use_gtpose to the program call.

Then, you will need to prepare camera files compatible with the parsing code. The files are stored as "database/DAVIS/Camera/Full-Resolution/sequence-name/00xxx.txt" and contain the values of the cammat array used in synthetic data generation, i.e. [focal length, translation-x, translation-y, quaternion-w, quaternion-x, quaternion-y, quaternion-z, translation-z]. Note the translation and rotation quaternion are coordinate transformations from object to camera space. We assumed the principal point to be (0,0) in normalized device coordinates and therefore did not record it.
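For concreteness, here is a minimal sketch of writing one such per-frame camera file (not code from the repo; the helper name and the example translation/rotation values are assumptions, while the path pattern and value order follow the description above):

```python
import numpy as np

def write_lasr_camera_file(path, focal, trans_xyz, quat_wxyz):
    # One column of 8 values: [focal, tx, ty, qw, qx, qy, qz, tz],
    # where the rotation/translation map object to camera coordinates.
    tx, ty, tz = trans_xyz
    qw, qx, qy, qz = quat_wxyz
    cammat = np.array([focal, tx, ty, qw, qx, qy, qz, tz], dtype=np.float64)
    np.savetxt(path, cammat)

# Example for frame 5 of a sequence named "spot3" (translation/rotation values are placeholders):
write_lasr_camera_file(
    "database/DAVIS/Camera/Full-Resolution/spot3/00005.txt",
    focal=10.0,                      # the synthetic spot focal length mentioned later in this thread
    trans_xyz=(0.0, 0.0, 3.0),       # assumed object-to-camera translation
    quat_wxyz=(1.0, 0.0, 0.0, 0.0),  # identity rotation
)
```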

What the parsing code does is transform the camera intrinsics from the raw image coordinates to the cropped and scaled image coordinates.
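As a rough illustration of that step (this is not the repo's parsing code; the function and variable names are assumptions): cropping shifts the principal point, and resizing scales both the focal length and the principal point.

```python
def crop_and_scale_intrinsics(focal, ppoint, crop_xy, crop_wh, out_wh):
    """Map intrinsics from raw-image pixel coordinates to the coordinates of a
    crop starting at crop_xy with size crop_wh, resized to out_wh.
    Sketch only; the repo's parsing code is the reference."""
    sx = out_wh[0] / crop_wh[0]
    sy = out_wh[1] / crop_wh[1]
    fx, fy = focal * sx, focal * sy
    px = (ppoint[0] - crop_xy[0]) * sx
    py = (ppoint[1] - crop_xy[1]) * sy
    return (fx, fy), (px, py)

# A 512x512 crop at (100, 60) resized to 256x256 halves the focal length and
# moves/scales the principal point accordingly.
print(crop_and_scale_intrinsics(1000.0, (320.0, 240.0), (100, 60), (512, 512), (256, 256)))
```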

I believe the scaling of the scale by 10x is not correct and should be removed.

gengshan-y commented 3 years ago

I pushed a modified script for synthetic data generation, which should save camera files as well. Please let me know how it goes.

ecmjohnson commented 3 years ago

Thanks for adding those camera files. The files are generated and read without any issue. I think the 10x factor might be related to the focal length for the synthetic spot scene being 10, combined with the scaling due to cropping overwriting the file's focal length rather than modifying it. (The focal lengths were set as cam[0]=1./alp and I think it should be cam[0]/=alp, with the 10x factor in mesh_net.py removed.)
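A tiny numeric illustration of why that overwrite could explain a hard-coded 10x factor (the variable names follow the comment above; the crop scale value is made up):

```python
focal_raw = 10.0   # synthetic spot focal length
alp = 0.5          # crop/resize scale factor (assumed value)

cam0_overwrite = 1.0 / alp        # what cam[0] = 1./alp produces -> 2.0
cam0_rescaled = focal_raw / alp   # what cam[0] /= alp would produce -> 20.0

# The two differ by exactly the raw focal length (10), i.e. the 10x factor.
print(cam0_overwrite, cam0_rescaled)
```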

When training, there appear to be some issues. Replacing the predictions with the GT in this if opts.use_gtpose statement did not result in matching shapes, so I repeated the tensor to make the shapes match, which seemed to be the correct operation for the data. This was the only runtime error I encountered with --use_gtpose; however, the result does not appear to be correct. The network never manages to overfit the scale parameter (the output always settles at 1e-12) and the depth also appears to be incorrect. Additionally, I tried overriding these incorrect predictions in predictor.py; however, the results do not appear to be correct either way (see below), so I suspect there is some error in how I am overriding the camera parameters.

I am using the ppoint computed in the code, which is a bit under 10 for all frames, not zero as you mentioned above. Should I set the principal point to be zero for the synthetic spot dataset?

The patch of my changes (I left the print statements to match logs below)

The output logs running spot3 with use_gtpose for the patch above

The extraction output running with the patch above using --use_gtpose option:

[image: spot3_use_gtpose]

The extraction output when overriding the scale and depth predictions with GT values (10 and 10) in predictor.py:

[image: spot3_use_gtpose]

gengshan-y commented 3 years ago

Hi, could you try modifying the following arguments in scripts/spot3.sh?

--use_gtpose
--n_bones 1 
--n_hypo 1

This works for me.

gengshan-y commented 3 years ago

I pushed a new commit, please check out this additional note.

ecmjohnson commented 3 years ago

Excellent! That does result in a much more convincing cow, although not quite as good as without GT camera parameters (due to the use of a single bone?).

spot3-gtcam

Does that mean that if I use my own data with GT camera parameters I will be constrained to a single bone? This would probably cause issues for articulating objects, and I don't understand why the use of GT camera parameters constrains the number of bones and the number of symmetry-plane hypotheses. Also, would I only use the GT poses for the first optimization call, like in `spot3-gtcam.sh`?

gengshan-y commented 3 years ago

The problem is likely due to the small number of views (this example only uses 3 views) and the lack of a smoothness constraint. The shape should be better if you increase it to ~20 views or use a larger weight for Laplacian smoothness.

It should be easy to modify the use_gtpose option to be compatible with more than one bone. My guess is that your previous modification repeated the camera predictions in the wrong way. For instance, you only want to set the root body pose to the ground-truth pose, but not the bone transformations, for which ground truth is not available. You would only want to repeat along the hypothesis axis so that the shapes match.
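A rough sketch of that idea (the tensor names and shapes are assumptions, not the repo's actual variables): only bone 0, the root, is overridden with ground truth, repeated along the hypothesis axis; the remaining bone transformations keep their predictions.

```python
import torch

def override_root_pose(pred_rot, pred_trans, gt_rot, gt_trans, n_hypo):
    # pred_rot / pred_trans: predictions of shape (B * n_hypo, n_bones, ...)
    # gt_rot  / gt_trans:    ground-truth root pose of shape (B, ...)
    gt_rot = gt_rot.repeat_interleave(n_hypo, dim=0)      # repeat along the hypothesis axis
    gt_trans = gt_trans.repeat_interleave(n_hypo, dim=0)
    pred_rot = pred_rot.clone()
    pred_trans = pred_trans.clone()
    pred_rot[:, 0] = gt_rot        # root rotation    <- GT
    pred_trans[:, 0] = gt_trans    # root translation <- GT
    return pred_rot, pred_trans    # bones 1..n_bones-1 are left as predicted
```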

Again, the use_gtpose option is not supported by this repo and I don't have cycles to push a commit that fully addresses the problem. But feel free to follow up if you want to continue on it.

ecmjohnson commented 2 years ago

Sorry to return to this after such an extensive absence, but I was wondering if you could clarify for me why we only pass --use_gtpose to the first optimization call for the spot sequence? How would this scale to the more complex datasets using 5 optimization steps (e.g. --use_gtpose for the first 4 optimizations and then --nouse_gtpose for the last)?

Also, should the produced per-frame cam*.txt files match those provided (if the format is converted of course)? For the spot dataset this doesn't appear to be the case.

gengshan-y commented 2 years ago

The pose CNN is trained with the provided camera poses in the first optimization, and that's why it still works without passing --use_gtpose to the 2nd optimization. If it's desirable to always use GT cameras (e.g., the cameras are very accurate), you could pass --use_gtpose to all optimizations (I haven't tested).

Can you elaborate on the second point? The result of which optimization does not match? Is it off a lot?

ecmjohnson commented 2 years ago

I was just wondering why the output cam*.txt files (produced by the extract.sh script) always contain no rotation or translation. I can see where that happens in the code, but I don't understand the reason. Is the root rotation and translation removed from the meshes when they are extracted (i.e., they are aligned with the camera)?

ecmjohnson commented 2 years ago

An additional question about those output cam*.txt files: Why are the written out focal length and principal point scaled by 128? Should I not be looking at these files as the final optimized cameras?

gengshan-y commented 2 years ago

Is the root rotation and translation removed from the meshes when they are extracted (i.e., they are aligned with the camera)?

Your understanding is correct.

The values written to cam*.txt should correspond to the focal length and principal point in pixel units (before cropping to the object bounding box). If I remember correctly, the focal length predicted by the network corresponds to the true value (before cropping) but scaled by 1/128.
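So, taking the 1/128 factor at face value (this is a recollection from the comment above, not something verified against the code), recovering the pixel-unit focal length from the network prediction would just be a multiplication:

```python
def predicted_to_pixel_focal(pred_focal, scale=128.0):
    # Assumed relationship from the comment above: the network predicts
    # focal/128, so the value written to cam*.txt is prediction * 128.
    return pred_focal * scale

print(predicted_to_pixel_focal(0.078125))  # -> 10.0, the synthetic spot focal length
```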