nianticlabs / simplerecon

[ECCV 2022] SimpleRecon: 3D Reconstruction Without 3D Convolutions

Cannot reproduce the results of paper using this training script #15

Closed AppleAndBanana closed 1 year ago

AppleAndBanana commented 1 year ago

Hi, this work is very interesting and amazing. However, when I downloaded the ScanNet dataset and tried to train my own model from scratch following the method described in the readme, I found I could not reproduce the accuracy results of your paper (or your hero_model). I used 4x 3090 GPUs instead of A100s, with batch_size=8 on each GPU (by default, this training script uses 2 A100s with batch_size=16 on each GPU), and I did not modify any other options. After training, I get this result:

"metrics_string": "abs_diff abs_rel sq_rel rmse rmse_log a5 a10 a25 a0 a1 a2 a3 model_time ", "scores_string": "0.1534, 0.0842, 0.0288, 0.2145, 0.1088, 45.6871, 71.1059, 93.1222, 71.1059, 93.1222, 98.7887, 99.7510, 118.0729, "

Are there any modifications I need to make to the training scripts?

mohammed-amr commented 1 year ago

Hello,

The hero_model in the repo is the same one used for numbers and visualizations for the paper.

Can you post your train and val loss plots? Have you used the data split files directly from the repo?

AppleAndBanana commented 1 year ago

> Hello,
>
> The hero_model in the repo is the same one used for numbers and visualizations for the paper.
>
> Can you post your train and val loss plots? Have you used the data split files directly from the repo?

Thanks for your reply. Here are my train and val loss plots. Since NaN losses often occurred during training, I had to resume the experiment from about 56k iterations and stop it at about 100k iterations, so there are two loss curves in the plots (orange line: from scratch, blue line: resumed run). [train and val loss plot images attached]

For the data split files: since my ScanNet dataset has different file names (see the attached screenshot), I regenerated 'train_eight_view_deepvmvs.txt', 'val_eight_view_deepvmvs.txt', and 'test_eight_view_deepvmvs.txt' with 'generate_train_tuples.py' and 'generate_test_tuples.py'.
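
Roughly speaking, the regeneration looks like this (flag names are approximate and written from memory; the exact interface is in the repo's data_scripts folder):

```bash
# Approximate invocations -- flag names may differ slightly from the repo.
python data_scripts/generate_train_tuples.py \
    --data_config configs/data/scannet_default_train.yaml

python data_scripts/generate_test_tuples.py \
    --data_config configs/data/scannet_default_test.yaml
```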

mohammed-amr commented 1 year ago

Looks like you've had a NaN in your training, evident from the spiked losses at ~65k.

What images are you using for training? Have you run the scripts we provided to downscale images? Are you using the exported jpegs from the ScanNet dataset directly? Have you preprocessed them beforehand? Which intrinsics files are you using?

mohammed-amr commented 1 year ago

There are a few reasons that could cause this, and known remedies.

Are you generating the val files using generate_train_tuples.py btw? If not, then it's likely using test style tuples which haven't been shuffled.

AppleAndBanana commented 1 year ago

Since I had already downloaded the dataset using the official ScanNet scripts, I did not use the scripts you provided to download it again.

I extracted the images, depths, and poses from the xxx.sens files with the official ScanNet scripts, copied the .txt and _vh_clean_2.ply files alongside them, and reorganized everything into the same file structure as described in your readme. In this process, I used the exported jpegs from the ScanNet dataset directly, without downscaling them or doing any other preprocessing (so the RGB size is 1296x968 or similar, and the depth size is 640x480).

The intrinsics files I used are the original sceneid_xxx.txt files downloaded by the download scripts, without any changes.

After preparing all of this data, I used generate_train_tuples.py to generate the train and val files and generate_test_tuples.py to generate the test files.
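
For example, a quick size check on one exported frame could look like this (the paths below are just placeholders for my local layout, not the repo's expected structure):

```python
from PIL import Image

# Placeholder paths for one exported frame in my local ScanNet copy.
color = Image.open("scans/scene0000_00/color/000000.jpg")
depth = Image.open("scans/scene0000_00/depth/000000.png")

# PIL reports (width, height): roughly (1296, 968) for RGB and (640, 480) for depth.
print("color size:", color.size)
print("depth size:", depth.size)
```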

mohammed-amr commented 1 year ago

Thanks for the details! More questions to hammer down the problem:

Are you using the train config from the repo directly? Can you share the command you used to fire training? Can you share the images from the train log? Can you share your tuple files?

AppleAndBanana commented 1 year ago

> Thanks for the details! More questions to hammer down the problem:
>
> Are you using the train config from the repo directly? Can you share the command you used to fire training? Can you share the images from the train log? Can you share your tuple files?

Well, I use the train config from the repo with only small changes to the GPU count and iteration settings. Here is my command (for 4x 3090):

```bash
source activate simplerecon
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 train.py \
    --name HERO_MODEL \
    --log_dir train_results/exp_29/ \
    --config_file configs/models/hero_model.yaml \
    --data_config configs/data/scannet_default_train.yaml \
    --gpus 4 \
    --batch_size 8 \
    --val_batch_size 16 \
    --image_width 512 \
    --image_height 384 \
    --num_workers 8 \
    --lr 0.0001 \
    --max_steps 100000
# lr_steps: [70000, 80000]
```

The images from the train log are shown below (version20 is the run from scratch, version24 is the resumed run): [train log images attached]

And my tuple files are here: https://drive.google.com/file/d/1-VIptqXi-zDGtZ0XW8MF2JpQRczf_Rq5/view?usp=sharing, https://drive.google.com/file/d/15wmHNvyzLg-t990VVn9x6ois1qC7Gb3n/view?usp=sharing, https://drive.google.com/file/d/1uE4eeHwH6oqCRmunokRBEWvutj8zn4tK/view?usp=sharing

mohammed-amr commented 1 year ago

Oh wow! Yeah, there is something really off with those cost volumes and the normal estimates, in both the GT and the prediction. The cost volumes look nothing like what they should. For reference, all of my runs with metadata look something like this:

[image attached]

Preds:

[image attached]

GT normals:

[image attached]

And normals from pred:

[image attached]

What are you using for your environment? Have you installed it directly from the env file we provide?

AppleAndBanana commented 1 year ago

Your metadata looks very good! It seems that my cv_min and normals data are being computed incorrectly, even for the GT.

I used conda to create the environment from the env file you provide. Since my system is Ubuntu 16.04 and some Python packages are not supported at the specified versions, I changed the versions of some packages so that the repo would run without errors or warnings.

Now I should check my environment and the code for the cost volumes and normals to find what caused the problem.

mohammed-amr commented 1 year ago

I think changing the versions of the packages is probably the problem here. I suspect this is a PyTorch problem. What versions of the libraries are you using? I'm not sure how you managed to get the 3090s to cooperate with decent drivers and CUDA on 16.04 😅
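
For example, a quick snippet to dump the relevant versions (just an illustration, not part of the repo):

```python
import torch

# Quick environment report to paste back into the thread.
print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU 0:", torch.cuda.get_device_name(0))
```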

mohammed-amr commented 1 year ago

This might be an indexing problem from deprecated torch functionality (if the version of torch you're using is very different), and that would explain the repeating patterns. We use meshgrid extensively for all backprojection operations.

Or this could be a bad case of incompatible CUDA/torch for the OS and GPUs you're using. List the packages you changed if you can.
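
To illustrate the indexing pitfall (a standalone toy example, not the repo's code): the backprojection pixel grid needs Cartesian ('xy') ordering, which on older torch versions has to be emulated by swapping the arguments and outputs of the default matrix-style ('ij') call.

```python
import torch

H, W = 3, 4  # toy image size

# torch >= 1.10: Cartesian indexing, grids of shape (H, W)
xx, yy = torch.meshgrid(torch.arange(W), torch.arange(H), indexing="xy")

# Older torch (only matrix-style 'ij' available): pass (height, width)
# and swap the returned grids to get the same result.
yy_old, xx_old = torch.meshgrid(torch.arange(H), torch.arange(W))

assert torch.equal(xx, xx_old) and torch.equal(yy, yy_old)
# xx[i, j] == j (pixel x-coordinate), yy[i, j] == i (pixel y-coordinate)
```

Dropping the indexing argument without also swapping the arguments and outputs silently transposes the grid, which is exactly the kind of thing that scrambles a cost volume.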

AppleAndBanana commented 1 year ago

> This might be an indexing problem from deprecated torch functionality (if the version of torch you're using is very different), and that would explain the repeating patterns. We use meshgrid extensively for all backprojection operations.
>
> Or this could be a bad case of incompatible CUDA/torch for the OS and GPUs you're using. List the packages you changed if you can.

Yes, you are right! I checked the torch.meshgrid() call in the BackprojectDepth class and found that it behaves differently across torch versions, so I changed the code like this:

```python
from packaging import version
......
# Build the pixel grid with Cartesian ('xy') ordering on any torch version.
xx, yy = None, None
if version.parse(torch.__version__) >= version.parse('1.10.0'):
    # torch >= 1.10 supports the explicit indexing argument.
    xx, yy = torch.meshgrid(
        torch.arange(self.width),
        torch.arange(self.height),
        indexing='xy',
    )
else:
    # Older torch only does matrix ('ij') indexing, so pass (height, width)
    # and swap the outputs to get the same 'xy' grids.
    yy, xx = torch.meshgrid(
        torch.arange(self.height),
        torch.arange(self.width),
    )
......
```

and then my cost volume images and normal images in the train log look reasonable: [cost volume and normal images attached]

Now I will retrain my model with this code. Thanks for your help!

mohammed-amr commented 1 year ago

Eyyyyy! Excellent! Glad it worked. Let me know what it ends up doing.

AppleAndBanana commented 1 year ago

It works! Now I get reasonable results like this after training:

[reconstruction result images attached]

And the model's test scores are:

All frames average:

| abs_diff | abs_rel | sq_rel | rmse | rmse_log | a5 | a10 | a25 | a0 | a1 | a2 | a3 | model_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0891 | 0.0437 | 0.0127 | 0.1476 | 0.0677 | 72.8196 | 90.4030 | 98.1138 | 90.4030 | 98.1138 | 99.5454 | 99.8388 | 189.6376 |

All scenes average:

| abs_diff | abs_rel | sq_rel | rmse | rmse_log | a5 | a10 | a25 | a0 | a1 | a2 | a3 | model_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0926 | 0.0450 | 0.0136 | 0.1533 | 0.0693 | 71.9483 | 89.8246 | 97.9610 | 89.8246 | 97.9610 | 99.5050 | 99.8342 | 188.2653 |

which are the same as the paper's numbers.

Thanks again for your help!

mohammed-amr commented 1 year ago

Welcome! Glad it's all good.