AppleAndBanana opened this issue (closed 1 year ago):

Hi, this work is very interesting and amazing. However, when I downloaded the ScanNet dataset and tried to train my own model from scratch using the method described in the readme, I found I could not reproduce the accuracy results of your paper (or of your hero_model). I use 4x 3090 instead of A100s, with batch_size=8 on each GPU (by default, this training script uses 2 A100s with batch_size=16 on each GPU), and I didn't modify any other options. After training, I get this result:

abs_diff  abs_rel  sq_rel  rmse    rmse_log  a5       a10      a25      a0       a1       a2       a3       model_time
0.1534    0.0842   0.0288  0.2145  0.1088    45.6871  71.1059  93.1222  71.1059  93.1222  98.7887  99.7510  118.0729

Are there any modifications I need to make to the training script?
Hello,
The hero_model in the repo is the same one used for the numbers and visualizations in the paper.
Can you post your train and val loss plots? Have you used the data split files directly from the repo?
Thanks for your reply. Here are my train and val loss plots. Since the loss kept going NaN during training, I had to resume my experiment from about 56k iterations and end it at about 100k iterations, so there are two loss curves in my plots (orange line: from scratch, blue line: resumed).
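(As an aside, and purely as an illustration rather than the repo's actual training loop: a minimal guard that skips the optimizer step when the loss goes non-finite, which can keep a run alive long enough to inspect the offending batch. The function name and usage are hypothetical.)

import torch

def step_if_finite(loss: torch.Tensor, optimizer: torch.optim.Optimizer) -> bool:
    """Backprop and step only when the loss is finite; return True if the step ran."""
    optimizer.zero_grad(set_to_none=True)
    if not torch.isfinite(loss):
        # Skip this batch instead of poisoning the weights with NaN/Inf gradients.
        return False
    loss.backward()
    optimizer.step()
    return True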
For the data split files: since my ScanNet dataset has different file names, I regenerated 'train_eight_view_deepvmvs.txt', 'val_eight_view_deepvmvs.txt' and 'test_eight_view_deepvmvs.txt' with 'generate_train_tuples.py' and 'generate_test_tuples.py'.
Looks like you've had a NaN in your training, evident from the spiked losses at ~65k.
What images are you using for training? Have you run the scripts we provided to downscale images? Are you using the exported jpegs from the ScanNet dataset directly? Have you preprocessed them beforehand? Which intrinsics files are you using?
There are a few reasons that could cause this, and known remedies.
Are you generating the val files using generate_train_tuples.py btw? If not, then it's likely using test-style tuples, which haven't been shuffled.
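(Purely to illustrate what "shuffled" means here, not what generate_train_tuples.py actually does internally: a tiny sketch that shuffles the lines of a tuple file, reusing the val filename mentioned above.)

import random

# Read the tuple list, shuffle it reproducibly, and write it back out so that
# validation batches are not long runs of consecutive frames from one scan.
with open("val_eight_view_deepvmvs.txt") as f:
    tuples = f.read().splitlines()

random.seed(0)
random.shuffle(tuples)

with open("val_eight_view_deepvmvs_shuffled.txt", "w") as f:
    f.write("\n".join(tuples) + "\n")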
Since I had already downloaded the dataset with the official ScanNet scripts, I didn't use the scripts you provided to download it again.
I extracted the images, depths, and poses from the xxx.sens files with the official ScanNet scripts, copied the .txt and _vh_clean_2.ply files alongside them, and reorganized everything into the same file structure as in your readme. In this process I used the exported JPEGs from the ScanNet dataset directly, without downscaling them or doing any other preprocessing (so the RGB size is 1296x968 or similar, and the depth size is 640x480).
The intrinsics files I used are the original sceneid_xxx.txt files downloaded by the download scripts, without any changes.
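(For reference, and as a generic sketch rather than the repo's preprocessing code: if images are resized outside the provided scripts, the pinhole intrinsics have to be scaled by the same width/height factors. The helper name and the example intrinsics values below are made up.)

import numpy as np

def scale_intrinsics(K: np.ndarray, old_wh: tuple, new_wh: tuple) -> np.ndarray:
    """Scale a 3x3 pinhole intrinsics matrix for a resize from old_wh to new_wh (width, height)."""
    sx = new_wh[0] / old_wh[0]
    sy = new_wh[1] / old_wh[1]
    K_scaled = K.astype(np.float64).copy()
    K_scaled[0, 0] *= sx  # fx
    K_scaled[0, 2] *= sx  # cx
    K_scaled[1, 1] *= sy  # fy
    K_scaled[1, 2] *= sy  # cy
    return K_scaled

# Example: a 1296x968 colour image resized to the 512x384 training resolution
# (the intrinsics values here are placeholders, not real ScanNet calibration).
K = np.array([[1170.0, 0.0, 648.0],
              [0.0, 1170.0, 484.0],
              [0.0, 0.0, 1.0]])
print(scale_intrinsics(K, (1296, 968), (512, 384)))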
After preparing all of this data, I used generate_train_tuples.py to generate the train and val files and generate_test_tuples.py to generate the test files.
Thanks for the details! More questions to hammer down the problem:
Are you using the train config from the repo directly? Can you share the command you used to fire training? Can you share the images from the train log? Can you share your tuple files?
Well, I use the train config from the repo with only small changes to the GPUs and iterations. Here is my command (for 4x 3090):
source activate simplerecon
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 train.py \
    --name HERO_MODEL \
    --log_dir train_results/exp_29/ \
    --config_file configs/models/hero_model.yaml \
    --data_config configs/data/scannet_default_train.yaml \
    --gpus 4 \
    --batch_size 8 \
    --val_batch_size 16 \
    --image_width 512 \
    --image_height 384 \
    --num_workers 8 \
    --lr 0.0001 \
    --max_steps 100000
# lr_steps: [70000, 80000]
The images from the train log are shown below (version 20 is the run from scratch, version 24 is the resumed run):
And my tuple files are here: https://drive.google.com/file/d/1-VIptqXi-zDGtZ0XW8MF2JpQRczf_Rq5/view?usp=sharing, https://drive.google.com/file/d/15wmHNvyzLg-t990VVn9x6ois1qC7Gb3n/view?usp=sharing, https://drive.google.com/file/d/1uE4eeHwH6oqCRmunokRBEWvutj8zn4tK/view?usp=sharing
Oh wow! Yeah, there is something really off with those cost volumes and the normal estimates, in both the GT and the prediction. The cost volumes look nothing like what they should be. For reference, all of my runs with metadata look something like this:
Preds:
GT normals:
And normals from pred:
What are you using for your environment? Have you installed it directly from the env file we provide?
Your metadata looks great! It seems that my cv_min and normal data are computed incorrectly, even for the GT.
I used conda to build the environment from the env file you provide. Since my system is Ubuntu 16.04 and some Python packages are unsupported at the pinned versions, I changed the versions of some packages so that the repo could run without errors or warnings.
Now I should probably check my environment and the code around the cost volumes and normals to find what caused the problem.
I think changing the package versions is probably the problem there. I suspect this is a PyTorch problem. What versions of the libraries are you using? I'm not sure how you managed to get the 3090s to cooperate with decent drivers and CUDA on 16.04 😅
This might be an indexing problem caused by deprecated torch functionality (if the version of torch you're using is very different). That might explain the repeating patterns. We use meshgrid extensively for all backprojection operations.
Or this could be a bad case of incompatible CUDA/torch for the OS and GPUs you're using. List the packages you changed if you can.
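(For listing the relevant versions, the standard torch attributes are usually enough; a quick sketch:)

import torch

# Report the torch / CUDA / cuDNN / GPU combination, which is the usual suspect
# when a model runs without errors but produces garbage values.
print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))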
Yes, you are right! I checked the torch.meshgrid() call in the BackprojectDepth class and found that it behaves differently across torch versions, so I changed the code like this:
from packaging import version
......
# torch 1.10 added an explicit `indexing` argument to torch.meshgrid.
# With indexing='xy' and (width, height) inputs, the returned grids have
# shape (height, width), matching the old branch below with swapped inputs.
xx, yy = None, None
if version.parse(torch.__version__) >= version.parse('1.10.0'):
    xx, yy = torch.meshgrid(
        torch.arange(self.width),
        torch.arange(self.height),
        indexing='xy',
    )
else:
    # Older torch only has 'ij'-style behaviour, so pass (height, width)
    # and swap the outputs.
    yy, xx = torch.meshgrid(
        torch.arange(self.height),
        torch.arange(self.width),
    )
......
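(A quick sanity check of that change on torch >= 1.10, confirming the new branch returns grids of shape (height, width) just like the old one:)

import torch

# width=4, height=3: with indexing='xy' the outputs come back as (height, width),
# matching what the old default ('ij' indexing with swapped inputs) returned.
xx, yy = torch.meshgrid(torch.arange(4), torch.arange(3), indexing='xy')
print(xx.shape, yy.shape)  # torch.Size([3, 4]) torch.Size([3, 4])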
After this change, the cost volume images and normal images in my train log look reasonable:
Now I will retrain my model with this code. Thanks for your help!
Eyyyyy! Excellent! Glad it worked. Let me know what it ends up doing.
It works! Now I get reasonable results like this after training:
And the model's test scores (all-frames average) are:

abs_diff  abs_rel  sq_rel  rmse    rmse_log  a5       a10      a25      a0       a1       a2       a3       model_time
0.0891    0.0437   0.0127  0.1476  0.0677    72.8196  90.4030  98.1138  90.4030  98.1138  99.5454  99.8388  189.6376

and (all-scenes average):

abs_diff  abs_rel  sq_rel  rmse    rmse_log  a5       a10      a25      a0       a1       a2       a3       model_time
0.0926    0.0450   0.0136  0.1533  0.0693    71.9483  89.8246  97.9610  89.8246  97.9610  99.5050  99.8342  188.2653

which are the same as the paper's numbers.
Thanks again for your help!
Welcome! Glad it's all good.