jac99 / MinkLoc3D

MinkLoc3D: Point Cloud Based Large-Scale Place Recognition
MIT License

Training collapses even with small batch sizes #2

Closed mramezani64 closed 2 years ago

mramezani64 commented 2 years ago

Hi Jacek, thank you for your contribution and for releasing MinkLoc3D's code.

I am trying to reproduce the results reported in the paper for the Oxford dataset. However, the embedding vectors collapse after a few epochs. The versions of the libraries/packages I am using are listed below; they are close to what is recommended in the README.md. I am on Ubuntu 20.04. Any insight into this issue is appreciated.

pytorch                   1.7.1           
python                    3.8.11          
minkowskiengine           0.4.3                   
pytorch-metric-learning   0.9.99                  
cudatoolkit               10.1.243
jac99 commented 2 years ago

Hi, I suspect the problem is caused by the current version of the bitarray package. As a quick fix, uninstall bitarray and install version 1.6.0 (pip install bitarray==1.6.0). Also remove the pickles with cached training triplets converted to bitarray format (training_queries_baseline_cached.pickle, training_queries_refine_cached.pickle); they'll be automatically re-created by the training code.

I've made a quick test, and training in the baseline scenario seems to work fine (python train.py --config ../config/config_baseline.txt --model_config ../models/minkloc3d.txt): the mean distance between an anchor and a positive example is less than the mean distance between an anchor and a negative example, and the batch size increases from epoch 1. The expected output after 2 epochs should look like this:

Loading preprocessed query file: /home/jkomorowsk/PycharmProjects/benchmark_datasets/training_queries_baseline_cached.pickle...                                  
21711 queries in the dataset                                                    
Model name: model_MinkFPN_GeM_20210928_1232                                                                                                                      
Model class: MinkLoc   
Total parameters: 1055713                                                       
Backbone parameters: 1055712                                                    
Aggregation parameters: 1
Model device: cuda        
  0%|                                                                                                                                     | 0/40 [00:00<?, ?it/s]

train - Mean loss: 0.282287    Avg. embedding norm: 10.6563   Triplets per batch (all/non-zero): 16.0/10.3                                                       
Pos dist (min/mean/max): 0.8154/1.1814/1.6820   Neg dist (min/mean/max): 1.0289/1.3143/1.7992                                                                    
=> Batch size increased from: 16 to 22
  2%|███                                                                                                                       | 1/40 [06:37<4:18:07, 397.13s/it]
train - Mean loss: 0.207097    Avg. embedding norm: 5.9420   Triplets per batch (all/non-zero): 22.0/12.9                                                        
Pos dist (min/mean/max): 0.5034/0.7834/1.1862   Neg dist (min/mean/max): 0.7348/0.9591/1.3075
=> Batch size increased from: 22 to 30

Let me know if this solved the problem.
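As an aside, the healthy-training criterion described above (mean anchor-positive distance below mean anchor-negative distance) is easy to verify offline. The sketch below is illustrative and not part of the MinkLoc3D repository; the function name and the toy embeddings are made up for the example.

```python
import math

# Illustrative helper (not from the repo): average Euclidean distance
# between paired embedding vectors, e.g. anchors vs. their positives.
def mean_pair_dist(embeddings_a, embeddings_b):
    return sum(math.dist(a, b) for a, b in zip(embeddings_a, embeddings_b)) / len(embeddings_a)

# Toy embeddings: training looks healthy when positives sit closer
# to the anchors than negatives do.
anchors   = [[0.0, 0.0], [1.0, 1.0]]
positives = [[0.0, 1.0], [1.0, 2.0]]
negatives = [[3.0, 0.0], [4.0, 1.0]]

pos_mean = mean_pair_dist(anchors, positives)  # 1.0
neg_mean = mean_pair_dist(anchors, negatives)  # 3.0
print("healthy:", pos_mean < neg_mean)  # prints "healthy: True"
```

If the two means converge (or the embedding norms shrink toward zero, as in the reported collapse), something is wrong with the training data or the triplet mining.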

mramezani64 commented 2 years ago


Thanks for this excellent diagnosis. It works like a charm now. It would be great to know why the two bitarray versions make such a big difference; the previous version in my environment was 2.3.3. Is there any way I can make MinkLoc3D run with a bitarray version later than 1.6?

jac99 commented 2 years ago

I don't know why the code fails with the latest bitarray package. But today we released an updated version of the MinkLoc3D code. The main changes:

- The format of the training and evaluation pickles has changed: generation is much faster, and there is no longer a dependency on the bitarray package. You'll need to delete and recreate these pickles using the scripts in the generating_queries folder.
- The code now works with the latest MinkowskiEngine 0.5.4 (and is not compatible with MinkowskiEngine 0.4.x), which is approximately 2x faster.
- It's tested with CUDA 10.2; there are some issues with CUDA 11.1.
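The pickle-deletion step can be scripted. The sketch below is not part of the repository; the dataset root path and the function name are assumptions to adjust to your own setup (only the two cache filenames come from the thread above).

```python
from pathlib import Path

# Cached training-query pickles named earlier in this thread.
CACHE_NAMES = (
    "training_queries_baseline_cached.pickle",
    "training_queries_refine_cached.pickle",
)

def remove_stale_caches(dataset_root):
    """Delete stale cached query pickles under dataset_root so the updated
    code can regenerate them in the new, bitarray-free format.
    Returns the names of the files that were removed."""
    removed = []
    for name in CACHE_NAMES:
        cached = Path(dataset_root) / name
        if cached.exists():
            cached.unlink()
            removed.append(name)
    return removed

# "benchmark_datasets" is an assumed dataset root -- change it to yours.
removed = remove_stale_caches("benchmark_datasets")
print("removed:", removed)
```

After that, rerun the generation scripts in the generating_queries folder to produce pickles in the new format.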

jac99 commented 2 years ago

The issue seems to be solved.