jac99 / MinkLoc3D

MinkLoc3D: Point Cloud Based Large-Scale Place Recognition
MIT License

Training collapses even with small batch sizes #2

Closed mramezani64 closed 2 years ago

mramezani64 commented 2 years ago

Hi Jacek, thank you for your contribution and for releasing MinkLoc3D's code.

I am trying to reproduce the results reported in the paper for the Oxford dataset. However, the embedding vectors collapse after a few epochs. The versions of the libraries/packages I am using are listed below; they are close to what is recommended in the README.md. I am on Ubuntu 20.04. Any insight into this issue is appreciated.

pytorch                   1.7.1           
python                    3.8.11          
minkowskiengine           0.4.3                   
pytorch-metric-learning   0.9.99                  
cudatoolkit               10.1.243
jac99 commented 2 years ago

Hi, I suspect the problem is caused by the current version of the bitarray package. As a quick fix, uninstall bitarray and install version 1.6.0 (pip install bitarray==1.6.0). Also remove the pickles with cached training triplets converted to bitarray format (training_queries_baseline_cached.pickle, training_queries_refine_cached.pickle); they'll be automatically re-created by the training code.

I've made a quick test, and training in the baseline scenario seems to work fine (python train.py --config ../config/config_baseline.txt --model_config ../models/minkloc3d.txt): the mean distance between an anchor and a positive example is less than the mean distance between an anchor and a negative example, and the batch size increases from epoch 1. The expected output after 2 epochs should look like this:

Loading preprocessed query file: /home/jkomorowsk/PycharmProjects/benchmark_datasets/training_queries_baseline_cached.pickle...                                  
21711 queries in the dataset                                                    
Model name: model_MinkFPN_GeM_20210928_1232                                                                                                                      
Model class: MinkLoc   
Total parameters: 1055713                                                       
Backbone parameters: 1055712                                                    
Aggregation parameters: 1
Model device: cuda        
  0%|                                                                                                                                     | 0/40 [00:00<?, ?it/s]

train - Mean loss: 0.282287    Avg. embedding norm: 10.6563   Triplets per batch (all/non-zero): 16.0/10.3                                                       
Pos dist (min/mean/max): 0.8154/1.1814/1.6820   Neg dist (min/mean/max): 1.0289/1.3143/1.7992                                                                    
=> Batch size increased from: 16 to 22
  2%|███                                                                                                                       | 1/40 [06:37<4:18:07, 397.13s/it]
train - Mean loss: 0.207097    Avg. embedding norm: 5.9420   Triplets per batch (all/non-zero): 22.0/12.9                                                        
Pos dist (min/mean/max): 0.5034/0.7834/1.1862   Neg dist (min/mean/max): 0.7348/0.9591/1.3075
=> Batch size increased from: 22 to 30

Let me know if this solved the problem.
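As an aside, the healthy-training criterion described above (mean anchor-positive distance below mean anchor-negative distance) is easy to verify offline. The sketch below is illustrative and not part of the MinkLoc3D repository; the function name and the toy embeddings are made up for the example.

```python
import math

# Illustrative helper (not from the repo): average Euclidean distance
# between paired embedding vectors, e.g. anchors vs. their positives.
def mean_pair_dist(embeddings_a, embeddings_b):
    return sum(math.dist(a, b) for a, b in zip(embeddings_a, embeddings_b)) / len(embeddings_a)

# Toy embeddings: training looks healthy when positives sit closer
# to the anchors than negatives do.
anchors   = [[0.0, 0.0], [1.0, 1.0]]
positives = [[0.0, 1.0], [1.0, 2.0]]
negatives = [[3.0, 0.0], [4.0, 1.0]]

pos_mean = mean_pair_dist(anchors, positives)  # 1.0
neg_mean = mean_pair_dist(anchors, negatives)  # 3.0
print("healthy:", pos_mean < neg_mean)  # prints "healthy: True"
```

If the two means converge (or the embedding norms shrink toward zero, as in the reported collapse), something is wrong with the training data or the triplet mining.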

mramezani64 commented 2 years ago


Thanks for this excellent diagnosis. It works like a charm now. It would be great to know why the two bitarray versions make such a big difference; the previous version in my environment was 2.3.3. Is there any way I can make MinkLoc3D run with a bitarray version later than 1.6?

jac99 commented 2 years ago

I don't know why the code fails with the latest bitarray package. But today we released an updated version of the MinkLoc3D code. The main changes:

- The format of the training and evaluation pickles has changed: generation is much faster, and there is no longer a dependency on the bitarray package. You'll need to delete and recreate these pickles using the scripts in the generating_queries folder.
- The code now works with the latest MinkowskiEngine 0.5.4 (and is not compatible with MinkowskiEngine 0.4.x), which is approximately 2x faster.
- It's tested with CUDA 10.2; there are some issues with CUDA 11.1.
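The pickle-deletion step can be scripted. The sketch below is not part of the repository; the dataset root path and the function name are assumptions to adjust to your own setup (only the two cache filenames come from the thread above).

```python
from pathlib import Path

# Cached training-query pickles named earlier in this thread.
CACHE_NAMES = (
    "training_queries_baseline_cached.pickle",
    "training_queries_refine_cached.pickle",
)

def remove_stale_caches(dataset_root):
    """Delete stale cached query pickles under dataset_root so the updated
    code can regenerate them in the new, bitarray-free format.
    Returns the names of the files that were removed."""
    removed = []
    for name in CACHE_NAMES:
        cached = Path(dataset_root) / name
        if cached.exists():
            cached.unlink()
            removed.append(name)
    return removed

# "benchmark_datasets" is an assumed dataset root -- change it to yours.
removed = remove_stale_caches("benchmark_datasets")
print("removed:", removed)
```

After that, rerun the generation scripts in the generating_queries folder to produce pickles in the new format.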

jac99 commented 2 years ago

The issue seems to be solved.