samleoqh / MSCG-Net

Multi-view Self-Constructing Graph Convolutional Networks with Adaptive Class Weighting Loss for Semantic Segmentation
MIT License

Training: failing after 1 epoch #19

Closed czarmanu closed 2 years ago

czarmanu commented 2 years ago

PYTHONPATH={PWD}/..:${PYTHONPATH} CUDA_VISIBLE_DEVICES=0 python3 ./tools/train_R101.py
train set ------- 12901
val set --------- 4431
---curr_iter: 0, numiter per epoch: 1843---
[epoch 1], [iter 100 / 1843], [loss 1.57985, aux 0.02329, cls 0.00000], [lr 0.0001257769], [time 143.065]
[epoch 1], [iter 200 / 1843], [loss 1.42913, aux 0.02067, cls 0.00000], [lr 0.0001257769], [time 137.985]
[epoch 1], [iter 300 / 1843], [loss 1.37367, aux 0.01925, cls 0.00000], [lr 0.0001257769], [time 129.393]
[epoch 1], [iter 400 / 1843], [loss 1.32586, aux 0.01855, cls 0.00000], [lr 0.0001257769], [time 127.114]
[epoch 1], [iter 500 / 1843], [loss 1.29186, aux 0.01806, cls 0.00000], [lr 0.0001257769], [time 128.600]
[epoch 1], [iter 600 / 1843], [loss 1.27088, aux 0.01794, cls 0.00000], [lr 0.0001257769], [time 128.882]
[epoch 1], [iter 700 / 1843], [loss 1.24876, aux 0.01756, cls 0.00000], [lr 0.0001257769], [time 127.767]
[epoch 1], [iter 800 / 1843], [loss 1.22931, aux 0.01727, cls 0.00000], [lr 0.0001257769], [time 127.664]
[epoch 1], [iter 900 / 1843], [loss 1.21001, aux 0.01702, cls 0.00000], [lr 0.0001257769], [time 126.604]
[epoch 1], [iter 1000 / 1843], [loss 1.19872, aux 0.01680, cls 0.00000], [lr 0.0001257769], [time 127.504]
[epoch 1], [iter 1100 / 1843], [loss 1.18356, aux 0.01665, cls 0.00000], [lr 0.0001257769], [time 127.555]
[epoch 1], [iter 1200 / 1843], [loss 1.17377, aux 0.01643, cls 0.00000], [lr 0.0001257769], [time 127.067]
[epoch 1], [iter 1300 / 1843], [loss 1.16567, aux 0.01621, cls 0.00000], [lr 0.0001257769], [time 126.639]
[epoch 1], [iter 1400 / 1843], [loss 1.15765, aux 0.01598, cls 0.00000], [lr 0.0001257769], [time 127.105]
[epoch 1], [iter 1500 / 1843], [loss 1.14973, aux 0.01582, cls 0.00000], [lr 0.0001257769], [time 127.397]
[epoch 1], [iter 1600 / 1843], [loss 1.14039, aux 0.01567, cls 0.00000], [lr 0.0001257769], [time 126.448]
[epoch 1], [iter 1700 / 1843], [loss 1.13372, aux 0.01560, cls 0.00000], [lr 0.0001257769], [time 126.950]
[epoch 1], [iter 1800 / 1843], [loss 1.12679, aux 0.01549, cls 0.00000], [lr 0.0001257769], [time 126.361]
[ WARN:0@2483.905] global /io/opencv/modules/imgcodecs/src/loadsave.cpp (239) findDecoder imread('/home/pf/pfstaff/projects/mtom/mTom_Sat_Data_Fusion/Comparison/supervised/Agriculture-Vision/val/gt/X3ZIXRGRL_2736-1806-3248-2318.png'): can't open/read file: check file path/integrity
Traceback (most recent call last):
  File "./tools/train_R101.py", line 252, in <module>
    main()
  File "./tools/train_R101.py", line 146, in main
    validate(net, val_set, val_loader, criterion, optimizer, start_epoch + new_ep, new_ep)
  File "./tools/train_R101.py", line 157, in validate
    for vi, (inputs, gts) in enumerate(val_loader):
  File "/scratch/manu/MSCG-Net-master/venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/scratch/manu/MSCG-Net-master/venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/scratch/manu/MSCG-Net-master/venv/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scratch/manu/MSCG-Net-master/venv/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scratch/manu/MSCG-Net-master_selftrained/data/AgricultureVision/loader.py", line 53, in __getitem__
    label = imload(self.mask_files[idx], gray=True, scale_rate=self.scale)
  File "/scratch/manu/MSCG-Net-master_selftrained/data/augmt.py", line 59, in imload
    image = np.asarray(image, dtype='uint8')
  File "/scratch/manu/MSCG-Net-master/venv/lib/python3.6/site-packages/numpy/core/numeric.py", line 501, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

samleoqh commented 2 years ago

Make sure this file exists: ./val/gt/X3ZIXRGRL_2736-1806-3248-2318.png
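
For context on why a missing file surfaces as a TypeError: OpenCV's imread returns None instead of raising when it cannot open a file, so the failure only shows up later, when np.asarray tries to convert None. A minimal sketch of a guard that fails earlier and names the path is shown here. The helper name require_image is hypothetical and is not part of MSCG-Net.

```python
def require_image(image, path):
    """Return image unchanged, or raise if the reader produced None.

    Hypothetical helper (not from the repository): meant to wrap the
    result of a reader such as cv2.imread, which returns None on failure.
    """
    if image is None:
        raise FileNotFoundError("could not open/read image: {}".format(path))
    return image
```

Wrapping the read inside imload, for example as require_image(cv2.imread(path), path), would presumably turn the TypeError above into an error that points directly at the missing gt file.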

czarmanu commented 2 years ago

It does not exist. But shouldn't it be generated automatically when train_R101.py runs for the first time?

czarmanu commented 2 years ago

A lot of other gt images are created when the train script runs for the first time. However, the file above is not among them.

samleoqh commented 2 years ago

Yes, the /val/gt and /train/gt folders are generated only on the first run. Please double-check whether a file matching the name pattern "X3ZIXRGRL_2736-1806-3248-2318" exists in the val/labels folder, because the gt files are created with the same names as their original labels. If the label file exists but the gt file does not, something probably went wrong or was skipped while the gt files were being generated. You may need to delete the whole val/gt folder and then run the training again; val/gt will be regenerated during training.
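
The check and reset described above could be sketched roughly as follows. This is only an illustration: the function names find_label_files and reset_gt are hypothetical, and the recursive search is an assumption, since the thread does not say how files are arranged inside val/labels.

```python
import shutil
from pathlib import Path


def find_label_files(split_dir, stem):
    """Return files under <split_dir>/labels whose name starts with stem."""
    labels_dir = Path(split_dir) / "labels"
    if not labels_dir.is_dir():
        return []
    # Recursive search: the layout below labels/ is an assumption here.
    return sorted(p for p in labels_dir.rglob(stem + "*") if p.is_file())


def reset_gt(split_dir):
    """Delete <split_dir>/gt so the next training run can regenerate it.

    Returns True if a folder was removed, False if there was none.
    """
    gt_dir = Path(split_dir) / "gt"
    if gt_dir.is_dir():
        shutil.rmtree(gt_dir)
        return True
    return False
```

For example, if find_label_files("Agriculture-Vision/val", "X3ZIXRGRL_2736-1806-3248-2318") returns at least one path while the gt file is absent, calling reset_gt("Agriculture-Vision/val") and re-running training follows the advice above. Note that rmtree deletes the folder permanently.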

samleoqh commented 2 years ago

By the way, the number of files in val/images/rgb should equal the number of files in the val/gt folder; otherwise, some gt files failed to be generated.
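
One way to run this comparison before training is sketched below. It assumes that an RGB image and its gt file share the same base name and differ at most in extension, which the thread does not state explicitly; the function name missing_gt is hypothetical.

```python
from pathlib import Path


def _stems(directory):
    """Return the set of file base names (without extension) in directory."""
    directory = Path(directory)
    if not directory.is_dir():
        return set()
    return {p.stem for p in directory.iterdir() if p.is_file()}


def missing_gt(split_dir):
    """Return sorted base names that have an RGB image but no gt file.

    Assumes images/rgb and gt use matching base names (an assumption).
    """
    split_dir = Path(split_dir)
    rgb = _stems(split_dir / "images" / "rgb")
    gt = _stems(split_dir / "gt")
    return sorted(rgb - gt)
```

An empty result from missing_gt("Agriculture-Vision/val") would suggest that every RGB image has a gt counterpart; any names returned are candidates for the kind of missing file reported in this issue.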

czarmanu commented 2 years ago

Deleting the folder and re-running worked. Thanks!