Running Error - Githubissues

FengLoveBella commented 6 years ago

When I run your code, I encounter the following error:

config:Namespace(batch_size=1, checkpoint_folder='./checkpoint', cls_num=20, epochs=150, lr=1e-07, lr_steps=[10000, 20000, 30000, 40000], momentum=0.9, multigpu=False, pretrained_model='', print_freq=1, resume_model='', start_epoch=0, weight_decay=0.0005, workers=16)
('train_dataset len', 42490)
Totally new layer:score_edge_side1
Totally new layer:score_edge_side2
Totally new layer:score_edge_side3
Totally new layer:score_cls_side5
Totally new layer:ce_fusion
label_name:
label_name:
Traceback (most recent call last):
  File "/home/fengzhou/CASENet/main.py", line 129, in <module>
    main()
  File "/home/fengzhou/CASENet/main.py", line 84, in main
    global_step = model_play.train(args, train_loader, model, optimizer, epoch, curr_lr, win_feats5, win_fusion, viz, global_step)
  File "/home/fengzhou/CASENet/train_val/model_play.py", line 31, in train
    for i, (img, target) in enumerate(train_loader):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 281, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 301, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
KeyError: 'Traceback (most recent call last):\n File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 55, in _worker_loop\n samples = collate_fn([dataset[i] for i in batch_indices])\n File "/home/fengzhou/CASENet/dataloader/SBD_data.py", line 61, in __getitem__\n np_data = self.h5_f[\'data/\'+labelname.replace(\'/\', \'\').replace(\'bin\', \'npy\')]\n File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2577)\n File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2536)\n File "/usr/lib/python2.7/dist-packages/h5py/_hl/group.py", line 166, in __getitem__\n oid = h5o.open(self.id, self._e(name), lapl=self._lapl)\n File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2577)\n File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2536)\n File "h5py/h5o.pyx", line 190, in h5py.h5o.open (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/h5o.c:3407)\nKeyError: \'Unable to open object (Bad object header version number)\'\n'

Have you encountered this before? Thank you very much. @lijiaman

lijiaman commented 6 years ago

This seems to be an issue with the workers. Setting workers to 1 may fix it. (I'd suggest changing the data input to an image format instead of HDF5; multiple workers seem to cause problems with HDF5 files.)
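For readers who want to keep HDF5 and several workers, a commonly suggested workaround is to open the file lazily inside each worker process instead of once in the constructor, so that no open handle is shared across processes. The following is only a minimal sketch of that pattern and has not been verified against this repository; the class name `LazyH5Dataset` and its `h5_path` and `keys` arguments are hypothetical and are not taken from `SBD_data.py`.

```python
class LazyH5Dataset(object):
    """Map-style dataset that opens its HDF5 file on first access in each process.

    Hypothetical illustration: `h5_path` is the HDF5 file and `keys` is a list
    of dataset names inside it (for example "data/some_label").
    """

    def __init__(self, h5_path, keys):
        self.h5_path = h5_path
        self.keys = list(keys)
        # No file handle is created here, so the parent process has
        # nothing HDF5-related for the DataLoader workers to inherit.
        self._h5 = None

    def _file(self):
        # The first call in a given process opens the file read-only;
        # later calls in that process reuse the same handle.
        if self._h5 is None:
            import h5py  # imported lazily; requires the h5py package
            self._h5 = h5py.File(self.h5_path, "r")
        return self._h5

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        # Indexing an h5py dataset with [()] reads it fully into memory.
        return self._file()[self.keys[index]][()]
```

An instance of such a class can be passed to `torch.utils.data.DataLoader` like any other map-style dataset. Whether this removes the `KeyError` above depends on how the repository's dataset class opens its file, which this thread does not show.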

FengLoveBella commented 6 years ago

@lijiaman Yes, after I set workers to 1 it runs now. However, the total loss is extremely high, about 2000000, and it shows no sign of decreasing. Is this normal?
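One possible reading of the magnitude alone, which nobody in this thread confirms, is that the loss is summed over every pixel and class rather than averaged. The sketch below works through that arithmetic under loudly stated assumptions: a sigmoid cross-entropy term per pixel and class, all logits at zero, and a hypothetical 352x352 crop. Only `cls_num=20` comes from the config printed above.

```python
import math


def summed_bce_at_init(height, width, num_classes):
    """Sum of sigmoid cross-entropy terms when every logit is 0.

    With a logit of 0 the predicted probability is 0.5, so each term
    equals -log(0.5) = ln 2 whether the label is 0 or 1.
    """
    per_term = -math.log(0.5)
    return height * width * num_classes * per_term


# Assumptions: 352x352 is a hypothetical crop size; 20 is cls_num from the config.
loss = summed_bce_at_init(352, 352, 20)
print("summed loss at initialization: %.0f" % loss)
```

Under these assumptions the sum lands in the low millions, the same order of magnitude as the reported value. This would only account for the scale of the number, not for the lack of any downward trend.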

FengLoveBella commented 6 years ago

@lijiaman [Screenshot from 2018-07-17 17-32-25] I followed your code and ran into this problem. I am not sure whether it is a bug in the training dataset or in the training network. I look forward to your reply.

mengxingkong commented 5 years ago

@zhoufengbuaa Hi, I recently needed to reproduce CASENet, and when I ran this repository I hit the same problem (the learning rate is very high). Have you resolved it? I hope you can help me. Looking forward to your reply. [Screenshot from 2019-04-15 16-19-08]

shoutOutYangJie commented 5 years ago

mengxingkong wrote: "@zhoufengbuaa Hi, I recently needed to reproduce CASENet, and when I ran this repository I hit the same problem (the learning rate is very high). Have you resolved it? I hope you can help me. Looking forward to your reply."

Hi, have you tested it? How does it perform?

lijiaman / CASENet

Running Error #2