open-mmlab / mmocr

OpenMMLab Text Detection, Recognition and Understanding Toolbox
https://mmocr.readthedocs.io/en/dev-1.x/
Apache License 2.0

NaN losses for training FCENet on ICDAR2015 #332

Closed. silent357 closed this issue 3 years ago

silent357 commented 3 years ago

Hi, I am training FCENet on ICDAR2015, but all the losses come out as NaN. Training works properly on CTW1500. Is there anything wrong?

gaotongxiao commented 3 years ago

Did you reinstall mmcv after you've updated PyTorch? Also, rerunning the training process might help as sometimes bad initialization may occur.

innerlee commented 3 years ago

It seems that FCENet needs more tests

silent357 commented 3 years ago

> Did you reinstall mmcv after you've updated PyTorch? Also, rerunning the training process might help as sometimes bad initialization may occur.

Yes, I reinstalled mmcv after updating PyTorch. The NaN losses occur every time I run the FCENet training on ICDAR2015.

silent357 commented 3 years ago

> It seems that FCENet needs more tests

Maybe.

gaotongxiao commented 3 years ago

Are you using mmdet==2.13.0? We only support mmdet==2.11.0 at this time but will fix it soon.

silent357 commented 3 years ago

I changed the mmdet version to 2.11.0 and it works! Thank you!
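For readers hitting the same NaN losses, it may help to confirm the installed mmdet version before training. The following is a minimal sketch, assuming the supported version is exactly 2.11.0 as stated in this thread at the time; `parse_version`, `is_supported`, and `installed_version` are hypothetical helper names, not part of MMOCR or MMDetection.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.11.0' into (2, 11, 0); any non-numeric suffix is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_supported(installed, required="2.11.0"):
    """Assumption: only an exact match with `required` counts as supported."""
    return parse_version(installed) == parse_version(required)


def installed_version(package="mmdet"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Calling `installed_version()` and passing the result to `is_supported` gives a quick yes/no answer; a `None` result means mmdet is not installed in the current environment.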

silent357 commented 3 years ago

I have another question. While I am training an FCENet model on 2 GPUs, I cannot train other models on multiple GPUs, and sometimes I cannot test an FCENet model either. However, training another model on 1 GPU at the same time works fine. The error is as below:

[Image: error output]

innerlee commented 3 years ago

You can modify the port number at https://github.com/open-mmlab/mmocr/blob/main/tools/dist_train.sh#L13

By the way, please open a new issue for new questions.

innerlee commented 3 years ago

@gaotongxiao if you are interested, the default port number could be made random, something like $((12000 + $RANDOM % 20000))
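The random-port idea above can also be sketched in Python. This is only an illustration under stated assumptions: it draws from the same range as the shell expression (12000 to 31999) and adds a bind check that the shell one-liner does not perform. The function names are hypothetical, and how the chosen port is handed to the training launcher is not covered here.

```python
import random
import socket


def random_port(low=12000, span=20000):
    """Pick a port in [low, low + span), the range of 12000 + RANDOM % 20000."""
    return low + random.randrange(span)


def port_is_free(port, host="127.0.0.1"):
    """Return True if binding to host:port succeeds right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def pick_port(attempts=20):
    """Try a few random ports and return the first one that is free."""
    for _ in range(attempts):
        port = random_port()
        if port_is_free(port):
            return port
    raise RuntimeError("no free port found in the sampled range")
```

A free port at check time can still be taken by another process before the job starts, so this reduces collisions between concurrent distributed jobs rather than ruling them out.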

silent357 commented 3 years ago

OK. Thank you!

liangxiaoyun commented 2 years ago

When I install mmdet==2.11.0, I get an error: ModuleNotFoundError: No module named 'mmdet.datasets.api_wrappers'

gaotongxiao commented 2 years ago

@liangxiaoyun mmdet 2.11.0 has not been officially supported since MMOCR 0.2.1. Please make sure your mmdet version matches the requirement here