kernel died(notebook occupying vram that can't be released)

mobassir94 commented 3 years ago

I am facing a strange problem. It does not affect every model in this repo, but with some of them, such as dbnet and drrg, training itself works and the trouble starts when I decide partway through that I want to train another model. If I stop training by clicking "restart and clear output" or "interrupt", I get a pop-up message saying "kernel died". After that, no matter what I do, I cannot release the VRAM: nvidia-smi always shows it as in use. The only solution I have found is to reboot the system, which is painful.

How do I solve this memory management issue, where jupyter-notebook silently holds VRAM and does not free it even after the notebook is restarted? I don't want to restart my computer over and over again. Any Python code or Linux command that gets rid of this error would be highly appreciated, or could you solve the issue directly in your API? I know that using another IDE such as Spyder with a .py file instead of a notebook might solve this, but I prefer notebooks and would like to learn how to solve it there. Thank you.
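For reference, a minimal sketch of the kind of Python code asked for here. GPU memory belongs to a process, so when nvidia-smi keeps reporting usage after a kernel dies, a leftover process is commonly still holding the CUDA context, and terminating that process returns the memory without a reboot. The sketch assumes nvidia-smi is on the PATH and relies on its --query-compute-apps option; the helper names parse_compute_apps and list_gpu_processes are made up for illustration.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows into (pid, MiB) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        pid = int(parts[0])
        # Memory may be reported as "[N/A]" on some setups.
        mem = int(parts[1]) if parts[1].isdigit() else None
        rows.append((pid, mem))
    return rows


def list_gpu_processes():
    """Ask nvidia-smi which processes currently hold GPU memory."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(out)
```

Calling list_gpu_processes() returns the PIDs that still own GPU memory. A stale one can then be stopped from a terminal with kill followed by the PID, after checking that it really is the dead kernel and not another job.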

innerlee commented 3 years ago

Notebooks are good for learning and demos, but not for actual training. Please use tools/train.py to start real training.

Edit: or at least mimic the code of train.py in the notebook.

mobassir94 commented 3 years ago

I have now used a .py file for training, and I am getting a similar memory leak error as before (I use Spyder):

An error ocurred while starting the kernel
Traceback (most recent call last):
  File "/home/apsisdev/anaconda3/envs/ocr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/apsisdev/anaconda3/envs/ocr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/apsisdev/anaconda3/envs/ocr/lib/python3.8/site-packages/spyder_kernels/console/main.py", line 23, in <module>
    start.main()
  File "/home/apsisdev/anaconda3/envs/ocr/lib/python3.8/site-packages/spyder_kernels/console/start.py", line 284, in main
    kernel.initialize()
  File "/home/apsisdev/anaconda3/envs/ocr/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in inner
    return method(app, *args, **kwargs)
  File "/home/apsisdev/anaconda3/envs/ocr/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 574, in initialize
    self.init_sockets()
  File "/home/apsisdev/anaconda3/envs/ocr/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 271, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/home/apsisdev/anaconda3/envs/ocr/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 218, in _bind_socket
    return self._try_bind_socket(s, port)
  File "/home/apsisdev/anaconda3/envs/ocr/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 194, in _try_bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "/home/apsisdev/anaconda3/envs/ocr/lib/python3.8/site-packages/zmq/sugar/socket.py", line 172, in bind
    super().bind(addr)
  File "zmq/backend/cython/socket.pyx", line 540, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 28, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

I believe there is something wrong in mmocr's memory management.
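A note on the traceback: it ends in "ZMQError: Address already in use", which is raised when the kernel tries to bind a TCP port that something else still holds. That may point to a leftover kernel process rather than to GPU memory. As a rough sketch for checking this, the snippet below tests whether a given port can be bound right now. The function name port_is_free and the default host are assumptions for illustration, and the port number to check has to come from the kernel's connection settings, which are not shown in this thread.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process still holds the port.
            return False
    return True
```

If the port turns out to be busy, the process holding it is a likely candidate for the stale kernel.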

mobassir94 commented 3 years ago

Here is the code I used in a .py file instead of an ipynb notebook:


# -*- coding: utf-8 -*-
"""
Created on Thu Jun 24 15:44:41 2021

@author: apsisdev
"""

import torch
import gc
from numba import cuda 

torch.cuda.empty_cache()
gc.collect()
# Print the result; in a script a bare expression is silently discarded.
print('CUDA available:', torch.cuda.is_available())
#device = cuda.get_current_device()
#device.reset()

#!nvidia-smi
#ls "/home/apsisdev/IMPORTANT/"
''' 
#!mkdir totaltext 
#cd totaltext
#!mkdir imgs && mkdir annotations

#exact total text format : 

    ls /home/apsisdev/IMPORTANT/bn_totaltext/Annotation/groundtruth_polygonal_annotation
    !cp -r /home/apsisdev/IMPORTANT/bn_totaltext/Train /home/apsisdev/IMPORTANT/mmocr/demo/totaltext/imgs/training
    !cp -r /home/apsisdev/IMPORTANT/bn_totaltext/Test /home/apsisdev/IMPORTANT/mmocr/demo/totaltext/imgs/test
    !cp -r /home/apsisdev/IMPORTANT/bn_totaltext/Annotation/groundtruth_polygonal_annotation/Train /home/apsisdev/IMPORTANT/mmocr/demo/totaltext/annotations/training
    !cp -r /home/apsisdev/IMPORTANT/bn_totaltext/Annotation/groundtruth_polygonal_annotation/Test /home/apsisdev/IMPORTANT/mmocr/demo/totaltext/annotations/test
    !python /home/apsisdev/IMPORTANT/mmocr/tools/data/textdet/totaltext_converter.py /home/apsisdev/IMPORTANT/mmocr/demo/totaltext/ -o /home/apsisdev/IMPORTANT/mmocr/demo/totaltext/ --split-list training test

#non total text structure : 

#ls /home/apsisdev/IMPORTANT/banglaDetHor/test/
#!cp -r /home/apsisdev/IMPORTANT/banglaDetHor/train/images /home/apsisdev/IMPORTANT/mmocr/demo/totaltext/imgs/training
#!cp -r /home/apsisdev/IMPORTANT/banglaDetHor/test/images /home/apsisdev/IMPORTANT/mmocr/demo/totaltext/imgs/test
#!cp -r /home/apsisdev/IMPORTANT/banglaDetHor/train/annotations /home/apsisdev/IMPORTANT/mmocr/demo/totaltext/annotations/training
#!cp -r /home/apsisdev/IMPORTANT/banglaDetHor/test/annotations /home/apsisdev/IMPORTANT/mmocr/demo/totaltext/annotations/test
#!python /home/apsisdev/IMPORTANT/mmocr/tools/data/textdet/totaltext_converter.py /home/apsisdev/IMPORTANT/mmocr/demo/totaltext/ -o /home/apsisdev/IMPORTANT/mmocr/demo/totaltext/ --split-list training test
#!python /home/apsisdev/IMPORTANT/mmocr/tools/data/utils/txt2lmdb.py -i /home/apsisdev/IMPORTANT/banglasynth/label.txt -o /home/apsisdev/IMPORTANT/banglasynth/label.lmdb

'''

import mmcv
import matplotlib.pyplot as plt 

img = mmcv.imread('/home/apsisdev/IMPORTANT/banglasynthShort/imgs/1.png')
plt.imshow(mmcv.bgr2rgb(img))
plt.show()

#ls "/home/apsisdev/IMPORTANT/totaltext"
#ls "/home/apsisdev/IMPORTANT/mmocr/configs/textdet/"

'''
# models
1. textsnake -> /textsnake/textsnake_r50_fpn_unet_1200e_ctw1500.py
2. dbnet18 -> /dbnet/dbnet_r18_fpnc_1200e_icdar2015.py
3. dbnet50 -> /dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py
4. drrg -> /drrg/drrg_r50_fpn_unet_1200e_ctw1500.py

5. fcenet -> /fcenet/fcenet_r50dcnv2_fpn_1500e_ctw1500.py
6. fcenet -> /fcenet/fcenet_r50_fpn_1500e_icdar2015.py
7. maskrcnn -> /maskrcnn/mask_rcnn_r50_fpn_160e_ctw1500.py
8. maskrcnn -> /maskrcnn/mask_rcnn_r50_fpn_160e_icdar2015.py
9. maskrcnn -> /maskrcnn/mask_rcnn_r50_fpn_160e_icdar2017.py
10. panet -> /panet/panet_r50_fpem_ffm_600e_icdar2017.py
11. panet -> /panet/panet_r18_fpem_ffm_600e_ctw1500.py    
12. panet -> /panet/panet_r18_fpem_ffm_600e_icdar2015.py
13. psenet -> /psenet/psenet_r50_fpnf_600e_icdar2015.py
14. psenet -> /psenet/psenet_r50_fpnf_600e_ctw1500.py
15. psenet -> /psenet/psenet_r50_fpnf_600e_icdar2017.py
'''

from mmcv import Config
cfg = Config.fromfile('/home/apsisdev/IMPORTANT/mmocr/configs/textdet/maskrcnn/mask_rcnn_r50_fpn_160e_icdar2015.py') 
#cfg.data_root = "/home/apsisdev/IMPORTANT/totaltext"

#cfg.checkpoint_config.interval  = 512
#cfg.evaluation.interval = 1

from mmdet.apis import set_random_seed

# Set up working dir to save files and logs.
cfg.work_dir = '../outputs/detection_model'

cfg.data.samples_per_gpu = 8

# The original learning rate (LR) is set for 8-GPU training.
# We divide it by 8 since we only use one GPU.
cfg.optimizer.lr = 0.001 / 8
cfg.lr_config.warmup = None
# Log training results every iteration (interval = 1); raise this to shrink the log file.
cfg.log_config.interval = 1

# Set seed thus the results are more reproducible
cfg.seed = 0
set_random_seed(0, deterministic=False)
cfg.gpu_ids = range(1)

# We can initialize the logger for training and have a look
# at the final config used for training
#
#cfg.label_convertor["dict_file"]="/home/apsisdev/IMPORTANT/banglasynthShort/bangla_dict.txt"
print(f'Config:\n{cfg.pretty_text}')

#ls "/home/apsisdev/IMPORTANT/banglasynth/"
#cfg.data.train.datasets[0]

from mmocr.datasets import build_dataset
from mmocr.models import build_detector
from mmocr.apis import train_detector
import os.path as osp
cfg.total_epochs = 20
# Build dataset
datasets = [build_dataset(cfg.data.train)]

# Build the detector
model = build_detector(
    cfg.model, train_cfg=cfg.get('train_cfg'), test_cfg=cfg.get('test_cfg'))
# Add an attribute for visualization convenience
#model.CLASSES = datasets[0].CLASSES

cfg.load_from = None

# Create work_dir
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))

'''
start training..........
'''

if __name__ == '__main__': 
    train_detector(model, datasets, cfg, distributed=False, validate=True)

torch.cuda.empty_cache()
gc.collect()

innerlee commented 3 years ago

Ohh... Please run train.py in a normal terminal (unrelated to spyder). See https://mmocr.readthedocs.io/en/latest/getting_started.html#train-a-model for how to train a model.

The notebook thing is for demonstration only.

mobassir94 commented 3 years ago

Spyder is a kernel, but I am running inside: if __name__ == '__main__':

So it should solve the issue of multiprocessing, no?

innerlee commented 3 years ago

First thing to try is to open a normal terminal. Type

python tools/train.py YOUR_CONFIG_FILE --work_dir work_dirs/any_name_you_like
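For anyone who still wants to start training from a notebook cell, one option is to launch the command above as a child process, sketched below. The child owns the CUDA context, so its GPU memory is released when it exits, whatever happens to the notebook kernel. The script path and the --work_dir flag are copied from the command above; the helper names build_train_command and run_training are hypothetical, not part of mmocr.

```python
import subprocess
import sys


def build_train_command(config, work_dir, script="tools/train.py"):
    """Assemble the terminal command above as an argv list."""
    return [sys.executable, script, config, "--work_dir", work_dir]


def run_training(config, work_dir, cwd=None):
    """Run training in a separate process and return its exit code."""
    cmd = build_train_command(config, work_dir)
    # cwd should be the mmocr checkout so the relative script path resolves.
    return subprocess.run(cmd, cwd=cwd).returncode
```

Passing the mmocr checkout as cwd keeps the relative tools/train.py path valid. A non-zero return value means the training process failed or was killed.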

open-mmlab / mmocr

kernel died(notebook occupying vram that can't be released) #321