microsoft / computervision-recipes

Best Practices, code samples, and documentation for Computer Vision.
MIT License
9.54k stars 1.18k forks source link

[BUG] Cannot train object detection on CPU #618

Closed chjinche closed 4 years ago

chjinche commented 4 years ago

Description

DetectionLearner fit failed on CPU, which is incompatible with the doc description "a GPU is techically not required". See highlighted part in attached pic.

image

In which platform does it happen?

Linux. CPU

How do we replicate the issue?

Run following codes on CPU, no cuda.

from utils_cv.common.data import unzip_url
from utils_cv.detection.data import Urls as od_urls
from utils_cv.detection.dataset import DetectionDataset
from utils_cv.detection.model import (DetectionLearner, get_pretrained_fasterrcnn)

def od_detection_dataset():
    """ returns a basic detection dataset. """
    tmp_session = 'data'
    tiny_od_data_path = unzip_url(
        od_urls.fridge_objects_tiny_path,
        fpath=tmp_session,
        dest=tmp_session,
        exist_ok=True,
    )
    return DetectionDataset(tiny_od_data_path)

data = od_detection_dataset()
model = get_pretrained_fasterrcnn(
    num_classes=len(data.labels) + 1,
    min_size=100,
    max_size=200,
    rpn_pre_nms_top_n_train=500,
    rpn_pre_nms_top_n_test=250,
    rpn_post_nms_top_n_train=500,
    rpn_post_nms_top_n_test=250,
)
learner = DetectionLearner(data, model=model)
learner.fit(epochs=1)

Got such error:

Epoch: [0]  [ 0/10]  eta: 0:04:00  lr: 0.000560  loss: 1.9363 (1.9363)  loss_classifier: 1.6700 (1.6700)  loss_box_reg: 0.0109 (0.0109)  loss_objectness: 0.2313 (0.2313)  loss_rpn_box_reg: 0.0241 (0.0241)  time: 24.0076  data: 0.1438
Epoch: [0]  [ 9/10]  eta: 0:00:05  lr: 0.005000  loss: 0.1795 (0.6579)  loss_classifier: 0.0431 (0.5310)  loss_box_reg: 0.0002 (0.0013)  loss_objectness: 0.0809 (0.1081)  loss_rpn_box_reg: 0.0135 (0.0175)  time: 5.3490  data: 0.0173
Epoch: [0] Total time: 0:00:53 (5.3514 s / it)
creating index...
index created!
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=50 error=100 : no CUDA-capable device is detected
Traceback (most recent call last):
  File "bug_cpu_case.py", line 30, in <module>
    learner.fit(epochs=1)
  File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/utils_cv/detection/model.py", line 543, in fit
    e = self.evaluate(dl=self.dataset.test_dl)
  File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/utils_cv/detection/model.py", line 584, in evaluate
    self.results = evaluate(self.model, dl, device=self.device)
  File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/utils_cv/detection/references/engine.py", line 88, in evaluate
    torch.cuda.synchronize()
  File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/torch/cuda/__init__.py", line 398, in synchronize
    _lazy_init()
  File "/mnt/chjinche/miniconda3/envs/py37/lib/python3.7/site-packages/torch/cuda/__init__.py", line 193, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (100) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:50

Expected behavior (i.e. solution)

DetectionLearner fit on CPU should run successfully.

Other Comments

PatrickBue commented 4 years ago

Evaluating accuracy on the test set requires GPU, training does not. Evaluation can be switched of using detector.fit(epochs=1, skip_evaluation=skip_evaluation). See also this notebook: https://github.com/microsoft/computervision-recipes/blob/master/scenarios/detection/01_training_introduction.ipynb

chjinche commented 4 years ago

@PatrickBue Thanks for quick reply! However, skip_evaluation will make it impossible to find best model and early stop training, which are based on model perf on validation dataset. Any suggestion about this problem?

PatrickBue commented 4 years ago

Unfortunately the library pycocotools (which this repo and torchvision use) requires GPU. One way around that could be to find a library which works on CPU-only and then manually call that library to compute mAP numbers.

chjinche commented 4 years ago

I see. Thanks again for your help @PatrickBue

yvvnx commented 4 years ago

Ok.

yvvnx commented 4 years ago

Ok.