mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

[SSD] TypeError during COCOeval #540

Closed fzqneo closed 2 years ago

fzqneo commented 2 years ago

This error occurred after the first epoch training finished, when running the docker single-GPU experiment.

Traceback (most recent call last):
  File "train.py", line 266, in <module>
    main(args)
  File "train.py", line 251, in main
    coco_evaluator = evaluate(model, data_loader_test, device=device, epoch=epoch, args=args)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/single_stage_detector/ssd/engine.py", line 78, in evaluate
    coco_evaluator = CocoEvaluator(coco, iou_types)
  File "/workspace/single_stage_detector/ssd/coco_eval.py", line 28, in __init__
    self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type)
  File "/opt/conda/lib/python3.7/site-packages/pycocotools/cocoeval.py", line 76, in __init__
    self.params = Params(iouType=iouType) # parameters
  File "/opt/conda/lib/python3.7/site-packages/pycocotools/cocoeval.py", line 527, in __init__
    self.setDetParams()
  File "/opt/conda/lib/python3.7/site-packages/pycocotools/cocoeval.py", line 507, in setDetParams
    self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
  File "<__array_function__ internals>", line 6, in linspace
  File "/opt/conda/lib/python3.7/site-packages/numpy/core/function_base.py", line 120, in linspace
    num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer

emizan76 commented 2 years ago

John, we are seeing this failure in the SSD reference. I am trying to assign it to Ahmad but cannot, and I don't understand why. Any help would be appreciated.

johntran-nv commented 2 years ago

@ahmadki is going to look at this.

ahmadki commented 2 years ago

@fzqneo @emizan76

I'm unable to reproduce the bug unless I corrupt open-images-v6/validation/labels/openimages-mlperf.json. Can you re-create the file and try again? I also uploaded a copy of my files to the task force GDrive folder here if you want to try them instead.

https://github.com/mlcommons/training/pull/544 includes a simplified CocoEvaluator implementation if you want to try it, but it doesn't help with the error you are seeing.

You can try these changes without training the model. This is a minimal script to reproduce the error (run it from within the SSD folder):

import torch
import torch.utils.data

import presets
from coco_utils import get_coco_api_from_dataset, get_openimages
from coco_eval import DefaultCocoEvaluator
import utils

DATASET_PATH = "/datasets/open-images-v6"  # <== change this to your dataset root if needed
BATCH_SIZE = 32

dataset_test = get_openimages(name="openimages-mlperf",
                              root=DATASET_PATH,
                              image_set="val",
                              transforms=presets.DetectionPresetEval())
test_sampler = torch.utils.data.SequentialSampler(dataset_test)
data_loader_test = torch.utils.data.DataLoader(
    dataset_test,
    batch_size=BATCH_SIZE,
    sampler=test_sampler,
    num_workers=4,
    pin_memory=True,
    collate_fn=utils.collate_fn)

coco = get_coco_api_from_dataset(data_loader_test.dataset)
coco_evaluator = DefaultCocoEvaluator(coco, ["bbox"])

Please let me know how it goes.

Thank you

fujitsu-notsu commented 2 years ago

I encountered the same error and resolved it by upgrading pycocotools from 2.0.0 to 2.0.4 in requirements.txt.

The cause of the error is the following line in pycocotools 2.0.0:

self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)

The third argument, np.round(...) + 1, is an np.float64 (the type named in the traceback), whereas np.linspace requires an integer there, so the call raises an exception because of the type mismatch.

Up to numpy version 1.17, np.linspace only emitted a warning in this situation; since version 1.18 it raises an exception. This is why some people cannot reproduce the error.
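To make the type mismatch concrete, here is a minimal sketch that evaluates the same expression outside pycocotools. The function name iou_thresholds is hypothetical, and the int(...) cast is only one plausible way to avoid the error; it is not confirmed here to be the exact change shipped in pycocotools 2.0.4.

```python
import numpy as np


def iou_thresholds():
    """Build the IoU threshold array with an explicit integer count."""
    # Same arithmetic as the failing line: np.round returns a NumPy float,
    # so the sample count is np.float64(10.0), not the int 10.
    raw_num = np.round((0.95 - 0.5) / 0.05) + 1
    # Assumed workaround: cast to a Python int before calling np.linspace.
    thresholds = np.linspace(0.5, 0.95, int(raw_num), endpoint=True)
    return raw_num, thresholds


raw_num, thresholds = iou_thresholds()
print(type(raw_num).__name__, len(thresholds))
```

Passing raw_num directly as the third argument is what triggers the TypeError on the numpy versions described above.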

ahmadki commented 2 years ago

Thank you @fujitsu-notsu for helping identify the issue. I was able to reproduce the error after updating my numpy version. Updating pycocotools to 2.0.4 solved the error afterwards.

pycocotools 2.0.4 was released after the benchmark code had been finalized. I'll make sure to update the dependencies for v2.1.
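For anyone checking whether their environment is likely to hit this, the following is a rough sketch based only on the version boundaries mentioned in this thread (numpy 1.18 or newer together with pycocotools older than 2.0.4). The helper names are hypothetical, and importlib.metadata assumes Python 3.8 or newer, which is newer than the Python 3.7 container in the traceback.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '1.18.0rc1' into (1, 18, 0) by keeping the leading digits of each part."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def likely_affected(numpy_version, pycocotools_version):
    """Apply the boundaries reported in this thread; not an official compatibility matrix."""
    return (version_tuple(numpy_version) >= (1, 18)
            and version_tuple(pycocotools_version) < (2, 0, 4))


numpy_version = installed_version("numpy")
coco_version = installed_version("pycocotools")
if numpy_version and coco_version:
    print("likely affected:", likely_affected(numpy_version, coco_version))
else:
    print("numpy or pycocotools is not installed in this environment")
```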

fzqneo commented 2 years ago

Thanks for tracing the source of the problem, @fujitsu-notsu. Updating to pycocotools 2.0.4 fixed the problem for me.

johntran-nv commented 2 years ago

Closing as resolved.