Closed fzqneo closed 2 years ago
John, we are seeing this failure in the SSD reference. I am trying to assign it to Ahmad but cannot (I don't understand why). Any help would be appreciated.
@ahmadki is going to look at this.
@fzqneo @emizan76
I'm unable to reproduce the bug unless I corrupt open-images-v6/validation/labels/openimages-mlperf.json
Can you re-create the file and try again? I also uploaded a copy of my files to the task force GDrive folder if you want to try them instead.
https://github.com/mlcommons/training/pull/544 includes a simplified CocoEvaluator implementation if you want to try it, but it doesn't help with the error you are seeing.
You can try these changes without training the model. This is minimal code to reproduce (run from within the SSD folder):
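One quick way to rule out a corrupted annotation file is to try parsing it. The following is only a minimal sketch: the helper name is hypothetical (it is not part of the SSD reference), and the path assumes the dataset root is /datasets/open-images-v6.

```python
import json


def annotation_file_is_readable(path):
    """Return True if `path` exists and parses as JSON (hypothetical helper)."""
    try:
        with open(path, "r") as f:
            data = json.load(f)
    except (OSError, ValueError) as exc:
        # OSError: missing or unreadable file; ValueError: invalid JSON.
        print(f"Cannot read {path}: {exc}")
        return False
    # Report top-level keys so a truncated or unexpected file is easy to spot.
    if isinstance(data, dict):
        print("Top-level keys:", sorted(data.keys()))
    return True


if __name__ == "__main__":
    # Assumed location; adjust the dataset root to your setup.
    annotation_file_is_readable(
        "/datasets/open-images-v6/validation/labels/openimages-mlperf.json"
    )
```

A file that parses can still have the wrong contents, so this only catches truncation or syntax damage.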
import torch
import torch.utils.data

import presets
from coco_utils import get_coco_api_from_dataset, get_openimages
from coco_eval import DefaultCocoEvaluator
import utils

DATASET_PATH = "/datasets/open-images-v6"  # <== change to your dataset location
BATCH_SIZE = 32

# Build the validation dataset with the evaluation transforms.
dataset_test = get_openimages(name="openimages-mlperf",
                              root=DATASET_PATH,
                              image_set="val",
                              transforms=presets.DetectionPresetEval())

# Iterate over the validation set in order.
test_sampler = torch.utils.data.SequentialSampler(dataset_test)
data_loader_test = torch.utils.data.DataLoader(
    dataset_test,
    batch_size=BATCH_SIZE,
    sampler=test_sampler,
    num_workers=4,
    pin_memory=True,
    collate_fn=utils.collate_fn)

# Build the COCO API object and the evaluator; no training is required.
coco = get_coco_api_from_dataset(data_loader_test.dataset)
coco_evaluator = DefaultCocoEvaluator(coco, ["bbox"])
Please let me know how it goes.
Thank you
I encountered the same error and resolved it by upgrading pycocotools from 2.0.0 to 2.0.4 in requirements.txt.
The cause of the error is the following code in pycocotools 2.0.0:
self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
The third argument is built from np.round, which returns a float here (np.float64), whereas np.linspace requires an integer for that argument, so the type mismatch raises an exception.
Up to NumPy 1.17, np.linspace only emitted a warning in this situation; since NumPy 1.18 it raises an exception. This is why some people cannot reproduce the error.
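The behaviour can be checked in isolation, without pycocotools. The sketch below evaluates the same expression and passes it to np.linspace, then shows that casting the count to int avoids the problem. The cast is one possible fix for illustration, not necessarily the exact change made in pycocotools 2.0.4.

```python
import numpy as np

# Same expression that pycocotools 2.0.0 passes as the `num` argument.
num = np.round((0.95 - 0.5) / 0.05) + 1
print(type(num), num)  # a NumPy float scalar, not an int

try:
    np.linspace(0.5, 0.95, num, endpoint=True)
    print("accepted (older NumPy, possibly with a deprecation warning)")
except TypeError as exc:
    print("rejected:", exc)

# Casting the count to int works regardless of the NumPy version.
iou_thrs = np.linspace(0.5, 0.95, int(num), endpoint=True)
print(iou_thrs)
```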
Thank you @fujitsu-notsu for helping identify the issue. I was able to reproduce the error after updating my numpy version. Updating pycocotools to 2.0.4 solved the error afterwards.
pycocotools 2.0.4 was released after the benchmark code had been finalized. I'll make sure to update the dependencies for v2.1.
Thanks for tracing the source of the problem, @fujitsu-notsu. Updating to pycocotools 2.0.4 fixed the problem for me.
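Since the error depends on the combination of numpy and pycocotools versions, it can help to print what is actually installed in the environment (for example inside the docker container). A minimal sketch, assuming Python 3.8 or newer; the helper name is hypothetical:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Per this thread, pycocotools 2.0.0 together with numpy >= 1.18 fails.
for name in ("numpy", "pycocotools"):
    print(name, installed_version(name))
```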
Closing as resolved.
This error occurred after the first epoch of training finished, when running the docker single-GPU experiment.