mlcommons / inference_policies

Issues related to MLPerf™ Inference policies, including rules and suggested changes
https://mlcommons.org/en/groups/inference/
Apache License 2.0

Update the open dataset requirement #285

Closed psyhtest closed 10 months ago

github-actions[bot] commented 10 months ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

psyhtest commented 10 months ago

Justifying the removed text

From v3.0, if a submitter provides any results with any models trained on a pre-approved dataset, the submitter must also provide at least one result with the corresponding Closed model trained (or finetuned) on the same pre-approved dataset, and instructions to reproduce the training (or finetuning) process.

I recall we introduced this rule specifically for RetinaNet, just before it debuted in v2.1. At the time, the RetinaNet dataset, an MLPerf subset of OpenImages, was used for benchmarking one and only one model, namely the MLPerf variant of RetinaNet. Therefore, we would miss out on objectively benchmarking other research Object Detection models, which are typically trained and validated on the COCO dataset. The idea was that a potential submitter would finetune RetinaNet on COCO too, and thus provide a useful baseline figure for any comparisons on the alternative dataset.

We at KRAI actually did this for v2.1, measuring mAP=35.293% and publishing the finetuned model. This accuracy is lower than that of the reference model on OpenImages (mAP=37.55%), but much higher than, say, that of the deprecated SSD-ResNet34 model (mAP=20.00%). So a submitter showcasing their highly optimized SSD-ResNet34 implementation could legitimately claim that it is faster than RetinaNet, albeit less accurate.
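For context, this kind of baseline measurement can be reproduced with standard tooling. Below is a minimal sketch, not the MLPerf reference code: the checkpoint name `retinanet_coco_finetuned.pth` and the COCO paths are placeholders, and the torchvision ResNet50-FPN backbone used here is not necessarily the exact MLPerf RetinaNet variant. It scores a finetuned checkpoint on COCO val2017 with pycocotools to obtain an mAP figure:

```python
# Sketch: evaluate a (hypothetically) finetuned torchvision RetinaNet on COCO val2017.
import torch
import torchvision
from torchvision.datasets import CocoDetection
from torchvision.transforms import functional as F
from pycocotools.cocoeval import COCOeval

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder checkpoint and backbone choice; adjust to the actual finetuned model.
model = torchvision.models.detection.retinanet_resnet50_fpn(weights=None, num_classes=91)
model.load_state_dict(torch.load("retinanet_coco_finetuned.pth", map_location=device))
model.eval().to(device)

# Placeholder COCO paths.
dataset = CocoDetection("coco/val2017", "coco/annotations/instances_val2017.json")

results = []
with torch.no_grad():
    for idx in range(len(dataset)):
        img, _ = dataset[idx]
        image_id = dataset.ids[idx]
        (pred,) = model([F.to_tensor(img).to(device)])
        for box, score, label in zip(pred["boxes"], pred["scores"], pred["labels"]):
            x1, y1, x2, y2 = box.tolist()
            results.append({
                "image_id": image_id,
                "category_id": int(label),
                "bbox": [x1, y1, x2 - x1, y2 - y1],  # COCO expects [x, y, w, h]
                "score": float(score),
            })

coco_gt = dataset.coco
coco_dt = coco_gt.loadRes(results)
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # first line, AP@[0.50:0.95], is the headline mAP figure
```

The AP@[0.50:0.95] value printed by `summarize()` is the mAP number quoted above.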

This is not foolproof, however. A submitter could spend minimal effort on finetuning (or skip it altogether), claiming, for example, that RetinaNet achieves only mAP=10% on the COCO dataset. Then they could misleadingly claim that their optimized SSD-ResNet34 implementation is both faster than RetinaNet and more accurate.

Justifying the added text

When seeking such pre-approval, it is recommended that a potential submitter convincingly demonstrates the accuracy of the corresponding Closed model on the same validation dataset, which may involve retraining or finetuning the Closed model if required.

This is intended to avoid the above situation. At the very least, such a submitter would face scrutiny from the WG at the pre-approval stage :). They may still get away with handwaving it through, though :).

nv-ananjappa commented 10 months ago

@psyhtest This is perfect. Covers everything we wanted to change. LGTM.