mlcommons / tiny

MLPerf™ Tiny is an ML benchmark suite for extremely low-power systems such as microcontrollers
https://mlcommons.org/en/groups/inference-tiny/
Apache License 2.0

Where is the Visual Wake Word test set? #135

Open LucaUrbinati44 opened 1 year ago

LucaUrbinati44 commented 1 year ago

I would like to evaluate the pretrained MobileNet model on the preprocessed COCO2014 test set, but I am not able to find this preprocessed test set anywhere in the repo. Where can I find it? For the other three benchmarks (AD, IC, KS) the test sets are already provided in the repo.

I suspect I have to generate it myself using this script with dataType='test2014', since this appears to be the same script that was used to create the training+validation dataset used for training, which can be downloaded here.

Moreover, the paper "MLPerf Tiny Benchmark" mentions this test set for the VWW problem in Section 4.1.

Finally, why is there no test.py (or evaluate.py) script to run the model on the test set, while such scripts exist for the other three benchmarks (AD, IC, KS)?

Thank you, Regards, Luca Urbinati

colbybanbury commented 1 year ago

Good question!

MS-COCO does not publish the labels (i.e., annotations) for its test set and instead runs competitions around it. As a result, Visual Wake Words does not contain an explicit test set.

It's traditionally best practice to use the validation set as the test set and, if needed, hold out a small percentage of the training set for validation. MLPerf Tiny should potentially adopt this practice, including an update to the paper.
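For concreteness, a minimal sketch of what that split could look like with Keras. This is not the official train_vww.py; the directory layout, 96×96 input size, and 10% validation fraction are assumptions:

```python
# Hedged sketch: hold back 10% of the training images for validation and keep
# the provided "val" images as the one-shot test set. Paths and settings are
# assumptions, not the official train_vww.py configuration.
import tensorflow as tf

IMAGE_SIZE = 96
BATCH_SIZE = 32

datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,
    validation_split=0.1,   # carve 10% of the training images out for validation
)
train_gen = datagen.flow_from_directory(
    "vww/train",            # assumed layout: person/ and non_person/ subdirectories
    target_size=(IMAGE_SIZE, IMAGE_SIZE),
    batch_size=BATCH_SIZE,
    subset="training",
)
val_gen = datagen.flow_from_directory(
    "vww/train",
    target_size=(IMAGE_SIZE, IMAGE_SIZE),
    batch_size=BATCH_SIZE,
    subset="validation",
)

# The original validation split is then touched only once, as the test set.
test_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "vww/val",
    target_size=(IMAGE_SIZE, IMAGE_SIZE),
    batch_size=BATCH_SIZE,
    shuffle=False,
)
```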

@cskiraly and @jeremy-syn, who currently owns the VWW benchmark? I'm happy to help make the change if needed.

LucasFischer123 commented 1 year ago

Hi @colbybanbury @LucaUrbinati44

Any news on this issue?

Thanks

Lucas

LucaUrbinati44 commented 1 year ago

Hi @LucasFischer123,

Short answer: We "solved" it by using 10% of the whole dataset as the validation set during training (as done in the train_vww.py script) and then using these 1000 images for testing.

Long answer: We discovered that these 1000 images come from the provided dataset itself. So, as a first experiment, we removed those 1000 images from the dataset, trained a floating-point model from scratch on the remainder using train_vww.py (without changing anything in the training script), and then ran inference on the 1000 images. The resulting accuracy was around 83%, lower than the 86% reported in the paper.
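For reference, the removal step in that first experiment could look roughly like the sketch below (assuming the 1000-image list is the y_labels.csv file referenced later in this thread; the CSV column order, label convention, and directory layout are all assumptions):

```python
# Hedged sketch: move the 1000 listed images out of the training tree so that
# train_vww.py never sees them and they are only used at test time.
import csv
import shutil
from pathlib import Path

DATASET_DIR = Path("vw_coco2014_96")   # assumed location of the provided dataset
HOLDOUT_DIR = Path("vww_test_1000")    # destination for the held-out test images

with open("y_labels.csv", newline="") as f:
    for row in csv.reader(f):
        filename, label = row[0].strip(), int(row[1])      # assumed columns: filename, label
        subdir = "person" if label == 1 else "non_person"  # assumed label convention
        src = DATASET_DIR / subdir / filename
        dst = HOLDOUT_DIR / subdir / filename
        dst.parent.mkdir(parents=True, exist_ok=True)
        if src.exists():
            shutil.move(str(src), str(dst))
```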

Then, as a second experiment, we trained the model again from scratch, this time on the whole dataset, i.e. without removing the 1000 images. This time the test accuracy on the 1000 images was 86%, matching the paper.

Since the second experiment reproduced the paper's results, we decided to go with this second "solution" (see "Short answer").

However, we know this procedure is not 100% correct, since the model sees the 1000 test images during training as well.

Thus, we hope the organizers can address this issue soon, both in the repo instructions and in the paper.

Thank you all, Luca Urbinati and Marco Terlizzi

NilsGraf commented 1 year ago

Hi @LucaUrbinati44 @colbybanbury @LucasFischer123 @cskiraly and @jeremy-syn

I had a similar question on how to evaluate accuracy. I created this Jupyter notebook, which you can run in your browser (or use this script if you prefer running locally).

This script downloads the dataset from Silabs and runs both TFLite reference models (the int8 model and the float model) on the 1000 images listed in y_labels.csv to measure their accuracy. I get the results below:

```
float accuracy: 85.2
int8 accuracy : 85.9
image count   : 1000
```

Does this look correct?

BTW, I get 86.0% int8 accuracy (instead of 85.9%) when I run on an M1 MacBook instead of Colab.

NilsGraf commented 1 year ago

One more note: for the int8 accuracy, a few of the test cases in y_labels.csv produce a probability of exactly 0.5 (i.e. a signed int8 value of 0, or an unsigned int8 value of 128). In my script I assume that probability-of-person = 0.5 indicates a person. Changing this to non-person reduces the int8 accuracy by 0.3%.
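For anyone comparing numbers, the evaluation loop could look roughly like the sketch below. It is not the linked notebook; the model filename, CSV columns, [0, 1] input preprocessing, and the index of the "person" output are assumptions, and the >= 0.5 comparison implements the tie-break described above (an output of exactly 0.5 counts as a person):

```python
# Hedged sketch of an int8 TFLite accuracy loop over the 1000 images in y_labels.csv.
import csv
import numpy as np
import tensorflow as tf
from PIL import Image

interpreter = tf.lite.Interpreter(model_path="vww_96_int8.tflite")  # assumed filename
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
in_scale, in_zero = inp["quantization"]
out_scale, out_zero = out["quantization"]

correct = total = 0
with open("y_labels.csv", newline="") as f:
    for row in csv.reader(f):
        fname, label = row[0].strip(), int(row[1])             # assumed columns: filename, label
        img = Image.open(fname).convert("RGB").resize((96, 96))
        x = np.asarray(img, dtype=np.float32) / 255.0          # assumed [0, 1] preprocessing
        q = np.round(x / in_scale + in_zero).astype(inp["dtype"])
        interpreter.set_tensor(inp["index"], q[np.newaxis, ...])
        interpreter.invoke()
        scores = interpreter.get_tensor(out["index"])[0].astype(np.float32)
        p_person = (scores[1] - out_zero) * out_scale          # assumed index 1 = "person"
        pred = 1 if p_person >= 0.5 else 0                     # exactly 0.5 counts as person
        correct += int(pred == label)
        total += 1

print(f"int8 accuracy: {100.0 * correct / total:.1f}  (n={total})")
```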