roboflow / roboflow-100-benchmark

Code for replicating Roboflow 100 benchmark results and programmatically downloading benchmark datasets
https://www.rf100.org
MIT License
244 stars · 23 forks

Evaluation Process #44

Closed: Louis-Dupont closed this issue 1 year ago

Louis-Dupont commented 1 year ago

Hi,

First, thanks a lot for this r100 benchmark, it's really great!

I would like to run it on my custom model, but I did not find a step-by-step explanation of how to do so (did I miss it?). There are two points I would like to clarify about the benchmark evaluation process.

1.

From what I understood, the evaluation process is as follows:

Which means that YOU DON'T:

Did I understand correctly?

2.

How do you aggregate by category? Do you:

Thanks a lot 🙏

Jacobsolawetz commented 1 year ago

Hello @Louis-Dupont! Thanks for reaching out

Yes, you are spot on in your understanding of 1.

  1. Compute mAP per dataset, and then take the simple average over the datasets in the category
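For concreteness, the aggregation described above can be sketched as follows. This is only an illustration: the category names, dataset names, and mAP values below are made up, not actual RF100 results.

```python
# Sketch of the RF100 category aggregation described above:
# compute mAP per dataset, then take the simple (unweighted) average
# of the per-dataset mAPs within each category.
# NOTE: all names and scores here are hypothetical, for illustration only.
from collections import defaultdict
from statistics import mean

# Hypothetical per-dataset mAP results, keyed by (category, dataset).
per_dataset_map = {
    ("aerial", "dataset-a"): 0.62,
    ("aerial", "dataset-b"): 0.58,
    ("microscopic", "dataset-c"): 0.71,
}

# Group the per-dataset scores by category.
by_category = defaultdict(list)
for (category, _dataset), score in per_dataset_map.items():
    by_category[category].append(score)

# Simple average within each category.
category_map = {cat: mean(scores) for cat, scores in by_category.items()}
print(category_map)  # per-category mAP, e.g. "aerial" averages to 0.60
```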

We have had ideas to work on making models that shift tasks and keep older tasks in memory - do you have a particular approach in mind?

Louis-Dupont commented 1 year ago

Thanks a lot for your answer! To be honest, I've never done anything like that myself, so I don't want to just make up approaches :)