keras-team / keras-cv

Industry-strength Computer Vision workflows with Keras

Tighten error margins in metrics/coco/numerical_tests #222

Closed LukeWood closed 2 years ago

LukeWood commented 2 years ago

Currently the error margins are 0.05. Ideally, this will be <0.01. The gap is likely caused by rounding errors, or by an inherent logic difference from the original COCO metrics implementation.
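As a rough illustration of what tightening the margin means (the helper and the numbers below are hypothetical, not the actual code in metrics/coco/numerical_tests), the tests compare our metric values against reference values within a tolerance, and the goal is to shrink that tolerance:

```python
# Hypothetical sketch of a numerical comparison test; the values and the
# helper name are illustrative, not taken from keras-cv.
def assert_close(actual, reference, tolerance):
    """Fail if the metric drifts from the reference by more than tolerance."""
    delta = abs(actual - reference)
    assert delta < tolerance, f"delta {delta:.4f} exceeds tolerance {tolerance}"

# Current margin: 0.05. Goal: < 0.01.
assert_close(actual=0.612, reference=0.655, tolerance=0.05)  # passes today
```

The same comparison with `tolerance=0.01` would fail, which is exactly the gap this issue asks someone to close.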

If someone wants to do a deep dive here and figure out how to reduce these error margins, the contribution would be very welcome.

This will likely be an extremely challenging task, good for a true TensorFlow veteran.

bhack commented 2 years ago

After fixing the bbox format conversion we could achieve this partially. Please check https://github.com/keras-team/keras-cv/pull/246#discussion_r842005074

LukeWood commented 2 years ago

It goes down to approximately 0.04. One of the cases has a very high error margin, the rest are around 0.02. Quite strange.

bhack commented 2 years ago

What I mean is that we have two delta entries.

For a single test file it is solved, as we could go 10x lower.

bhack commented 2 years ago

@LukeWood Are bboxes sorted before max_detections filtering? https://github.com/keras-team/keras-cv/blob/0131d55c1a77460ecde1ebc2602f915acf62f450/keras_cv/metrics/coco/recall.py#L143-L167

LukeWood commented 2 years ago

@LukeWood Are bboxes sorted before max_detections filtering?

https://github.com/keras-team/keras-cv/blob/0131d55c1a77460ecde1ebc2602f915acf62f450/keras_cv/metrics/coco/recall.py#L143-L167

Wait a minute - it looks like no sorting is going on there at all anymore in recall! With max_detections we SHOULD be sorting by confidence. Seems like a mistake on my end!
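The fix being described can be sketched as follows (a minimal illustration, not the actual keras_cv/metrics/coco/recall.py code): detections should be sorted by confidence in descending order before truncating to max_detections, otherwise the cutoff keeps arbitrary boxes rather than the most confident ones.

```python
# Minimal sketch (not the actual keras-cv recall implementation):
# sort detections by confidence before applying the max_detections cutoff.

def top_detections(boxes, confidences, max_detections):
    """Keep the max_detections highest-confidence boxes, in confidence order."""
    order = sorted(range(len(boxes)), key=lambda i: confidences[i], reverse=True)
    keep = order[:max_detections]
    return [boxes[i] for i in keep]

# Without the sort, a low-confidence box could survive the cutoff while a
# high-confidence one is dropped, which skews recall at max_detections.
boxes = ["box_a", "box_b", "box_c"]
confidences = [0.2, 0.9, 0.5]
print(top_detections(boxes, confidences, max_detections=2))  # ['box_b', 'box_c']
```

This matches the behavior of the reference COCO evaluation, which ranks detections by score before applying the detection limit.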

bhack commented 2 years ago

Wait a minute - it looks like no sorting is going on there at all anymore in recall! With max_detections we SHOULD be sorting by confidence. Seems like a mistake on my end!

@LukeWood Do you have another ticket for this?

bhack commented 2 years ago

I suggest you reopen this or create a new ticket so the time we spent debugging this isn't wasted. Thanks

LukeWood commented 2 years ago

We determined that the original implementation may not be correct in the first place, and that this issue is due to various discrepancies. I'm not sure any time was wasted, as we did improve the metrics.

bhack commented 2 years ago

Yes, but this seems like a bug to me, not a difference in interpretation (just as the wrong bbox format interpretation was a bug). Beyond these, if we care about reproducibility of a training process with respect to papers, we must keep the error low enough, even if that means supporting a possibly wrong "reference version" that was used to produce the results tables in the papers.

LukeWood commented 2 years ago

Some comments from @bhack on this topic:

When I reimplement something I need to reproduce paper numbers. Once we have fixed the bugs, we could reason about whether we want a cocoeval-compatible flag, if the drift is still relevant.

LukeWood commented 2 years ago

We did extensive numerical benchmarking and published the results on arXiv: https://arxiv.org/abs/2207.12120

LukeWood commented 2 years ago

For now this is fine.