Closed: LukeWood closed this issue 2 years ago
After fixing the bbox format conversion we could partially achieve this (see the conversion sketch below). Please check https://github.com/keras-team/keras-cv/pull/246#discussion_r842005074
It goes down to approximately 0.04. One of the cases has a very high error margin; the rest are around 0.02. Quite strange.
What I mean is that we have two delta entries.
For a single test file it is solved, as we could go 10x lower.
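For context, the bbox format conversion referenced above boils down to something like the following. This is a minimal sketch assuming COCO-style `[x, y, width, height]` inputs; the actual keras_cv bounding box utilities may use a different layout:

```python
import tensorflow as tf

def xywh_to_xyxy(boxes):
    """Convert [x, y, w, h] boxes to [x1, y1, x2, y2] corner format.

    Minimal sketch of the conversion discussed above; not the exact
    keras_cv converter.
    """
    x, y, w, h = tf.unstack(boxes, axis=-1)
    return tf.stack([x, y, x + w, y + h], axis=-1)
```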
@LukeWood Are bboxes sorted before max_detections filtering? https://github.com/keras-team/keras-cv/blob/0131d55c1a77460ecde1ebc2602f915acf62f450/keras_cv/metrics/coco/recall.py#L143-L167
Wait a minute, it looks like no sorting is going on in recall anymore at all! With max_detections we SHOULD be sorting. Seems like a mistake on my end!
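For reference, the missing step is roughly the following: rank detections by confidence before truncating to max_detections. This is a minimal sketch assuming unbatched `[num_boxes, 6]` predictions with confidence in the last column, which may not match the exact tensor layout in recall.py:

```python
import tensorflow as tf

def truncate_to_max_detections(y_pred, max_detections):
    """Keep the top `max_detections` boxes, ranked by confidence.

    Sketch only: assumes `y_pred` is [num_boxes, 6] in
    [x, y, x2, y2, class, confidence] order.
    """
    confidence = y_pred[:, 5]
    # Sort descending by confidence BEFORE truncating. Without this,
    # max_detections keeps an arbitrary subset and recall is understated.
    order = tf.argsort(confidence, direction="DESCENDING")
    return tf.gather(y_pred, order)[:max_detections]
```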
@LukeWood Do you have another ticket for this?
I suggest you reopen this or create a new ticket, so we don't waste the time we spent debugging this. Thanks
We determined that the original implementation may not be correct in the first place, and that this issue is due to various discrepancies. I'm not sure any time was wasted, as we did improve the metrics.
Yes, but this seems like a bug to me, not a different interpretation (just as the wrong bbox format interpretation was a bug). Beyond that, if we care about the reproducibility of a training process with respect to papers, we must keep the error low enough, even if that means supporting a possibly incorrect "reference version" that was used to produce the results tables in those papers.
Some comments from @bhack on this topic:
When I reimplement something I need to reproduce paper numbers. Once we have fixed the bugs, we can reason about whether we want a cocoeval-compatible flag, if the drift is still relevant.
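To illustrate what such a flag might look like (purely hypothetical; no such option exists in keras_cv, and the constructor arguments are only approximate):

```python
import keras_cv

# Hypothetical API sketch: a switch that would make the metric reproduce
# pycocotools' COCOeval behavior exactly, trading speed for paper-exact
# numbers. `coco_compatible` does NOT exist in keras_cv.
recall = keras_cv.metrics.COCORecall(
    class_ids=[0, 1, 2],
    max_detections=100,
    coco_compatible=True,  # hypothetical flag
)
```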
We did extensive numerical benchmarking and published the results on arXiv: https://arxiv.org/abs/2207.12120
For now this is fine.
Currently the margins are 0.05. Ideally, this will be <0.01. This is likely caused by errors in rounding, or an inherent logic difference from the original COCO metrics implementation.
If someone wants to perform a deep dive here and figure out how to reduce these error margins, the contribution would be welcome; a sketch of the kind of drift check involved appears below.
This will likely be an extremely challenging task, good for a true TensorFlow veteran.
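For anyone attempting that deep dive, the benchmark in question is essentially a drift check of this shape (the values below are illustrative placeholders, not real measurements):

```python
# Illustrative drift check between keras_cv and pycocotools' COCOeval.
keras_cv_recall = 0.643   # metric computed by keras_cv (placeholder)
cocoeval_recall = 0.612   # same metric from pycocotools (placeholder)

delta = abs(keras_cv_recall - cocoeval_recall)
assert delta < 0.05, f"drift {delta:.4f} exceeds the current 0.05 margin"
# Goal: tighten the bound above to < 0.01.
```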