If our goal is to evaluate how well DETR predicts bounding boxes, then, to your point, of course we want to use the bounding boxes exactly as the DETR model predicts them. That's what we do when we report the object detection metrics (AP, AP50, etc.).
But to evaluate TSR, we want to know how well the system actually solves the task of extracting a table. That is what every TSR metric is intended to measure.
The confusion might lie in what is meant by the word "model". In our system, the TSR task is solved by a learned/trained DETR model plus un-learned/un-trained post-processing steps. The overall TSR system is designed to combine DETR's predicted bounding boxes with text bounding boxes that are determined or supplied separately from DETR; this combination step, which implicitly refines DETR's predicted bounding boxes, is what we refer to as post-processing. These un-learned post-processing steps have to be included when evaluating the TSR task, so what we evaluate for TSR is the combined system, not the DETR model in isolation.
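To make the combination step concrete, here is a minimal sketch of the kind of post-processing being described. This is not the repository's actual post-processing code; the function names, the IoB threshold, and the exact refinement rules are illustrative assumptions.

```python
# Minimal sketch of the combination step described above (not the repo's
# actual post-processing code; helper names and thresholds are illustrative).

def iob(box, container):
    """Intersection area of `box` with `container`, divided by the area of `box`."""
    x1, y1 = max(box[0], container[0]), max(box[1], container[1])
    x2, y2 = min(box[2], container[2]), min(box[3], container[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = max(1e-6, (box[2] - box[0]) * (box[3] - box[1]))
    return inter / area

def refine_to_tokens(pred_boxes, word_boxes, iob_thresh=0.5):
    """Shrink each predicted (dilated) row/column box to the tight extent of the
    word boxes it contains; boxes containing no words are left unchanged."""
    refined = []
    for box in pred_boxes:
        words = [w for w in word_boxes if iob(w, box) >= iob_thresh]
        if words:
            refined.append([min(w[0] for w in words), min(w[1] for w in words),
                            max(w[2] for w in words), max(w[3] for w in words)])
        else:
            refined.append(list(box))
    return refined

def cells_from_rows_and_columns(row_boxes, col_boxes):
    """Form grid cells as the intersections of refined row and column boxes."""
    return [[[c[0], r[1], c[2], r[3]] for c in col_boxes] for r in row_boxes]
```

In this sketch, the refined rows, columns, and cells are what a TSR metric such as GriTS would then be computed on, which is the sense in which post-processing is part of the evaluated system.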
In fact, the way we design the system to use object detection (from DETR) actually means that the post-processed bounding boxes are the only reasonable thing to use to evaluate TSR.
This is because we ask DETR to predict what we call "dilated" rows and columns, which intentionally include whitespace that is outside of the true row and column. It wouldn't be meaningful to evaluate these dilated bounding boxes for TSR because the additional whitespace has no effect on the extraction outcome.
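To illustrate why evaluating the dilated boxes directly would not be meaningful, here is a tiny continuation of the sketch above (the coordinates are made up): two dilated column predictions that differ only in how much surrounding whitespace they cover refine to the same box, and therefore yield the same extracted table.

```python
# Continuing the sketch above: two dilated column predictions that differ only
# in surrounding whitespace refine to the same tight box, so the extraction
# outcome is identical. Coordinates are made up for illustration.
word_boxes = [[12, 10, 48, 20], [12, 30, 45, 40]]  # two words in one column
dilated_a = [[0, 0, 60, 50]]   # generous whitespace around the column
dilated_b = [[5, 5, 55, 45]]   # less whitespace around the same column

assert refine_to_tokens(dilated_a, word_boxes) == refine_to_tokens(dilated_b, word_boxes)
```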
(As an aside, if you dig further into the code you'll see that we do include the option to pretend the dilated bounding boxes are the table ground truth and compute a GriTS score based on this that ignores the text bounding boxes. But that metric would not allow a meaningful comparison except with systems that also use dilated bounding boxes exactly as we do.)
The fact that the dilated bounding boxes are not appropriate for evaluating TSR is mentioned in our paper, but I hope this answer clarifies things a bit further.
If there's anything you'd like to discuss further, I recommend we close this issue first and move it to the discussions section.
Thanks for your reply. This is really helpful. I'll close this issue.
Hi, I read the post-processing code and found that the PubTables1M-Table-Words-JSON information is used to refine the columns and rows when evaluating the TSR model's performance. But this word information is not part of the model's outputs. Is it reasonable to use it to evaluate the model?