poloclub / unitable

UniTable: Towards a Unified Table Foundation Model
https://arxiv.org/abs/2403.04822
MIT License

Different number of cells and bbox #4

Closed valedica-core closed 3 months ago

valedica-core commented 5 months ago

Hi, thanks for sharing your work. I'm facing a problem that the library doesn't seem to handle, even though it is somewhat expected: what if the table structure extraction model outputs a different number of non-empty HTML cells than the table cell bbox detection model detects?

[reproducible example removed, since private data, thanks @ridhoalattas for yours!]

Did you encounter this problem before? Did you have some smart way to solve this problem?

ridhoalattas commented 5 months ago

Did you find a solution, @valedica-core?

I have exactly the same case: [image]

but I got a mismatched result: [image]

I'm already slicing the image every 200 pixels along the height to avoid GPU memory problems. The result is indeed better, but I still get those mismatched results.

Please help, @ShengYun-Peng.

ShengYun-Peng commented 5 months ago

Thank you both for the question! I believe the above example is out-of-distribution (OOD) for our training set, as the spanning cell text is not aligned with any column in the header. A high-quality training dataset with abundant tables of various styles, colors, and designs would be ideal to resolve this.

How to perfectly align the cells predicted by the bbox branch and the structure branch is an open question. Currently, UniTable divides table parsing into structure + bbox + cell content. I hope there's a new way to divide and parse the table so that UniTable and the new method can double-check each other's outputs. Feel free to share your thoughts on this, @valedica-core @ridhoalattas!

valedica-core commented 5 months ago

Thanks @ShengYun-Peng for the reply and again for this library.

I saw that some other authors (Nam Tuan Ly et al., 2023) have proposed multi-task models to align the sub-tasks a bit more, but I'd say the beauty of this library's approach is its simplicity. I wonder, then, why not employ a single decoder model that outputs both HTML tags and bboxes at the same time. Is that feasible?

As for how I tried to fix the current issue, I first framed it as a minimization problem: find the best assignment of bboxes to HTML cells (possibly skipping cells, or splitting a cell to fit more than one box) that minimizes the number of misplaced bboxes, where "misplaced" means that left-right or above-below relationships between bboxes are broken once you assign them to cells. However, with big tables and a lot of empty cells, the problem seems intractable.
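A minimal sketch of that objective, under several assumptions that are not from the original attempt: bboxes are `(x_min, y_min, x_max, y_max)` tuples, HTML cells are `(row, col)` pairs, box centres stand in for box positions, and broken pairs are counted as a proxy for misplaced boxes. The names `misplaced_pairs` and `best_assignment` are hypothetical. The exhaustive search only covers skipping cells, not splitting them, and it illustrates why the search space explodes on larger tables.

```python
from itertools import combinations, permutations
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Cell = Tuple[int, int]                    # (row, col) of an HTML cell


def misplaced_pairs(bboxes: Sequence[BBox], assignment: Sequence[Cell]) -> int:
    """Count box pairs whose spatial order contradicts their assigned cells."""
    broken = 0
    for i, j in combinations(range(len(bboxes)), 2):
        (ri, ci), (rj, cj) = assignment[i], assignment[j]
        xi = (bboxes[i][0] + bboxes[i][2]) / 2
        xj = (bboxes[j][0] + bboxes[j][2]) / 2
        yi = (bboxes[i][1] + bboxes[i][3]) / 2
        yj = (bboxes[j][1] + bboxes[j][3]) / 2
        if (ci - cj) * (xi - xj) < 0:
            broken += 1  # left-right order is flipped
        elif (ri - rj) * (yi - yj) < 0:
            broken += 1  # above-below order is flipped
    return broken


def best_assignment(bboxes: Sequence[BBox], cells: Sequence[Cell]) -> List[Cell]:
    """Brute force: try every way to place the boxes into distinct cells.

    Requires len(cells) >= len(bboxes); unused cells are the skipped ones.
    The number of candidates is n! / (n - k)! for n cells and k boxes.
    """
    candidates = permutations(cells, len(bboxes))
    return list(min(candidates, key=lambda a: misplaced_pairs(bboxes, a)))
```

With 30 cells and 25 boxes the search already has roughly 2 × 10^30 candidates, so a practical version would need pruning or a different formulation.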

The simpler, more drastic approach I ended up with is to use only the bboxes, drawing the grid lines at local minima of the number of intersections with bboxes. It has some problems with spanning cells or columns, but I'm not too concerned about that, since my goal is a lossy Markdown representation as output.
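The bbox-only idea can be sketched roughly as follows. This is an illustration under assumptions, not the code actually used: boxes are `(x_min, y_min, x_max, y_max)` in pixels, candidate grid lines sit at integer pixel positions, and the function names (`crossing_counts`, `local_minima`, `to_grid`) are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), pixels


def crossing_counts(bboxes: Sequence[BBox], axis: int, length: int) -> List[int]:
    """counts[p] = how many boxes a grid line at pixel p along `axis` would cut.

    axis=0 scans x positions (column separators), axis=1 scans y (row separators).
    """
    counts = [0] * length
    for box in bboxes:
        lo, hi = int(box[axis]), int(box[axis + 2])
        for p in range(max(0, lo), min(length, hi + 1)):
            counts[p] += 1
    return counts


def local_minima(counts: Sequence[int]) -> List[int]:
    """Midpoint of every interior plateau that is lower than both neighbours."""
    lines: List[int] = []
    i, n = 0, len(counts)
    while i < n:
        j = i
        while j + 1 < n and counts[j + 1] == counts[i]:
            j += 1  # extend the plateau of equal values
        if i > 0 and j < n - 1 and counts[i - 1] > counts[i] and counts[j + 1] > counts[i]:
            lines.append((i + j) // 2)
        i = j + 1
    return lines


def to_grid(bboxes: Sequence[BBox], width: int, height: int) -> List[Tuple[int, int]]:
    """Map each box to a (row, col) index using its centre point."""
    col_lines = local_minima(crossing_counts(bboxes, 0, width))
    row_lines = local_minima(crossing_counts(bboxes, 1, height))
    cells = []
    for x0, y0, x1, y1 in bboxes:
        row = bisect_right(row_lines, (y0 + y1) / 2)
        col = bisect_right(col_lines, (x0 + x1) / 2)
        cells.append((row, col))
    return cells
```

Because a line is placed at a local minimum rather than only where the count is zero, it can still be drawn through a spanning cell, which is then assigned to whichever grid cell contains its centre.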

ShengYun-Peng commented 4 months ago

Thanks @valedica-core ! It seems like table cell alignment is a good research direction. We tried GPT-4o for spanning cells, and UniTable still did better on table structure understanding. https://x.com/RealAnthonyPeng/status/1790431978829087123