yuyq96 opened 2 years ago
BTW, there is another confusing problem. For example, in `PMC1079800_table_0`, every column in the initial column header has exactly one complete cell that only spans that column: `column header cell 0-0 (text: Median age (range), yrs)` and `column header cell 0-1 (text: 64 (60–73))`. According to step 5, the while loop should terminate immediately. However, the following three vanilla rows without any spanning cell are mistakenly recognized as part of the column header, which is not consistent with the algorithm.
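For concreteness, here is a minimal Python sketch of the step-5 stopping rule as I read it; the function name and the cell representation (inclusive `row_start`/`row_end`/`col_start`/`col_end` grid indices) are my own assumptions, not taken from the released code:

```python
def infer_header_rows(cells, num_rows, num_cols):
    """Grow the column header downward one row at a time; stop as soon as
    every column contains exactly one complete cell spanning only it."""
    for header_rows in range(1, num_rows + 1):
        def resolved(col):
            # Cells lying fully inside the first `header_rows` rows
            # that span exactly this one column.
            matches = [
                c for c in cells
                if c["row_end"] < header_rows
                and c["col_start"] == col == c["col_end"]
            ]
            return len(matches) == 1
        if all(resolved(col) for col in range(num_cols)):
            return header_rows
    return num_rows  # no clean boundary found; fall back to all rows

# Row 0 of PMC1079800_table_0 already has one single-column cell per
# column, so the loop should stop immediately with a one-row header.
cells = [
    {"row_start": 0, "row_end": 0, "col_start": 0, "col_end": 0},  # Median age (range), yrs
    {"row_start": 0, "row_end": 0, "col_start": 1, "col_end": 1},  # 64 (60–73)
]
print(infer_header_rows(cells, num_rows=12, num_cols=2))  # -> 1 (total row count is arbitrary here)
```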
Hi, thanks for the kind words, great questions, and great points raised.
I'll try to get to all of them eventually but for now I wanted to give you an explanation for the last example you brought up.
We derived PubTables-1M from a set of source HTML tables in the PMC dataset. Unlike most source table data, such as FinTabNet, this data did include a header annotation. In the case of PMC1079800_table_0, the original header annotation looks like this:
```html
<thead>
  <tr><td align="left">Median age (range), yrs</td><td align="center">64 (60–73)</td></tr>
  <tr><td align="left">Age category, n (%)</td><td></td></tr>
  <tr><td align="left">  >60 and < 65 years</td><td align="center">100 (56)</td></tr>
  <tr><td align="left">  ≥ 65 years</td><td align="center">80 (44)</td></tr>
</thead>
```
As you can see, in this case the original annotation is wrong with regard to the column header, which does happen quite a lot.
You're right that our algorithm for annotating headers would have only annotated the first row as being part of the header.
However, in creating PubTables-1M we tried to strike a balance between correcting and trusting the original annotations. So we only tried to extend the header to more rows, never reduce it to fewer rows. We do say this in the paper, but maybe we can make this clearer:
> These assumptions enable us to determine if a row or column header is only partially annotated and if so, to extend it to additional columns or rows, respectively.
The reason we only extend the header and never shrink it is that we thought there was value in letting the original annotations override our simple algorithm in cases like these, in case there were edge cases we had not considered. Given that there will always be edge cases, it's not obvious whether there would be more errors or fewer if we let our algorithm always determine the header and override the original annotators.
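To make that concrete with this example, a rough sketch of the policy (the variable names are just for illustration, not from our code):

```python
# "Extend, never shrink": the original annotation is a lower bound.
annotated_header_rows = 4  # the four <tr> rows inside <thead> above
inferred_header_rows = 1   # what our header algorithm would infer here

# We may add header rows the annotators missed, but we never remove
# rows they marked, so the original (wrong) 4-row header survives.
final_header_rows = max(annotated_header_rows, inferred_header_rows)
assert final_header_rows == 4
```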
I hope this explains that example for you. Thanks again for the question!
Hi, thanks for the explanation.
Now I am sure it is the wrong original annotation that leads to such cases. I agree with your strategy for deciding when to trust the original annotations: it avoids introducing more errors and lets data users decide whether to handle such cases themselves.
Looking forward to your answer to the first question.
Thanks again!
Hi, thanks for releasing the PubTables-1M dataset. It took me a lot of time to clean the PubTabNet dataset, and the oversegmentation problem is probably the trickiest part. The release of PubTables-1M not only increases the amount of data but also provides a good solution to the oversegmentation problem.
However, in Algorithm 1, step 10 might lead to problems like:

- `PMC1064102_table_2`: `nonblank cell 1-1 (text: None)`, `blank cell 2-1` and `nonblank cell 3-1 (text: 3 (4)b)` are merged. However, `None` and `3 (4)b` are not semantically coherent and they correspond to different row headers (`Addition` and `Gene`), so we should only merge `nonblank cell 1-1 (text: None)` and `blank cell 2-1`. Likewise, `blank cell 0-0`, `nonblank cell 1-0 (text: Addition...a)`, `blank cell 2-0` and `nonblank cell 3-0 (text: Gene)` are merged, but we should only merge `nonblank cell 1-0 (text: Addition...a)` and `blank cell 2-0`.
- `PMC1064102_table_0`: `nonblank cell 0-0 (text: RNA no.)` and `nonblank cell 1-0 (text: 1)` are mistakenly merged.

Sorry, I cannot upload the images since I am using the company's network.
Column oversegmentation usually occurs in top-aligned spanning cells with one or zero lines of text. Hence, it is helpful to merge a cell (nonblank or blank) with the blank cells below it, but I suspect it is not worthwhile to merge nonblank cells.
Besides, errors caused by step 10 cannot be easily corrected afterwards; maybe it should be removed from the algorithm?
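As a sketch of the more conservative rule I have in mind (the function and the cell representation are hypothetical, not from the released code): a cell may absorb the run of blank cells directly below it in the same column, but two nonblank cells are never merged.

```python
def conservative_merge(column_cells):
    """`column_cells` lists one column's cells from top to bottom; each has
    a 'text' field (empty for a blank cell). Returns groups of row indices
    that may be merged into a single spanning cell."""
    groups, current = [], []
    for row, cell in enumerate(column_cells):
        if cell["text"] and current:
            # A nonblank cell closes the previous group, so two nonblank
            # cells can never end up in the same merge group.
            groups.append(current)
            current = []
        current.append(row)
    if current:
        groups.append(current)
    return groups

# Column 0 of PMC1064102_table_2 from the list above:
col0 = [{"text": ""}, {"text": "Addition...a"}, {"text": ""}, {"text": "Gene"}]
print(conservative_merge(col0))  # [[0], [1, 2], [3]]: only 1-0 and 2-0 merge
```

Under this rule, `nonblank cell 0-0 (text: RNA no.)` and `nonblank cell 1-0 (text: 1)` in `PMC1064102_table_0` would also stay separate.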