yuyq96 opened 2 years ago
BTW, there is another confusing problem. For example, in `PMC1079800_table_0`, every column in the initial column header has exactly one complete cell that only spans that column: `column header cell 0-0 (text: Median age (range), yrs)` and `column header cell 0-1 (text: 64 (60–73))`. According to step 5, the while loop should terminate immediately. However, the following three vanilla rows without any spanning cell are mistakenly recognized as part of the column header, which is not consistent with the algorithm.
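For concreteness, here is a minimal Python sketch of the step-5 stopping rule as I read it; the function name and the cell representation (inclusive `row_start`/`row_end`/`col_start`/`col_end` grid indices) are my own assumptions, not taken from the released code:

```python
def infer_header_rows(cells, num_rows, num_cols):
    """Grow the column header downward one row at a time; stop as soon as
    every column contains exactly one complete cell spanning only it."""
    for header_rows in range(1, num_rows + 1):
        def resolved(col):
            # Cells lying fully inside the first `header_rows` rows
            # that span exactly this one column.
            matches = [
                c for c in cells
                if c["row_end"] < header_rows
                and c["col_start"] == col == c["col_end"]
            ]
            return len(matches) == 1
        if all(resolved(col) for col in range(num_cols)):
            return header_rows
    return num_rows  # no clean boundary found; fall back to all rows

# Row 0 of PMC1079800_table_0 already has one single-column cell per
# column, so the loop should stop immediately with a one-row header.
cells = [
    {"row_start": 0, "row_end": 0, "col_start": 0, "col_end": 0},  # Median age (range), yrs
    {"row_start": 0, "row_end": 0, "col_start": 1, "col_end": 1},  # 64 (60–73)
]
print(infer_header_rows(cells, num_rows=12, num_cols=2))  # -> 1 (total row count is arbitrary here)
```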
Hi, thanks for the kind words, great questions, and great points raised.
I'll try to get to all of them eventually but for now I wanted to give you an explanation for the last example you brought up.
We derived PubTables-1M from a set of source HTML tables in the PMC dataset. Unlike most source table data, such as FinTabNet, this data did include a header annotation. In the case of PMC1079800_table_0, the original header annotation looks like this:
```html
<thead>
  <tr><td align="left">Median age (range), yrs</td><td align="center">64 (60–73)</td></tr>
  <tr><td align="left">Age category, n (%)</td><td></td></tr>
  <tr><td align="left">  >60 and < 65 years</td><td align="center">100 (56)</td></tr>
  <tr><td align="left">  ≥ 65 years</td><td align="center">80 (44)</td></tr>
</thead>
```
As you can see, in this case the original annotation is wrong with regard to the column header, which does happen quite a lot.
You're right that our algorithm for annotating headers would have only annotated the first row as being part of the header.
However, in creating PubTables-1M we tried to strike a balance between correcting and trusting the original annotations. So we only tried to extend the header to more rows, never reduce it to fewer rows. We do say this in the paper, but maybe we can make this clearer:
> These assumptions enable us to determine if a row or column header is only partially annotated and if so, to extend it to additional columns or rows, respectively.
The reason we only extend the header and never shrink it is that we thought there was value in letting the original annotations override our simple algorithm in cases like these, in case there were edge cases we had not considered. Given that there will always be edge cases, it's not obvious whether there would be more errors or fewer if we let our algorithm always determine the header and override the original annotators.
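To make that concrete with this example, a rough sketch of the policy (the variable names are just for illustration, not from our code):

```python
# "Extend, never shrink": the original annotation is a lower bound.
annotated_header_rows = 4  # the four <tr> rows inside <thead> above
inferred_header_rows = 1   # what our header algorithm would infer here

# We may add header rows the annotators missed, but we never remove
# rows they marked, so the original (wrong) 4-row header survives.
final_header_rows = max(annotated_header_rows, inferred_header_rows)
assert final_header_rows == 4
```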
I hope this explains that example for you. Thanks again for the question!
Hi, thanks for the explanation.
Now I am sure it is the wrong original annotation that leads to such cases. I agree with your strategy for deciding when to trust the original annotations: it avoids introducing more errors and lets data users decide whether to handle such cases themselves.
Looking forward to your answer to the first question.
Thanks again!
Hi, thanks for releasing the PubTables-1M dataset. It took me a lot of time to clean the PubTabNet dataset, and the oversegmentation problem is probably the trickiest part. The release of PubTables-1M not only increases the amount of data but also provides a good solution to the oversegmentation problem.
However, in Algorithm 1, step 10 might lead to problems like:

- `PMC1064102_table_2`: `nonblank cell 1-1 (text: None)`, `blank cell 2-1` and `nonblank cell 3-1 (text: 3 (4)b)` are merged. However, `None` and `3 (4)b` are not semantically coherent and they correspond to different row headers (`Addition` and `Gene`), so we should only merge `nonblank cell 1-1 (text: None)` and `blank cell 2-1`. Likewise, `blank cell 0-0`, `nonblank cell 1-0 (text: Addition...a)`, `blank cell 2-0` and `nonblank cell 3-0 (text: Gene)` are merged, but we should only merge `nonblank cell 1-0 (text: Addition...a)` and `blank cell 2-0`.
- `PMC1064102_table_0`: `nonblank cell 0-0 (text: RNA no.)` and `nonblank cell 1-0 (text: 1)` are mistakenly merged.

Sorry, I cannot upload the images since I am using the company's network.
Column oversegmentation usually occurs in top-aligned spanning cells with one or zero lines of text. Hence, it is helpful to merge a cell (nonblank or blank) with the blank cells below it, but I suspect it is not worthwhile to merge nonblank cells.
Besides, errors caused by step 10 cannot be easily corrected afterwards; maybe it should be removed from the algorithm?
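As a sketch of the more conservative rule I have in mind (the function and the cell representation are hypothetical, not from the released code): a cell may absorb the run of blank cells directly below it in the same column, but two nonblank cells are never merged.

```python
def conservative_merge(column_cells):
    """`column_cells` lists one column's cells from top to bottom; each has
    a 'text' field (empty for a blank cell). Returns groups of row indices
    that may be merged into a single spanning cell."""
    groups, current = [], []
    for row, cell in enumerate(column_cells):
        if cell["text"] and current:
            # A nonblank cell closes the previous group, so two nonblank
            # cells can never end up in the same merge group.
            groups.append(current)
            current = []
        current.append(row)
    if current:
        groups.append(current)
    return groups

# Column 0 of PMC1064102_table_2 from the list above:
col0 = [{"text": ""}, {"text": "Addition...a"}, {"text": ""}, {"text": "Gene"}]
print(conservative_merge(col0))  # [[0], [1, 2], [3]]: only 1-0 and 2-0 merge
```

Under this rule, `nonblank cell 0-0 (text: RNA no.)` and `nonblank cell 1-0 (text: 1)` in `PMC1064102_table_0` would also stay separate.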