sohaib023 / Tab-Aug

Code for augmenting document table images.

About the format of XML files #1

Open BangdongChen opened 3 years ago

BangdongChen commented 3 years ago

Hi, at line 98 of _generatesamples.py, I found that doc is empty after _Document(xmlfile). Is something wrong with the format of my XML files? Could you share an XML file as an example?

sohaib023 commented 3 years ago

There may be some problems with the code. I'd advise you to use version 1.0 of the truthpy package, since later changes to truthpy were not backwards compatible. Apart from that, please share the XML file here and I can take a look.
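A quick way to confirm which truthpy release is installed before debugging further is sketched below. This is only an illustration: it assumes the installed distribution is named `truthpy` and that its version string starts with `1.0`, and the helper names `installed_version` and `is_pinned_release` are hypothetical, not part of truthpy or Tab-Aug.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_pinned_release(version, wanted="1.0"):
    """True when `version` is exactly `wanted` or a patch release of it."""
    if version is None:
        return False
    return version == wanted or version.startswith(wanted + ".")


if __name__ == "__main__":
    # "truthpy" is assumed to be the distribution name used at install time.
    found = installed_version("truthpy")
    if is_pinned_release(found):
        print("truthpy", found, "matches the recommended 1.0 release")
    else:
        print("truthpy version is", found, "- consider installing 1.0")
```

If the check reports a newer release, reinstalling the 1.0 release in a fresh virtual environment is the least invasive way to rule out the compatibility problem.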

sohaib023 commented 3 years ago

I have attached an example ground truth file that you can use. GitHub did not allow me to upload XML, so I changed the extension to .txt. Do let me know if that clears up the issue. us-005_0_0.txt
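To compare your own ground truth file against the attached example, it can help to check that the file parses at all and to list which element tags it contains. The sketch below uses only the Python standard library and does not rely on truthpy; the function name `summarize_xml` and the placeholder tags in the sample string are hypothetical and do not reflect the actual ground truth schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(xml_text):
    """Parse XML text and return (root tag, Counter of every element tag)."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    counts = Counter(elem.tag for elem in root.iter())
    return root.tag, counts


# Placeholder document: these tags are NOT the real ground truth format.
sample = "<root><item/><item/><other/></root>"

root_tag, tag_counts = summarize_xml(sample)
print(root_tag)
for tag, count in sorted(tag_counts.items()):
    print(tag, count)
```

Running the same function on the text of both files (for example via `open(path, encoding="utf-8").read()`) and comparing the two summaries shows quickly whether an expected element is missing or named differently.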

andyjiang1116 commented 1 year ago

I want to know how to generate the corresponding XML file if I use the PubTabNet dataset.

sohaib023 commented 1 year ago

@andyjpaddle I don't think you can convert the PubTabNet annotations to the XML files required by this repo. I believe PubTabNet only has HTML annotations, i.e. there is no annotation of the locations of the rows and columns in the image (which Tab-Aug needs). The only solution would be to manually annotate the images using T-Truth.

andyjiang1116 commented 1 year ago

Most table recognition datasets are labeled with text bounding boxes and structure tokens, but Tab-Aug needs cell bounding boxes, so it cannot be applied to them directly. Is there a plan to support public datasets (e.g. PubTabNet)?