ibm-aur-nlp / PubLayNet

Is it possible to get the annotation for each textline? #4

Open chixma opened 4 years ago

chixma commented 4 years ago

Thank you for sharing the dataset! It would be convenient if we could use annotations for each textline, including the corresponding bbox and text, especially for logical layout analysis tasks.

zhxgj commented 4 years ago

For textlines, I think you can use PDFMiner to get them directly from the PDFs. Some post-processing will be needed to curate the textlines, or you can discard the pages with fragmented textlines.
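
A minimal sketch of the suggestion above, assuming the `pdfminer.six` package and its `extract_pages` / `LAParams` API. The helper names `to_top_left_xywh` and `iter_textlines`, the `scale` parameter, and the `char_margin=4.0` value are illustrative assumptions, not settings confirmed in this thread. PDFMiner reports boxes as `(x0, y0, x1, y1)` with the origin at the bottom-left of the page, while COCO-style annotations use `[x, y, width, height]` from the top-left, so the sketch also flips the y axis.

```python
from typing import Iterator, Tuple

BBox = Tuple[float, float, float, float]


def to_top_left_xywh(bbox: BBox, page_height: float, scale: float = 1.0) -> BBox:
    """Convert a PDFMiner (x0, y0, x1, y1) box with a bottom-left origin into
    (x, y, width, height) with a top-left origin.

    `scale` is a hypothetical PDF-point-to-pixel factor; 1.0 keeps PDF points.
    """
    x0, y0, x1, y1 = bbox
    return (
        x0 * scale,
        (page_height - y1) * scale,  # flip the y axis
        (x1 - x0) * scale,
        (y1 - y0) * scale,
    )


def iter_textlines(pdf_path: str, char_margin: float = 4.0) -> Iterator[Tuple[int, BBox, str]]:
    """Yield (page_index, xywh_bbox, text) for every textline PDFMiner finds.

    char_margin=4.0 is an assumed value, chosen only because it is larger
    than PDFMiner's default of 2.0.
    """
    # Imported here so the coordinate helper above works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer, LTTextLine

    laparams = LAParams(char_margin=char_margin)
    for page_index, page in enumerate(extract_pages(pdf_path, laparams=laparams)):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, rules, curves, etc.
            for line in element:
                if isinstance(line, LTTextLine):
                    yield (
                        page_index,
                        to_top_left_xywh(line.bbox, page.height),
                        line.get_text().strip(),
                    )
```

Calling `list(iter_textlines("paper.pdf"))` on a local PDF (a hypothetical path) would give one tuple per textline. Matching these to the page images would still require the correct point-to-pixel `scale`, which is not stated in this thread.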

YueshangGu commented 3 years ago

Hi @zhxgj, thanks for your suggestion. I'm working on joint text detection and page of detection, so I need correct textline bboxes. I have used PDFMiner to parse the PDFs and get annotations for all textlines (including bbox and text). However, I found a mismatch between the textline bboxes and the block-level annotation bboxes, especially when a text block contains only one textline. I also found that the textline bbox parsed by PDFMiner is wrong when the textline contains mathematical symbols (the top edge of the textline bbox is higher than that of the block-level bbox). Could you tell me about the post-processing you used and the LAParams you passed to PDFMiner? Thank you.

zhxgj commented 3 years ago

Hi @YueshangGu, we did apply some post-processing to improve the textlines extracted by PDFMiner. The main step is to merge textlines based on overlap and distance, which handles math symbols better. In LAParams, we also used higher values than the defaults to capture more complete textlines.
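
The merging step described above could be sketched roughly as follows. This is not the maintainers' code: the function names, the greedy single-pass strategy, and the thresholds `min_overlap=0.5` and `max_gap=10.0` are all assumptions made for illustration. Boxes are `(x0, y0, x1, y1)` tuples, and two fragments are merged when they overlap enough vertically and sit close together horizontally.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Line = Tuple[Box, str]


def vertical_overlap_ratio(a: Box, b: Box) -> float:
    """Vertical intersection divided by the height of the shorter box."""
    inter = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(inter, 0.0) / shorter if shorter > 0 else 0.0


def horizontal_gap(a: Box, b: Box) -> float:
    """Horizontal distance between two boxes (0 if they overlap in x)."""
    return max(b[0] - a[2], a[0] - b[2], 0.0)


def merge_textlines(lines: List[Line], min_overlap: float = 0.5,
                    max_gap: float = 10.0) -> List[Line]:
    """Greedily merge fragments that appear to belong to the same textline.

    Assumed thresholds: at least 50% vertical overlap and at most 10 units
    of horizontal gap. A single pass is made, so chains of fragments that
    only connect through a later merge may need a second pass.
    """
    merged: List[Line] = []
    for box, text in sorted(lines, key=lambda item: (item[0][1], item[0][0])):
        for i, (mbox, mtext) in enumerate(merged):
            if (vertical_overlap_ratio(box, mbox) >= min_overlap
                    and horizontal_gap(box, mbox) <= max_gap):
                union = (min(box[0], mbox[0]), min(box[1], mbox[1]),
                         max(box[2], mbox[2]), max(box[3], mbox[3]))
                # Keep the text in left-to-right reading order.
                joined = f"{text} {mtext}" if box[0] < mbox[0] else f"{mtext} {text}"
                merged[i] = (union, joined)
                break
        else:
            merged.append((box, text))
    return merged
```

For example, a fragment `"E ="` at `(10, 100, 50, 112)` and a taller math fragment `"mc^2"` at `(52, 98, 80, 116)` would be merged into one line with the union box `(10, 98, 80, 116)`, while a line further down the page would be left alone.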

YueshangGu commented 3 years ago

@zhxgj Thank you! Would it be possible to release the textline annotations for the train and val sets, or the code used to generate them?