kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Question about identifying table content #514

Open jribault opened 4 years ago

jribault commented 4 years ago

Hi,

I'm trying to identify patents (with country, number, date, and so on; I have my own model). It works pretty well on running text, but not very well when patents are listed in a table. Do you have any advice on table recognition? Should I preprocess the text to remove the tables? Should I modify the template?

I'm a bit stuck, so any advice is welcome :)

Sunnycheey commented 4 years ago

My idea: first detect the coordinates of the tables with GROBID, then use another tool such as tabula to process the desired regions.
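
A minimal sketch of the conversion step between the two tools, in Python. It rests on a few assumptions that should be checked against your GROBID version: that the TEI was produced with coordinates requested for figures, that tables appear as `<figure type="table">` elements with a `coords` attribute, and that each box in `coords` is `page,x,y,width,height` in PDF points with the origin at the top-left corner. The helper names `coords_to_tabula_areas` and `table_areas_from_tei` are made up for this example.

```python
import xml.etree.ElementTree as ET

# TEI namespace used in GROBID output (assumption: standard TEI namespace).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def coords_to_tabula_areas(coords):
    """Turn a coords string into a list of (page, [top, left, bottom, right]).

    Assumes each box is "page,x,y,width,height" and boxes are separated by ";".
    The [top, left, bottom, right] order is the one tabula expects for an area.
    """
    areas = []
    for box in coords.split(";"):
        if not box.strip():
            continue  # skip empty fragments, e.g. after a trailing ";"
        page, x, y, w, h = (float(v) for v in box.split(",")[:5])
        areas.append((int(page), [y, x, y + h, x + w]))
    return areas


def table_areas_from_tei(tei_xml):
    """Collect the areas of every <figure type="table"> that carries coords."""
    root = ET.fromstring(tei_xml)
    areas = []
    for fig in root.iterfind(".//tei:figure[@type='table']", TEI_NS):
        coords = fig.get("coords")
        if coords:
            areas.extend(coords_to_tabula_areas(coords))
    return areas


if __name__ == "__main__":
    # Hand-written TEI fragment for illustration, not real GROBID output.
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
        '<figure type="table" coords="3,72.0,100.5,450.0,120.0"/>'
        "</body></text></TEI>"
    )
    for page, area in table_areas_from_tei(sample):
        print(page, area)
```

Each `(page, area)` pair could then be handed to tabula, for example with tabula-py as `tabula.read_pdf(pdf_path, pages=page, area=area)`. If the extracted regions look shifted or cropped, the coordinate assumptions above are the first thing to verify.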