kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Figure <content> not rendered #722

Open awahl1 opened 3 years ago

awahl1 commented 3 years ago

When I run Grobid on this PDF, the contents of the cells of Table 1 are missing from the output XML.

The first problem seems to be that the table is incorrectly classified as a figure. (On its own this wouldn't be a huge problem, because I think I could still correct the XML tags and add the file to a retraining set. With the contents missing entirely, however, that does not seem possible.) The second problem is that the missing cell data is classified as figure "content" and stored in the content attribute of the Figure object, but this attribute is never appended to figureElement in the toTEI method, so it never makes it into the final output.

When I add an if-block to do this appending, the missing contents do get included in the output XML:

Figure.java:

public class Figure {
...
    public String toTEI(GrobidAnalysisConfig config, Document doc, TEIFormatter formatter) {
...
        if (content != null) {
            Element contentEl = XmlBuilderUtils.teiElement("content",
                LayoutTokensUtil.normalizeText(content.toString()));
            figureElement.appendChild(contentEl);
        }

kermitt2 commented 3 years ago

Hello @awahl1 !

Tables are represented in the XML as <figure> with attribute @type="table" (it's TEI, not my fault :) ).
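To make the distinction concrete, here is a minimal sketch of how one could separate tables from real figures when reading the TEI output. It uses only the JDK's built-in DOM parser, not Grobid's own classes. The class name `TeiFigureLister`, the method `listHeads`, and the sample XML are made up for illustration; the sample is hand-written and is not real Grobid output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Grobid.
class TeiFigureLister {
    // Namespace used by TEI documents.
    static final String TEI_NS = "http://www.tei-c.org/ns/1.0";

    // Returns the <head> text of every <figure>. With tablesOnly=true only
    // figures carrying type="table" are kept; with false, only the others.
    static List<String> listHeads(String teiXml, boolean tablesOnly) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(teiXml)));
            NodeList figures = doc.getElementsByTagNameNS(TEI_NS, "figure");
            List<String> heads = new ArrayList<>();
            for (int i = 0; i < figures.getLength(); i++) {
                Element figure = (Element) figures.item(i);
                // getAttribute returns "" when the attribute is absent.
                boolean isTable = "table".equals(figure.getAttribute("type"));
                if (isTable != tablesOnly) {
                    continue;
                }
                NodeList head = figure.getElementsByTagNameNS(TEI_NS, "head");
                heads.add(head.getLength() > 0
                        ? head.item(0).getTextContent().trim()
                        : "");
            }
            return heads;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse TEI XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample, assumed to resemble the structure described above.
        String sample =
                "<TEI xmlns=\"http://www.tei-c.org/ns/1.0\"><text><body>"
              + "<figure type=\"table\"><head>Table 1</head></figure>"
              + "<figure><head>Figure 1</head></figure>"
              + "</body></text></TEI>";
        System.out.println("tables:  " + listHeads(sample, true));
        System.out.println("figures: " + listHeads(sample, false));
    }
}
```

So a table that looks like it was "classified as a figure" may simply be a `<figure>` element whose `type` attribute needs checking.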

For a figure (a real figure, not a table), we normally don't have content, only graphic objects (bitmaps in PNG and/or vector graphics).

For a table, the content is stored in this content field, and a table analysis method tries to build a table structure from it.

In your example there is apparently an error: Table 1 and Table 2 are recognized as a single table, so all the content of Table 1 goes to Table 2, and the caption of Table 1 is recognized as a paragraph.
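One rough way to spot this kind of error in a batch of outputs is to look for body paragraphs that start like a caption. The sketch below is only a heuristic under stated assumptions: the class `CaptionParagraphCheck`, the method `suspiciousParagraphs`, and the regular expression are invented for illustration and are not part of Grobid, and the code uses the JDK's DOM parser rather than Grobid's own classes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical diagnostic, not part of Grobid.
class CaptionParagraphCheck {
    static final String TEI_NS = "http://www.tei-c.org/ns/1.0";

    // Assumption: captions start with "Table N" or "Figure N" / "Fig. N".
    static final Pattern CAPTION_START =
            Pattern.compile("^(Table|Figure|Fig\\.)\\s+\\d+", Pattern.CASE_INSENSITIVE);

    // Returns the text of every <p> whose beginning looks like a caption.
    static List<String> suspiciousParagraphs(String teiXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(teiXml)));
            NodeList paragraphs = doc.getElementsByTagNameNS(TEI_NS, "p");
            List<String> hits = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                String text = paragraphs.item(i).getTextContent().trim();
                if (CAPTION_START.matcher(text).find()) {
                    hits.add(text);
                }
            }
            return hits;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse TEI XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample, not real Grobid output.
        String sample =
                "<TEI xmlns=\"http://www.tei-c.org/ns/1.0\"><text><body>"
              + "<p>Table 1. Corpus statistics</p>"
              + "<p>We evaluate on two corpora.</p>"
              + "</body></text></TEI>";
        suspiciousParagraphs(sample).forEach(System.out::println);
    }
}
```

This will also flag ordinary sentences such as "Table 1 shows ...", so the hits are candidates for manual review rather than confirmed errors.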

Figure and table recognition is not good currently, but I hope to improve that part in the coming months.