Figure <content> not rendered

kermitt2 / grobid

A machine learning software for extracting information from scholarly documents

Apache License 2.0

3.25k stars 436 forks

When I run Grobid on this PDF, the contents of the cells of Table 1 are missing from the output XML.

There seem to be two problems. First, the table is incorrectly classified as a figure. (On its own this wouldn't be a huge problem, because I think I could still correct the XML tags and add the file to a retraining set. With the contents missing entirely, though, that does not seem possible.) Second, the missing cell data is classified as figure "content" and stored in the content attribute of the Figure object, but this attribute is never appended to figureElement in the toTEI method, so it never makes it into the final output.

When I add an if-block that does this appending, the missing contents are included in the output XML:

Figure.java:

public class Figure {
...
    public String toTEI(GrobidAnalysisConfig config, Document doc, TEIFormatter formatter) {
...
        // Proposed addition: emit the captured figure content as a <content> child
        // so that it is no longer dropped from the TEI output.
        if (content != null) {
            Element contentEl = XmlBuilderUtils.teiElement("content",
                LayoutTokensUtil.normalizeText(content.toString()));
            figureElement.appendChild(contentEl);
        }
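
For anyone who wants to see the effect of the null-guarded append without building Grobid, here is a minimal standalone sketch of the same idea using only the JDK's DOM API. Everything in it is an assumption made for illustration: the class `FigureContentSketch`, the method `figureToTei`, the `figDesc` child, and the sample strings are hypothetical, the JDK's `org.w3c.dom` classes stand in for Grobid's `XmlBuilderUtils`, and the whitespace collapsing is only a rough substitute for `LayoutTokensUtil.normalizeText`.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class FigureContentSketch {

    // Hypothetical stand-in for Figure.toTEI (not Grobid's actual code):
    // builds a <figure> element and appends a <content> child only when
    // some content was captured for the figure.
    static String figureToTei(String caption, StringBuilder content) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element figureElement = doc.createElement("figure");
        doc.appendChild(figureElement);

        // Illustrative caption element so the figure is not empty.
        Element desc = doc.createElement("figDesc");
        desc.setTextContent(caption);
        figureElement.appendChild(desc);

        // The guard from the patch: without this block the content is dropped.
        if (content != null) {
            Element contentEl = doc.createElement("content");
            // Assumption: collapsing whitespace approximates normalizeText.
            contentEl.setTextContent(content.toString().replaceAll("\\s+", " ").trim());
            figureElement.appendChild(contentEl);
        }

        // Serialize the DOM tree to a string without the XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // With captured content: the cell text appears in a <content> child.
        System.out.println(figureToTei("Table 1", new StringBuilder("a  b\n1  2")));
        // Without content: no <content> element is emitted.
        System.out.println(figureToTei("Table 1", null));
    }
}
```

The null check keeps the output unchanged for figures that have no captured content, so the change should only affect documents where cell text was previously being lost.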

Figure <content> not rendered #722