kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Consider integrating Tabula #340

Open de-code opened 6 years ago

de-code commented 6 years ago

Tabula seems to be actively developed and doesn't perform badly (I haven't done a quantitative analysis yet).

Would you consider integrating it?

kermitt2 commented 6 years ago

Hi Daniel !

Yes, we (Vincent from CDS Strasbourg @Vi-dot and me) studied the integration of Tabula.

We started with the idea of integrating the ContentMine module for tables, but we failed to find a reliable way to convert PDF areas (the areas identified as tables by GROBID) into SVG.

We then explored tabula-java. Unfortunately, looking at the code, we saw that it is completely intertwined with pdfbox for all its internal data structures. Concretely, this means either:

That's our current state of thought about internal table structure parsing...

de-code commented 6 years ago

Hi Patrice,

It is great that you two already had a look at it.

Regarding ContentMine, did you have success converting tables with it in general? It seemed a good candidate initially but may not be as actively developed (the last time I tried it, on PKP's coaction dataset, the heuristics didn't detect words correctly and there was no table output, but I may have used it incorrectly). In my prototype processing, I was converting lxml (the output of your pdftoxml) to SVG, but I haven't tried feeding it into ContentMine. I'm curious which difficulties you encountered. Was it detecting the correct table area?

For the table integration, I guess it depends on how much time you have to keep maintaining it going forward. It might be okay for it to be a bit slower in that case (it could be worth having a configuration option to turn it on or off, as not everyone will be interested in tables). It might also be good to measure whether GROBID or Tabula is better at detecting tables or table areas.

Some potential options to consider:

I guess that would make it easier to switch if it turns out another library becomes better at extracting tables.

Vitaliy-1 commented 5 years ago

Hi @kermitt2,

I've worked with the tabula branch. For better results, their Spreadsheet algorithm is more suitable for table extraction than the basic one: https://github.com/kermitt2/grobid/blob/tabula/grobid-core/src/main/java/org/grobid/core/data/Table.java#L154 The results are not bad: https://github.com/Vitaliy-1/GrobidSamples/blob/master/tabular.tei.xml#L607-L636 although tabula sometimes treats the table title as part of the table content. I saw a possible fix, but it doesn't improve the parsing results. I think it's rather a tabula issue with table detection.

I see that most of tabula's code handles table detection, which is already done by Grobid, rather than table extraction. Have you considered extracting tables from the PDFAlto output directly? With some additional checks, TextLine elements could help identify table cells, and their positioning could help identify table rows.

kermitt2 commented 5 years ago

Hi @Vitaliy-1 !

Nice integration and results indeed! Do we have some evaluation data for validating one algorithm against another? (as compared to the basic one integrated by @Vi-dot for instance)

What would be the gain of adding the table extraction in pdfalto directly? There is no explicit ALTO structure for handling tables, and all the useful information should be available in GROBID anyway.

I think the table "parsing" part covered by grobid would benefit a lot from more training data, as there is really very little for the table model. The fulltext model is also weak in terms of examples of tables, but before adding more training data to this model, I have planned to experiment with the new reading order of pdfalto and update the existing training data to this "new order".

Thanks a lot for your contribution, I think it would be already a great addition to have this kind of results integrated in GROBID.

Vitaliy-1 commented 5 years ago

It's not a problem to process articles containing tables for the evaluation of these two tabula algorithms. I also want to experiment with detecting rows and cells from the pdfalto output. The gain here would be processing speed and the opportunity not to rely on third-party software. I think adding information to the LayoutToken that is the last one in a TextBlock could make it easier to determine the end of a row.

The branch with tabula integration: https://github.com/Vitaliy-1/grobid/tree/core_tabula

Vitaliy-1 commented 5 years ago

Hi @kermitt2,

I've looked through the PDFAlto output contained in LayoutToken and implemented an algorithm based on line breaks and token positioning. After testing, however, I noticed that it may not be enough for some cases, e.g., when rows are not parsed consecutively, line by line, by PDFAlto/Xpdf. The Tabula lattice algorithm usually fails to recognize the table in such cases too. It's probably related to how the PDF was created.

So, if tabula turns out not to be enough because of low accuracy, I think that first applying a kind of sorting, taking into account each token's horizontal and vertical position and the optimal distance between rows (plus or minus some margin), should at least partially solve the problem for these cases.
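The sorting idea above could be sketched roughly as follows. This is a minimal illustration in Python, not the code from the branch: the `Token` class, the `group_into_rows` function and the `margin_ratio` parameter are hypothetical names, and deriving the margin from the median token height is only one possible choice.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Token:
    text: str
    x: float       # horizontal position (HPOS in ALTO)
    y: float       # vertical position (VPOS in ALTO)
    height: float  # HEIGHT in ALTO


def group_into_rows(tokens, margin_ratio=0.5):
    """Cluster tokens into rows by vertical position, then order each row left to right.

    margin_ratio is a hypothetical tuning knob: a token joins the current row
    when its y differs from the row's first token by at most
    margin_ratio * median token height.
    """
    if not tokens:
        return []
    margin = margin_ratio * median(t.height for t in tokens)
    rows = []
    for token in sorted(tokens, key=lambda t: (t.y, t.x)):
        # Compare against the first token of the current row to avoid drift.
        if rows and abs(token.y - rows[-1][0].y) <= margin:
            rows[-1].append(token)
        else:
            rows.append([token])
    return [sorted(row, key=lambda t: t.x) for row in rows]


# Illustrative tokens that arrive column by column instead of row by row.
tokens = [
    Token("1", 101.1, 121.305, 6.96),
    Token("2", 101.1, 130.485, 6.96),
    Token("1a", 153.36, 121.307, 6.96),
    Token("1b", 153.12, 130.487, 6.96),
]
rows = [[t.text for t in row] for row in group_into_rows(tokens)]
print(rows)  # [['1', '1a'], ['2', '1b']]
```

A fixed margin like this would probably break on cells whose content spans several lines, so it is only a starting point.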

kermitt2 commented 5 years ago

Hello!

Are you referring to an end-of-line after each cell's content? So, instead of having the cell contents of a row one after another on the same line, we have one cell content per line? I've seen this case quite frequently, and it's due to the PDF stream. However, we could certainly correct that in pdfalto based on the fact that all these tokens have the same baseline. It's one of the improvements related to better handling of reading order.

That would then make it easier to group tokens in GROBID. This would be less dependent on the particular output of pdfalto, and if the ALTO file corresponding to the fixed layout comes from another tool (OCR, or maybe a future docx-to-ALTO converter), we can certainly apply the same table structuring in GROBID to all of them.

Vitaliy-1 commented 5 years ago

Yes, what I've noticed is that an end-of-line occurs after each cell, or after each line of the cell content if the cell content has several lines. Then the next cell in the row is parsed. This is good because it helps to determine cells, and it's actually what I've used: https://github.com/Vitaliy-1/grobid/blob/core_cells/grobid-core/src/main/java/org/grobid/core/data/Table.java#L267

I have very rarely encountered situations where two cells are not delimited by an end-of-line symbol. In fact, I saw it only once, with a PDF created from DOCX. More often, rather than moving to the next cell in the row, the parser outputs the content of the next cell in the column. This is the case, for example, for the PDF of this article: https://www.banglajol.info/index.php/JPharma/article/view/228 The output, starting from the second row, looks like this:

<TextBlock ID="p4_b10" HPOS="101.1" VPOS="121.305" HEIGHT="80.4602" WIDTH="3.99">
    <TextLine WIDTH="3.99" HEIGHT="6.95856" ID="p4_t13" HPOS="101.1" VPOS="121.305">
        <String ID="p4_w21" CONTENT="1" HPOS="101.1" VPOS="121.305" WIDTH="3.99" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="3.99" HEIGHT="6.95856" ID="p4_t14" HPOS="101.1" VPOS="130.485">
        <String ID="p4_w22" CONTENT="2" HPOS="101.1" VPOS="130.485" WIDTH="3.99" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="3.99" HEIGHT="6.95856" ID="p4_t15" HPOS="101.1" VPOS="139.666">
        <String ID="p4_w23" CONTENT="3" HPOS="101.1" VPOS="139.666" WIDTH="3.99" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="3.99" HEIGHT="6.95856" ID="p4_t16" HPOS="101.1" VPOS="148.846">
        <String ID="p4_w24" CONTENT="4" HPOS="101.1" VPOS="148.846" WIDTH="3.99" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="3.99" HEIGHT="6.95856" ID="p4_t17" HPOS="101.1" VPOS="158.086">
        <String ID="p4_w25" CONTENT="5" HPOS="101.1" VPOS="158.086" WIDTH="3.99" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="3.99" HEIGHT="6.95856" ID="p4_t18" HPOS="101.1" VPOS="167.266">
        <String ID="p4_w26" CONTENT="6" HPOS="101.1" VPOS="167.266" WIDTH="3.99" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="3.99" HEIGHT="6.95856" ID="p4_t19" HPOS="101.1" VPOS="176.446">
        <String ID="p4_w27" CONTENT="7" HPOS="101.1" VPOS="176.446" WIDTH="3.99" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="3.99" HEIGHT="6.95856" ID="p4_t20" HPOS="101.1" VPOS="185.626">
        <String ID="p4_w28" CONTENT="8" HPOS="101.1" VPOS="185.626" WIDTH="3.99" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="3.99" HEIGHT="6.95856" ID="p4_t21" HPOS="101.1" VPOS="194.807">
        <String ID="p4_w29" CONTENT="9" HPOS="101.1" VPOS="194.807" WIDTH="3.99" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
</TextBlock>
<TextBlock ID="p4_b11" HPOS="153.12" VPOS="121.307" HEIGHT="80.4602" WIDTH="8.01032">
    <TextLine WIDTH="7.54748" HEIGHT="6.95856" ID="p4_t22" HPOS="153.36" VPOS="121.307">
        <String ID="p4_w30" CONTENT="1a" HPOS="153.36" VPOS="121.307" WIDTH="7.54748" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="8.01032" HEIGHT="6.95856" ID="p4_t23" HPOS="153.12" VPOS="130.487">
        <String ID="p4_w31" CONTENT="1b" HPOS="153.12" VPOS="130.487" WIDTH="8.01032" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="7.54748" HEIGHT="6.95856" ID="p4_t24" HPOS="153.36" VPOS="139.667">
        <String ID="p4_w32" CONTENT="1c" HPOS="153.36" VPOS="139.667" WIDTH="7.54748" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="8.01032" HEIGHT="6.95856" ID="p4_t25" HPOS="153.12" VPOS="148.848">
        <String ID="p4_w33" CONTENT="1d" HPOS="153.12" VPOS="148.848" WIDTH="8.01032" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="7.54748" HEIGHT="6.95856" ID="p4_t26" HPOS="153.36" VPOS="158.088">
        <String ID="p4_w34" CONTENT="1e" HPOS="153.36" VPOS="158.088" WIDTH="7.54748" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="6.68397" HEIGHT="6.95856" ID="p4_t27" HPOS="153.78" VPOS="167.268">
        <String ID="p4_w35" CONTENT="1f" HPOS="153.78" VPOS="167.268" WIDTH="6.68397" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="8.01032" HEIGHT="6.95856" ID="p4_t28" HPOS="153.12" VPOS="176.448">
        <String ID="p4_w36" CONTENT="1g" HPOS="153.12" VPOS="176.448" WIDTH="8.01032" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="8.01032" HEIGHT="6.95856" ID="p4_t29" HPOS="153.12" VPOS="185.628">
        <String ID="p4_w37" CONTENT="1h" HPOS="153.12" VPOS="185.628" WIDTH="8.01032" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
    <TextLine WIDTH="6.22528" HEIGHT="6.95856" ID="p4_t30" HPOS="154.02" VPOS="194.809">
        <String ID="p4_w38" CONTENT="1i" HPOS="154.02" VPOS="194.809" WIDTH="6.22528" HEIGHT="6.95856"
                STYLEREFS="font6"/>
    </TextLine>
</TextBlock>

So here the parsing is done on a per-column basis rather than row by row. This is where my current algorithm and (often) the Tabula lattice algorithm fail. Regarding the former, I thought that prior sorting could solve the problem.
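As a rough illustration of how such column-wise output might be detected before any re-sorting, here is a small Python sketch. It is only a heuristic under stated assumptions: the fragment is a shortened copy of the output above, wrapped in a made-up `<Page>` root and without an XML namespace (a real pdfalto file would need namespace-aware lookups), and `read_lines`, `looks_column_ordered` and `tolerance` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the ALTO output above; the <Page> wrapper is an assumption
# added so that the fragment has a single root element.
ALTO_FRAGMENT = """
<Page>
  <TextBlock ID="p4_b10" HPOS="101.1" VPOS="121.305" HEIGHT="80.4602" WIDTH="3.99">
    <TextLine WIDTH="3.99" HEIGHT="6.95856" ID="p4_t13" HPOS="101.1" VPOS="121.305">
      <String ID="p4_w21" CONTENT="1" HPOS="101.1" VPOS="121.305" WIDTH="3.99" HEIGHT="6.95856"/>
    </TextLine>
    <TextLine WIDTH="3.99" HEIGHT="6.95856" ID="p4_t14" HPOS="101.1" VPOS="130.485">
      <String ID="p4_w22" CONTENT="2" HPOS="101.1" VPOS="130.485" WIDTH="3.99" HEIGHT="6.95856"/>
    </TextLine>
  </TextBlock>
  <TextBlock ID="p4_b11" HPOS="153.12" VPOS="121.307" HEIGHT="80.4602" WIDTH="8.01032">
    <TextLine WIDTH="7.54748" HEIGHT="6.95856" ID="p4_t22" HPOS="153.36" VPOS="121.307">
      <String ID="p4_w30" CONTENT="1a" HPOS="153.36" VPOS="121.307" WIDTH="7.54748" HEIGHT="6.95856"/>
    </TextLine>
    <TextLine WIDTH="8.01032" HEIGHT="6.95856" ID="p4_t23" HPOS="153.12" VPOS="130.487">
      <String ID="p4_w31" CONTENT="1b" HPOS="153.12" VPOS="130.487" WIDTH="8.01032" HEIGHT="6.95856"/>
    </TextLine>
  </TextBlock>
</Page>
"""


def read_lines(xml_text):
    """Return (text, hpos, vpos) for each TextLine, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iter("TextLine"):
        text = " ".join(s.get("CONTENT", "") for s in line.iter("String"))
        lines.append((text, float(line.get("HPOS")), float(line.get("VPOS"))))
    return lines


def looks_column_ordered(lines, tolerance=1.0):
    """Heuristic: the reading order jumps back up the page while moving right."""
    for (_, x0, y0), (_, x1, y1) in zip(lines, lines[1:]):
        if y1 < y0 - tolerance and x1 > x0 + tolerance:
            return True
    return False


lines = read_lines(ALTO_FRAGMENT)
print([text for text, _, _ in lines])  # ['1', '2', '1a', '1b']
print(looks_column_ordered(lines))     # True
```

A check like this only makes sense inside an area already identified as a table, since an ordinary two-column page layout would trigger it as well.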

Also, there can be situations where differences in the vertical positioning of cell content (I mean the first line of a cell) affect the parsing order. But that's more or less manageable.

Hope it makes sense :)

kermitt2 commented 5 years ago

Yes, it's very clear and in line with what I observed. I will dive back into pdfalto this weekend to see how to refine the end-of-line event in these cases, and to exploit the spatial layout more than the PDF stream.