kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Optionally extract images for formulas, figures, tables, etc #438

Open de-code opened 5 years ago

de-code commented 5 years ago

Hi, would extracting images be considered part of the scope of GROBID?

For example, the current extraction of formulas, figures and tables is really bad, as you know. Until we have a more confident extraction, it would be good if images could optionally be provided instead.

(#397 seems to be a related question)

kermitt2 commented 5 years ago

Hi! Yes, I would say it's definitely in the scope of GROBID.

Right now it's possible to crop areas with PDFBox using the generated coordinates, but I would like to add that functionality to pdfalto, because PDFBox is really too slow.

The recognition of formulas, figures and tables is really not that bad in terms of precision (figures are pretty good, you can look at the results at ResearchGate). Recall is an issue (too little training data, for sure). However, as we only recognize these objects and do not structure them further for the moment, just cropping them efficiently as images would be really nice.

As discussed in #397, SVG images were not working well at some point, but I think that should be working again with the new pdfalto.

de-code commented 5 years ago

Yes, I meant the semantic extraction isn't good. The recognition seems to be okay. I think I don't have a good quantitative evaluation for that yet - although for figures and tables I am evaluating the label / description.

Extracting images would be a good start, even if not SVG to start with. At least the information wouldn't get lost.

An alternative would be to render the PDF pages and then cut the area out. Rendering could be done in parallel to parsing if all of the pages were rendered. Not sure how pdfbox or pdfalto would do it.
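A minimal sketch of the "render, then cut the area out" idea, assuming the page has already been rendered to a `BufferedImage` by some renderer (not shown), that the bounding box is given in PDF points (1/72 inch), and that it is measured from the top-left corner of the page. The class name `PageCropper`, the method names and the DPI value are hypothetical and only serve the illustration:

```java
import java.awt.image.BufferedImage;

class PageCropper {
    /** Converts a length in PDF points (1/72 inch) to pixels at the given rendering DPI. */
    static int toPixels(double points, double dpi) {
        return (int) Math.round(points * dpi / 72.0);
    }

    /**
     * Cuts the bounding box (x, y, w, h in points, top-left origin: an assumption)
     * out of a page image that was rendered at the given DPI.
     * The box is clamped to the image; a box entirely outside the page is not handled here.
     */
    static BufferedImage crop(BufferedImage page, double x, double y,
                              double w, double h, double dpi) {
        int px = Math.max(0, toPixels(x, dpi));
        int py = Math.max(0, toPixels(y, dpi));
        int pw = Math.min(toPixels(w, dpi), page.getWidth() - px);
        int ph = Math.min(toPixels(h, dpi), page.getHeight() - py);
        return page.getSubimage(px, py, pw, ph);
    }

    public static void main(String[] args) {
        // Stand-in for a rendered page (roughly A4 at 150 DPI).
        BufferedImage page = new BufferedImage(1240, 1754, BufferedImage.TYPE_INT_RGB);
        BufferedImage figure = crop(page, 72, 144, 288, 216, 150);
        System.out.println(figure.getWidth() + "x" + figure.getHeight()); // prints 600x450
    }
}
```

Since each page image is independent, rendering and cropping could run page by page alongside the parsing, as suggested above.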

kermitt2 commented 5 years ago

There are evaluations of figure/table identification, but only at the figure/table title level, here: https://github.com/kermitt2/grobid/blob/master/grobid-trainer/doc/PMC_sample_1943.results.grobid-0.5.5-Glutton-17.05.2019#L236

Evaluation on description and content of the figure and table is hard with PMC because the encoding is not stable from one NLM/JATS document to another. Even for the title, sometimes the label/figure number is added or not, it's not very consistent, so the real accuracy is likely better.

I think that the information is never lost in the sense that you can get the coordinates for all relevant elements in the resulting TEI, and further crop the pdf with another tool (like pdfcrop command line).

If you want the embedded images and SVG, that's already implemented in GROBID: it will extract everything as complementary files (with external pointers in the figures, if I remember well), except if ignoreAssets is used. For the service, see https://github.com/kermitt2/grobid/issues/362#issuecomment-447518610

Cropping the PDF - which is what should be done here - can be done easily with PDFBox. We simply need to add something along these lines to the classes in the visualization subpackage (x,y,h,w being the bounding box coordinates of the object of choice):

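            // x, y, w, h: bounding box of the object to crop.
            // Note: default PDF user space has its origin at the bottom-left of the page,
            // so a y measured from the top of the page must be flipped first.
            PDRectangle rectangle = new PDRectangle();
            rectangle.setLowerLeftX(x);
            rectangle.setLowerLeftY(y + h);
            rectangle.setUpperRightX(x + w);
            rectangle.setUpperRightY(y);

            // Restrict the visible area of the page to the bounding box.
            page.setMediaBox(rectangle);
            page.setCropBox(rectangle);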

But the problem with PDFBox is that it's super slow, and it would obliterate all the other efforts in grobid to do something fast and scalable :)
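One detail worth making explicit when building the crop rectangle: default PDF user space has its origin at the bottom-left of the page with y growing upward. If the bounding box is measured from the top-left corner of the page (an assumption here, not stated in this thread), the y values need to be flipped using the page height. A small sketch of that conversion in plain Java, with a hypothetical `CropBox` helper whose four values could then be cast to float and passed to the `PDRectangle` setters shown above:

```java
final class CropBox {
    final double lowerLeftX, lowerLeftY, upperRightX, upperRightY;

    CropBox(double llx, double lly, double urx, double ury) {
        this.lowerLeftX = llx;
        this.lowerLeftY = lly;
        this.upperRightX = urx;
        this.upperRightY = ury;
    }

    /**
     * Converts a bounding box with a top-left origin (assumption) into
     * bottom-left-origin PDF coordinates, given the page height in the same unit.
     */
    static CropBox fromTopLeft(double x, double y, double w, double h, double pageHeight) {
        return new CropBox(
                x,                      // left edge is unchanged
                pageHeight - (y + h),   // bottom edge, measured from the page bottom
                x + w,                  // right edge
                pageHeight - y);        // top edge, measured from the page bottom
    }

    public static void main(String[] args) {
        // US Letter page height in points; the box values are made up for the example.
        CropBox box = fromTopLeft(72, 144, 288, 216, 792);
        System.out.println(box.lowerLeftX + ", " + box.lowerLeftY + ", "
                + box.upperRightX + ", " + box.upperRightY); // prints 72.0, 432.0, 360.0, 648.0
    }
}
```

This only covers the coordinate arithmetic; it does not address the speed concern raised above, and pages with a rotated or shifted media box would need extra handling.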