kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Adding support for other PDF to XML engines #33

Open philgooch opened 10 years ago

philgooch commented 10 years ago

I'm adding support for PDFMiner (https://github.com/euske/pdfminer) and PDFBox (https://pdfbox.apache.org) in my fork. So far, all the tests still pass when using PDFMiner. Will issue a pull request after further testing and cleanup.

It may be an idea to decouple the PDF extraction entirely, and have the option to import raw XML or JSON directly (e.g. created by a previous process, or from a filestore).
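
A minimal sketch of what such a decoupled input layer could look like; none of these types exist in Grobid, they are purely hypothetical illustrations of the idea:

```java
// Hypothetical sketch of a decoupled document-input layer.
// None of these types are part of Grobid; they only illustrate feeding
// pre-extracted content in, instead of a raw PDF.
import java.util.List;

interface DocumentSource {
    /** Stream of positioned tokens, whatever tool produced them. */
    List<PositionedToken> tokens();
}

/** Token carrying the layout information downstream models rely on. */
record PositionedToken(String text, int page,
                       double x, double y, double width, double height,
                       String fontName, double fontSize) {}

/** Adapter for pre-extracted XML/JSON from a previous process or filestore. */
class PreExtractedSource implements DocumentSource {
    private final List<PositionedToken> parsed;
    PreExtractedSource(List<PositionedToken> parsed) { this.parsed = parsed; }
    public List<PositionedToken> tokens() { return parsed; }
}
```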

de-code commented 7 years ago

Has there ever been a PR for this?

kermitt2 commented 7 years ago

Yes, although I can't find it any more.

The problem was that the training data depends on the PDF extraction library, so I think it is not possible to decouple the PDF extraction part. PDF extraction libraries stream the PDF elements differently (e.g. following the raw PDF element order, recomposing a reading order, etc.).

The consequence would have been to manage a different training set for each PDF extraction library, or to find a way to realign the extractions into a kind of normal form, and I am not sure that is doable.
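
To make the alignment problem concrete, here is a toy illustration (the token streams are made up): a training annotation stored as token offsets against one library's output points at the wrong tokens in another library's output of the same page.

```java
import java.util.List;

public class AlignmentProblem {
    public static void main(String[] args) {
        // The same two-column page emitted by two hypothetical libraries:
        // library A follows the raw PDF content-stream order (right column first),
        // library B recomposes a left-to-right, top-to-bottom reading order.
        List<String> libraryA = List.of("Results", "were", "good", "Methods", "were", "sound");
        List<String> libraryB = List.of("Methods", "were", "sound", "Results", "were", "good");

        // A training annotation created on library A's stream, stored as
        // token offsets: tokens 0-2 labelled as one section.
        int start = 0, end = 2;
        System.out.println("Annotation on A: " + libraryA.subList(start, end + 1));
        System.out.println("Same offsets on B: " + libraryB.subList(start, end + 1));
        // Prints different tokens: offset-based labels do not survive a change
        // of extraction library, hence the need to retrain or realign.
    }
}
```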

I like the idea, however, so I'll keep the issue open :)

de-code commented 7 years ago

It appears some people spend years researching just one aspect of PDF conversion, like text block detection, proper reading order, formulas, etc. It will be difficult for one tool to cover it all perfectly. It would be good if researchers could improve one piece of the puzzle in the tool chain.

In this case, for example, Grobid could focus on extracting references and perhaps even some figures. Could the input not be something like the output of pdf2xml?

kermitt2 commented 7 years ago

I think you're mixing two different aspects a bit. Developers/integrators can use different tools like Grobid to extract complementary or similar structures, and merge these outputs according to their requirements. So they can take advantage of different specialised tools in a flexible manner, without putting heavy integration requirements on each tool (requirements that will in practice never be implemented by research prototypes).

Chaining tools is much more complicated, because each tool might rely on different intermediary processes and data structures adapted to its approach. Of course the idea is nice on paper, but in practice it creates more problems than it solves. For instance, in the case of PDF extraction libraries, as the output differs from one library to another, either we create complicated transformation/normalisation layers (which I don't know how to implement), and/or we use the lowest common denominator of all the possible inputs and we would lose many interesting features that help structuring (so we lose performance).
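
As a hypothetical sketch of the "lowest common denominator" trade-off: the shared input model can only keep what every library provides, and the layout features the structuring models exploit disappear with it.

```java
// Hypothetical models, only to illustrate the trade-off described above.

/** What every PDF extraction library can reliably emit. */
record MinimalToken(String text, int page) {}

/** What a specific library (e.g. pdf2xml/xpdf) additionally provides,
 *  and what Grobid's feature engineering actually exploits. */
record RichToken(String text, int page,
                 double x, double y, double width, double height,
                 String fontName, double fontSize,
                 boolean bold, boolean italic) {}

// Restricting the input contract to MinimalToken makes every library
// pluggable, but font and position features (strong signals for titles,
// section headers, references, etc.) disappear, and accuracy with them.
```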

In addition, as another practical aspect, there are today two libraries which are superior to the others in terms of robustness and speed (muPDF and xpdf), so supporting all the other libraries is more of an academic exercise, more over-engineering than practical benefit.

de-code commented 7 years ago

I can see how I could merge Grobid's output with the output of other tools, say something that looks at formulas (as long as there is enough information to link them back, e.g. where the formula should be placed within the Grobid-extracted text).
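
A rough sketch of such a merge, assuming both tools report bounding boxes in the same page coordinate system (Grobid can optionally output coordinates in its TEI results); all types here are hypothetical:

```java
// Rough sketch of merging by page coordinates; types are made up.
record Box(int page, double x, double y, double w, double h) {
    /** Overlap test between two axis-aligned boxes on the same page. */
    boolean overlaps(Box o) {
        return page == o.page
            && x < o.x + o.w && o.x < x + w
            && y < o.y + o.h && o.y < y + h;
    }
}

class Merger {
    /** Attach a formula box (from another tool) to the Grobid paragraph
     *  whose box overlaps it, if any. */
    static int findHostParagraph(Box formula, java.util.List<Box> paragraphs) {
        for (int i = 0; i < paragraphs.size(); i++) {
            if (paragraphs.get(i).overlaps(formula)) return i;
        }
        return -1; // no overlap: keep the formula as floating content
    }
}
```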

But as in https://github.com/kermitt2/grobid/issues/174, there are paragraph / reading order detection issues (among others), and I am not sure how best to solve them. At the level of Grobid's output it would be difficult to detect and fix them; I think that should happen either at the input or within Grobid. Doing it at the input seems reasonable, as it would allow other tools to focus on that.

I guess one way for me to achieve that would be to create a new PDF with a simple reading order and easily identifiable spacing. The advantage is that this is transparent to Grobid. The disadvantage is that it is less easy to inspect. And Grobid could probably be simpler if it didn't have to deal with PDFs.
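
A rough sketch of this idea using PDFBox (mentioned earlier in the thread): extract the text in an approximate reading order with PDFTextStripper, then re-emit it as a trivial single-column PDF. This is an untested illustration of the approach, not a working pipeline:

```java
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.text.PDFTextStripper;

public class SimplifyPdf {
    public static void main(String[] args) throws Exception {
        // 1. Extract text in an approximate reading order (PDFBox 2.x API).
        String text;
        try (PDDocument in = PDDocument.load(new File(args[0]))) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true); // sort glyphs by position
            text = stripper.getText(in);
        }
        // 2. Re-emit it as a trivial single-column PDF
        //    (single page only; pagination omitted for brevity).
        try (PDDocument out = new PDDocument()) {
            PDPage page = new PDPage();
            out.addPage(page);
            try (PDPageContentStream cs = new PDPageContentStream(out, page)) {
                cs.beginText();
                cs.setFont(PDType1Font.HELVETICA, 10);
                cs.setLeading(12);
                cs.newLineAtOffset(50, 750);
                for (String line : text.split("\\r?\\n")) {
                    cs.showText(line); // note: fails on non-WinAnsi characters
                    cs.newLine();
                }
                cs.endText();
            }
            out.save(new File(args[1]));
        }
    }
}
```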

I don't really mind using muPDF's or xpdf's output as long as it contains enough information. (Not sure if muPDF attempts to determine cross-page reading order?) Or it could be a Grobid-specific format with one converter for the preferred tool (PDFBox if you like). I don't think you need to provide support for multiple tools out of the box - other people can provide that. You could avoid serializing it internally, so it shouldn't hurt performance. Clear boundaries could be a good thing. Such a boundary may already exist internally (I haven't read every single line of code).

I am perhaps approaching this a bit naively (I certainly was a bit naive when I initially thought 'how hard can it be to convert PDFs').

What would be your advice on how to address the paragraph / reading order detection issues? (A chat via something like Slack could be good.)

kermitt2 commented 7 years ago

Please send me your email address, I'll invite you to the Slack channel dedicated to GROBID development!

The important point is that the PDF library has an impact on the training data, because there is a need to align the PDF output with the annotations. I don't think it is possible to be independent from the PDF library and to ignore PDF parsing in Grobid - as far as I've explored the problem and the different PDF libraries.

For addressing reading order and other PDF input issues, I've worked on a fork of pdf2xml, which introduces, among other changes, partial support for reading order to fix these kinds of issues. See https://github.com/kermitt2/pdf2xml. However, integrating this fork will take some time, because the training data will need to be refreshed, as I mentioned above.

The simplification of the PDF is more or less equivalent to what pdf2xml is doing (xpdf also implements different PDF stream processing orders, including an estimation of the "natural" reading order, but it introduces other errors). I think there is a lot of room for improvement in PDF parsing libraries, and many errors in Grobid will only be fixed by addressing them in the PDF libraries, which is why I've started working on pdf2xml directly (it is actually very close to xpdf).
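
As a toy illustration of the kind of problem a reading order estimation has to solve (this is not pdf2xml's or xpdf's actual algorithm): even the naive two-column case requires bucketing blocks into columns before sorting vertically.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ReadingOrder {
    /** Hypothetical text block with its page position. */
    record Block(String text, double x, double y) {}

    /** Naive two-column reading order: left column top-to-bottom,
     *  then right column. Real layouts need far more than this. */
    static List<Block> order(List<Block> blocks, double pageWidth) {
        double mid = pageWidth / 2;
        Comparator<Block> byColumnThenY =
            Comparator.<Block>comparingInt(b -> b.x() < mid ? 0 : 1)
                      .thenComparingDouble(Block::y);
        return blocks.stream().sorted(byColumnThenY).collect(Collectors.toList());
    }
}
```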

I am also experimenting with text normalization via LSTM encoder/decoders for other kinds of problems coming directly from the PDF, like diacritics/recomposing characters and correct spacing (this is also dependent on the PDF library used).
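
For the simplest flavour of the diacritics problem (a base letter and a combining accent emitted as separate characters), plain Unicode NFC normalization already gives an idea of what "recomposing characters" means; the harder, library-dependent cases like spacing are what the learned approach would have to handle:

```java
import java.text.Normalizer;

public class Recompose {
    public static void main(String[] args) {
        // A PDF library may emit "é" as 'e' plus a combining acute accent.
        String decomposed = "r\u0065\u0301sum\u0065\u0301"; // "résumé", decomposed (8 chars)
        // Unicode NFC recomposes each base letter with its combining mark.
        String recomposed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(recomposed + " (" + recomposed.length() + " chars)"); // résumé (6 chars)
    }
}
```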

Just a final note: I am not using PDFBox for PDF parsing; it is not robust enough (for various reasons) and slower than xpdf or muPDF.